This link may be of interest. It is about reconstructing a Python file from its parse tree. Maybe a few changes to the code generator would do the job:
Reconstruct Python <https://lark-parser.readthedocs.io/en/latest/examples/advanced/reconstruct_python.html>

On Friday, December 10, 2021 at 9:29:46 AM UTC-5 [email protected] wrote:

> As I understand it, the Python tokenizer keeps two stacks of indents. In
> one, each tab is expanded to the full 8 spaces. In the other, a tab counts
> for one space. Both stacks have to agree on the indentation level at every
> stage.
>
> When I have done the same job in the past (except that I didn't need to
> tokenize or parse everything the way an importer has to), I determined the
> indentation level by counting the number of tabs and spaces without regard
> to order. That gives an unambiguous indent level without needing to depend
> on invisible details of the permutations and expansions of tabs and
> spaces. It worked well.
>
> Then on output of course the tabs could be replaced with four spaces. No
> problem there. I dislike assuming tabs are always four spaces in the
> input. It would be easy for someone to set their editor to emit, say,
> three spaces per tab to get slightly more compact lines. We don't know
> how often that would happen. And there could still be a few legacy files
> around that use all tabs. I have found them from time to time.
>
> On Friday, December 10, 2021 at 7:02:54 AM UTC-5 Edward K. Ream wrote:
>
>> This Engineering Notebook post will discuss the difficulties that *any*
>> python importer must face. To state my conclusions first:
>>
>> 1. Generating the correct whitespace before @others in *all* cases
>> requires:
>>
>> A: Some form of look-ahead, or equivalently, delayed code generation.
>> B: What amounts to a full *parse* of def and class lines.
>>
>> 2. I am willing to let the importer assume 4-space indentation for
>> @others in class nodes. In effect, this is what the legacy Py_Importer
>> class does!
>>
>> *Background*
>>
>> Vitalije's new importer has trouble importing
>> mypy/test-data/stdlib-samples/3.2/test/test_textwrap.py.
>> The file *is* imported perfectly, but many nodes are over-indented due
>> to missing indentation in `@others` directives in the class nodes.
>>
>> The relevant code in the mknode function is:
>>
>>     o = indent('@others\n', ind-l_ind)
>>     ...
>>     p.b = f'{b1}{o}{b2}'
>>
>> Alas, the value ind-l_ind won't work in all cases! Instead, I suggest
>> using the value 4 for all classes :-) That's exactly what the legacy
>> importer does!
>>
>> Yes, this would break the strangely-indented unit tests, but I'm willing
>> to live with that.
>>
>> *The heroic alternative*
>>
>> Generating the correct indentation for @others in *all* cases is much
>> more difficult. Indeed, the indentation of the @others line must be the
>> indentation of the *first significant line* following the class or def
>> line. The first significant line is the first line that is not:
>>
>> - A blank line or a comment.
>> - In a string.
>>
>> The legacy Py_Importer class detects such lines fairly easily. It is the
>> first non-blank, non-comment line for which Python_ScanState.in_context
>> returns False:
>>
>>     def in_context(self):
>>         """True if in a special context."""
>>         return (
>>             self.context or
>>             self.curlies > 0 or  # Open curly brackets.
>>             self.parens > 0 or  # Open parentheses.
>>             self.squares > 0 or  # Open square brackets.
>>             self.bs_nl  # In backslash/newline.
>>         )
>>
>> Ironically, having gone through all this trouble, my legacy importer
>> *still* assumes 4-space indentation! In theory, the importer *could* get
>> the indentation right. In practice, it's dashed difficult to do so!
>>
>> The split_root function (or its helpers) would *also* have to find the
>> first significant line of a class! In effect, the new importer would have
>> to do a full parse of the entire class or def line.
>>
>> *Summary*
>>
>> The python importer contains analogs of all the phases of an optimizing
>> compiler. The incoming code must be tokenized and maybe even parsed.
>> Code generation will never be easy.
>>
>> In class or def nodes, the leading whitespace of the @others directive
>> should be the leading whitespace of the first significant line of the
>> class or def. Finding the first significant line of a class or def
>> requires a full parse.
>>
>> Importers can avoid the parse phase only if they assume 4-space
>> indentation! I am willing to make this concession, and I am willing to
>> abandon (parts of) the unit tests for strangely-indented code.
>>
>> Edward
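For what it's worth, here is a rough sketch of finding the leading whitespace of the first significant line after a class or def line. It is not the importer's code: it swaps in Python's standard tokenize module in place of Python_ScanState, and the names body_indent and header_line are made up for illustration. It also assumes the source tokenizes cleanly.

```python
import io
import tokenize


def body_indent(source: str, header_line: int) -> str:
    """Return the leading whitespace of the first significant line that
    follows the class/def statement starting on header_line (1-based).

    Hypothetical helper, not part of Leo. Returns '' when the statement
    has no indented body (for example, a one-line 'class A: pass').
    """
    header_done = False
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.start[0] < header_line:
            continue  # Token begins before the class/def line.
        if not header_done:
            # NEWLINE ends the logical line, so headers that span
            # several physical lines inside brackets are handled.
            if tok.type == tokenize.NEWLINE:
                header_done = True
            continue
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue  # Blank lines and comment-only lines are not significant.
        # The tokenizer emits INDENT just before the first significant line.
        return tok.string if tok.type == tokenize.INDENT else ''
    return ''


sample = (
    "class A:\n"
    "    # A comment with misleading indentation.\n"
    "\n"
    "  def f(self):\n"
    "      return 1\n"
)
print(repr(body_indent(sample, 1)))  # Indentation to use for @others in class A.
print(repr(body_indent(sample, 4)))  # Indentation of the body of f.
```

The tokenizer already keeps track of strings, brackets and backslash continuations, so the sketch needs no in_context test of its own. The price is the look-ahead: nothing can be generated for the node until the tokenizer has reached the body.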

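The tab-and-space counting idea in the quoted 9:29 AM reply might look roughly like the sketch below. This is only one possible reading of it. The names indent_key, indent_levels and reindent are invented here, and the stack rule (a count pair not already on the stack opens a new level) is an assumption. The sketch also ignores strings and continuation lines, which a real importer could not do.

```python
def indent_key(line):
    """Return (tabs, spaces) in the leading whitespace, ignoring order."""
    body = line.lstrip(' \t')
    lead = line[:len(line) - len(body)]
    return lead.count('\t'), lead.count(' ')


def indent_levels(lines):
    """Return a list of (level, stripped_text) for each non-blank line.

    Assumption: a (tabs, spaces) pair that is not on the stack opens a
    new, deeper level; a pair already on the stack closes every level
    above it.
    """
    stack = [(0, 0)]
    result = []
    for line in lines:
        if not line.strip():
            continue  # Blank lines carry no indentation information.
        key = indent_key(line)
        if key in stack:
            del stack[stack.index(key) + 1:]  # Dedent to a known level.
        else:
            stack.append(key)  # Indent to a new level.
        result.append((len(stack) - 1, line.strip()))
    return result


def reindent(lines, unit='    '):
    """Rewrite each non-blank line with `unit` per indentation level."""
    return [unit * level + text for level, text in indent_levels(lines)]


mixed = [
    "class A:",
    "\tdef f(self):",
    "\t  return 1",
    "\tdef g(self):",
    "\t  return 2",
    "x = 1",
]
print("\n".join(reindent(mixed)))
```

Nothing here depends on how wide a tab is in the input, which is the point of the approach: the four-space choice is made only on output.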