After reading this, a few things came to mind that I hadn't thought about 
before.  The big one is what the importer should do when it finds incorrect 
Python code, or at least incorrect whitespace.  Should it correct it, or at 
least try to fix up the whitespace?  Refuse to complete the import?  One 
option might be to treat it as if the user had typed in the text: refuse to 
write back to the external file until all the errors get fixed, but save 
the outline when asked.  

Another point is what the importer should do about mixed leading 
indentation - tabs and spaces together.  Should it convert tabs to spaces?  
Presumably it should because that's what would happen when a user tried to 
type in the same text.  I don't know what the current importer does here.

The few times in the past that I've written little Python importers, I have 
tried different tactics.  The most general way was to handle a mix of tabs 
and spaces by using the count of space and tab characters even if their 
order changed. That would amount to changing the tabs to single spaces, for 
the purposes of identifying new indentation - I used four spaces per tab 
for output.  I used the first line that contained a new indent as the 
template for that indentation level.  It worked pretty well most of the 
time and wasn't too hard to code.
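
Roughly, the counting approach looks like the sketch below.  This is a 
reconstruction from memory, not code from any importer; the names 
leading_width and normalize_indentation are made up for the example, and it 
assumes lines without trailing newlines and ignores continuation lines and 
multi-line strings.

```python
def leading_width(line):
    """Count leading spaces and tabs; each character counts as one column."""
    return len(line) - len(line.lstrip(' \t'))


def normalize_indentation(lines, out_indent='    '):
    """Re-indent lines with out_indent per level (assumed: four spaces)."""
    widths = [0]  # widths[i] is the leading width first seen at level i
    result = []
    for line in lines:
        text = line.strip()
        if not text:
            result.append('')  # blank lines do not affect indentation
            continue
        w = leading_width(line)
        if w > widths[-1]:
            # The first line with a deeper indent is the template for a new level.
            widths.append(w)
        else:
            # Dedent: drop levels deeper than this line.  A width that matches
            # no earlier level is leniently treated as the nearest shallower one.
            while len(widths) > 1 and widths[-1] > w:
                widths.pop()
        result.append(out_indent * (len(widths) - 1) + text)
    return result


sample = ["if x:", "\tif y:", "\t z = 1", " \ty = 2", "w = 3"]
print("\n".join(normalize_indentation(sample)))
```

Note that "\t " and " \t" both count as width 2, so they land on the same 
level even though the order of the characters differs.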

The easier but less general way was to just replace all tabs with four 
spaces.  If the original file's editor used a different number of spaces 
for a tab, it might not have worked so well, although one could build in a 
little slop, so that if an indentation were say one space over or under it 
would be accepted (and fixed).
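
A minimal sketch of that simpler tactic, with the slop built in.  The name 
snap_indent and the defaults (four spaces per tab, one column of slop) are 
just assumptions for illustration.

```python
def snap_indent(line, tab_width=4, slop=1):
    """Replace leading tabs with spaces, then fix near-miss indentation."""
    body = line.lstrip(' \t')
    lead = line[:len(line) - len(body)].replace('\t', ' ' * tab_width)
    width = len(lead)
    # Nearest multiple of tab_width, using integer arithmetic only.
    nearest = ((width + tab_width // 2) // tab_width) * tab_width
    if abs(width - nearest) <= slop:
        width = nearest  # accept (and fix) an indent that is off by <= slop
    return ' ' * width + body


for raw in ["\tx = 1", "     y = 2", "   z = 3", "  q = 4"]:
    print(repr(snap_indent(raw)))
```

An indent that is further off than the slop (the two-space line above) is 
left alone, so a later check can still flag it.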

This reminds me that I have never looked into exactly how Python figures 
out the whitespace.  Might be interesting.
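
One easy way to peek at it: the standard library's tokenize module reports 
indentation changes as INDENT and DEDENT tokens.  A small sketch 
(indent_events is a name made up for this example) that lists them for a 
source string:

```python
import io
import tokenize


def indent_events(source):
    """Return (kind, whitespace) pairs for each INDENT/DEDENT token."""
    events = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.INDENT:
            # For INDENT, tok.string is the line's leading whitespace.
            events.append(('INDENT', tok.string))
        elif tok.type == tokenize.DEDENT:
            events.append(('DEDENT', ''))
    return events


src = "if 1:\n print('a')\nif 2:\n   print('b')\n"
for kind, ws in indent_events(src):
    print(kind, repr(ws))
```

Each block may pick its own indent width; the tokenizer only cares that a 
dedent returns to a width already on its stack.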

On Tuesday, November 23, 2021 at 6:52:18 AM UTC-5 Edward K. Ream wrote:

> According to PR #2331 <https://github.com/leo-editor/leo-editor/pull/2331>, 
> I started work on the new python importer 9 days ago.  This Engineering 
> Notebook post will discuss what I have done and the remaining difficulties.
>
>
> *vnode_info dictionary*
>
> All importers now use a *vnode_info* dict instead of injecting the 
> *_import_lines* ivar into vnodes.  Keys are vnodes; values are 
> *inner dictionaries*.
>
> The inner dictionary contains at least one key/value pair:
>
>     "lines": <list of lines for the vnode>.
>
> VNodes use slots 
> <https://docs.python.org/3/reference/datamodel.html#slots>, so the 
> vnode_info dict *slightly* reduces the descriptor memory required in all 
> vnodes. More importantly, the vnode_info dict allows the python importer to 
> contain other key/value pairs.
>
> *Stackless python importer*
>
> Previously, all importers, including the python importer, used a stack 
> that mirrored the structure of the imported nodes that the importers 
> created.  Keeping the stack in sync with created nodes is tricky. Aha! 
> Maybe the stack isn't needed! The vnode_info dict may suffice.  The python 
> importer uses an inner dict with these keys:
>
> {
>      '@others': <True: lines contains @others>,
>      'indent': <The node's indentation, see below>,
>      'kind': <one of 'outer', 'org', 'class', 'def'>,
>      'lines': < list of lines for the vnode>,
> }
>
> Instead of getting these values from the stack, the importer will get 
> these values from the generated nodes.  For example, in the main importer 
> loop the *p* var points at the node being generated. So 
> info_dict[p.parent().v] contains the data for p's parent and 
> info_dict[p.back().v] contains the data for p's previous sibling, if any.
>
> I *think* this new organization will work, but there are no guarantees. 
> If necessary, I'll revert to the old stack-based architecture, with all of 
> its complexities.
>
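
To make the stackless idea concrete for myself, here is a toy sketch.  The 
Node class and new_node helper are hypothetical stand-ins, not Leo's vnode 
or position classes; only the inner-dict keys come from the post.

```python
class Node:
    """Hypothetical stand-in for a vnode; not Leo's real class."""
    __slots__ = ('headline', 'parent', 'children')

    def __init__(self, headline, parent=None):
        self.headline = headline
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def back(self):
        """Return the previous sibling, or None if there is none."""
        if self.parent is None:
            return None
        i = self.parent.children.index(self)
        return self.parent.children[i - 1] if i > 0 else None


vnode_info = {}  # keys are nodes; values are inner dicts


def new_node(headline, parent, kind, indent):
    """Create a node and register its inner dict (keys as in the post)."""
    node = Node(headline, parent)
    vnode_info[node] = {
        '@others': False, 'indent': indent, 'kind': kind, 'lines': [],
    }
    return node


root = new_node('outer', None, 'outer', 0)
cls = new_node('class A', root, 'class', 0)
m1 = new_node('def f', cls, 'def', 4)
m2 = new_node('def g', cls, 'def', 4)

# No stack needed: look up the parent's and previous sibling's data directly.
print(vnode_info[m2.parent]['kind'], vnode_info[m2.back()]['indent'])
```

Because the nodes use __slots__, they carry no per-node data beyond the 
tree links; everything the importer needs lives in the side dict.
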
>
> *The python importer is inherently complex*
>
> Aha! The python importer is intrinsically at least as complex as the 
> javascript importer, and perhaps more so! This complexity has been quite a 
> shock!
>
> How can this be? Doesn't python impose strict standards for indentation 
> and structure?
>
> *Strangely indented lines*
>
> Alas, the answer is "yes and no." :-)  *Most* of the time python classes, 
> methods, and functions follow a simple format.  But not always!  For 
> example, the following is a valid python program! Try it! 
>
> if 1:
>  print('indent 1')
> if 2:
>   print('indent 2')
> if 3:
>    print('indent 3')
> if 4:
>     print('indent 4')
> if 5:
>      print('indent 5')
>
> Who would do such a thing, you ask?  Well, mypy unit tests, for one. Those 
> unit tests contain other strange (valid!) constructions.
>
> Furthermore, one could replace the "print" statements above with "class" 
> or "def" statements, and one could imagine similar strange "if" statements 
> *within* the range of a class definition!
>
> *Important*: strangely-indented lines can only happen within the range of 
> compound statements such as "if", "for", "while", and "with", etc.  But 
> "class" and "def" statements are also compound statements in this sense!  
> It's quite a mess. 
>
> *Keeping track of indentation*
>
> In short, the python importer can not assume *anything* about what 
> indentation may be in effect in the range of a class definition!
>
> As noted above, the python importer assigns a *vnode kind* for each 
> generated vnode. The valid (string) values are outer, org, class, and 
> def. Hmm. As I write this, perhaps the importer should use "method" and 
> "function" kinds instead of the generic "def" kind.
>
> The "org" kind should allow the python importer to handle 
> strangely-indented lines. Indeed, python does not allow *complete* chaos! 
> For example, the following is a syntax error:
>
> class Class1:
>     def method1():  # 4-space indentation
>         pass  # 8-space indentation.
>       def method2():  # 6-space indentation.
>           pass
>
> Python gives this error:
>
>     def method2():  # 6-space indentation.
>                                           ^
> IndentationError: unindent does not match any outer indentation level
>
> That is, the first statement in the range of the class determines the 
> *allowed indentation* for all other statements of the class, including compound 
> statements.  Presumably, the 'indent' value for "class" nodes will be the 
> allowed indentation, but perhaps the vnode_info dict should contain *two* 
> indent-related keys.  See below.
>
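
An importer could ask Python itself whether the indentation is legal before 
generating any nodes.  A minimal sketch using the built-in compile; the 
function name check_indentation is made up, and whether the importer should 
do this at all is an open question.

```python
def check_indentation(source):
    """Return (ok, message); ok is False if Python rejects the indentation."""
    try:
        compile(source, '<import-check>', 'exec')
    except IndentationError as e:  # also catches TabError, a subclass
        return False, e.msg
    return True, ''


bad = (
    "class Class1:\n"
    "    def method1():\n"
    "        pass\n"
    "      def method2():\n"
    "          pass\n"
)
strange_but_valid = (
    "if 1:\n"
    " print('indent 1')\n"
    "if 2:\n"
    "  print('indent 2')\n"
)
print(check_indentation(bad))
print(check_indentation(strange_but_valid))
```

Other syntax errors are deliberately not caught here, so they would still 
propagate to the caller.
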
> *Underindented lines*
>
> A further complication involves so-called underindented lines, that is, 
> lines that Leo can not represent properly using the natural node 
> structure.  Leo uses an ugly *escape convention* to represent such 
> lines.  Most Leonistas probably have never seen the escape convention, but 
> Leo does support it.
>
> At present, the python importer's perfect-import check allows leading 
> whitespace to be added to otherwise underindented *comment* lines (only). 
> Imo, adding this extra whitespace is preferable to using the underindented 
> convention, but I might change my mind.
>
> *Removing common leading whitespace*
>
> *Importer.undent* removes leading whitespace from generated nodes.  
> i.undent calculates the *greatest* leading whitespace in the entire node 
> and removes this whitespace from *all* lines of the node, inserting the 
> underindented escape sequence as necessary!
>
> The python importer will likely override i.undent (*python_i.undent*) so 
> as to never insert the underindented escape sequence. Perhaps 
> textwrap.dedent *can* be used, but that assumes that all 
> strangely-indented nodes are under the range of an `@others` directive that 
> is indented by exactly the amount that textwrap.dedent will (eventually) 
> remove!
>
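
A quick check of what textwrap.dedent does when a node contains an 
underindented line.  The sample body is made up for illustration.

```python
import textwrap

# A node body whose last line is indented less than the code above it.
body = (
    "    def f():\n"
    "        pass\n"
    "  # underindented comment\n"
)

# dedent strips only the whitespace common to every non-blank line,
# so the two-space comment line caps the removal at two spaces.
dedented = textwrap.dedent(body)
print(dedented)
```

So dedent never needs an escape sequence, but it also never removes more 
than the least-indented line allows, which is the constraint on the 
`@others` indentation mentioned above.
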
> So there are a lot of constraints involved in generating nodes!
>
> *Aha! The post pass can use the vnode_info dict*
>
> As I write this, I see that the vnode_info dict has another advantage over 
> the stack-based architecture. The vnode_info dict is available to (the 
> possibly overridden) undent method. Perhaps the vnode_info dict might have 
> two indentation-related keys. We shall see.
>
> *Summary*
>
> Surprisingly, the python importer is inherently the most complex importer 
> of all.
>
> Organizer nodes will allow the importer to handle even the most bizarre 
> strangely-indented nodes.  However, generating the necessary organizer nodes 
> has stumped me for several days. The task is far from easy.
>
> The base Importer class defines the architecture of all importers. There 
> is no need to improve this architecture! In particular, the line-by-line 
> nature of the gen_lines method ensures that all importers, including the 
> python importer, will be close to as fast as possible. There is no need to 
> worry about the speed of the python importer!
>
> To sum up: the task is to ensure the perfect import of *all valid python 
> programs*, regardless of indentation quirks.
>
> Edward
>
> P.S. As I write this I see that the underindented escape convention seems 
> not to be documented.  Searching for "underindentEscapeString" in leoPy.leo 
> will show the relevant code.
>
> EKR
>
