Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

Tim Boudreau Fri, 30 Nov 2018 15:30:32 -0800

As I mentioned briefly a while back, I decided to do some patching of the
Antlr module on GitHub.  I'm hoping the outcome of this is both:

 1. Better Antlr support - in particular
   1a. add a bunch of missing features, like code formatting and semantic
highlighting
   1b. the ability to associate a file extension with a grammar and
actually edit files with syntax and error highlighting (a preview window
lets you tie colorings to specific token types and rules) -
https://timboudreau.com/files/screen/11-30-2018_18-17-53.png
   1c. Much improved cycle time between making an edit to your grammar and
seeing if you broke something - as in, instantaneous
 2. Some modules that make integrating languages based on Antlr grammars
really easy - there is a lot of identical adapter boilerplate everyone
needs for that, and some impedance mismatches everyone has to discover the
hard way, such as:
   2a. If your grammar doesn't consume EOF, it may not tokenize the entire
file, and that will make your NetBeans lexer explode horribly in the middle
of painting the main window, making the IDE unusable
   2b. Antlr EOF tokens may actually contain some text, which, if you don't
consume it, also detonates a bomb in the middle of painting
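
A minimal sketch of one way to guard against 2a and 2b, assuming the Antlr
tokens have already been copied into simple offset records. Span,
TokenCoverage and the UNMATCHED type are hypothetical names made up for
illustration, not part of Antlr or NetBeans. The idea is to make sure every
character belongs to exactly one non-empty token before anything is handed
to the editor, so an EOF token that carries text is kept as just another
span, and an untokenized tail gets a filler token:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token record: a type plus [start, end) offsets into the text. */
final class Span {
    final int type;
    final int start;
    final int end;

    Span(int type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class TokenCoverage {
    /** Hypothetical token type for text the grammar did not tokenize. */
    static final int UNMATCHED = -2;

    /**
     * Returns a list in which every character of the text belongs to exactly
     * one non-empty span. The input is assumed to be sorted by start offset.
     */
    static List<Span> cover(List<Span> tokens, int textLength) {
        List<Span> result = new ArrayList<>();
        int pos = 0;
        for (Span t : tokens) {
            int start = Math.max(t.start, pos);      // clip overlap with the previous span
            int end = Math.min(t.end, textLength);   // clip anything past the text
            if (end <= start) {
                continue; // drop empty or fully overlapped tokens
            }
            if (start > pos) {
                result.add(new Span(UNMATCHED, pos, start)); // fill a gap
            }
            result.add(new Span(t.type, start, end));
            pos = end;
        }
        if (pos < textLength) {
            result.add(new Span(UNMATCHED, pos, textLength)); // untokenized tail
        }
        return result;
    }
}
```

Whether a filler token or a hard failure is the better response to a gap
depends on the grammar; the sketch only shows the bookkeeping.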

In particular, doing the dynamic syntax highlighting means programmatically
registering new languages that appear, disappear, and have their set of
tokens change on the fly, persisting the mime-type:grammar mappings across
restarts, *and* doing something reasonable in the case that the grammar was
deleted or is in an unusable state.  Not to mention generating Java source
files with Antlr into an in-memory filesystem, running javac against all
that and invoking the result in an isolated classloader, extracting a
complete lex and parse, and feeding that into all of that language
machinery (believe it or not, on my laptop, all of that can run in 82
milliseconds for a fairly complex grammar - you really can run antlr
generation, compile, load and invoke a parser on every keystroke - though
it was a lot of work getting there).
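
For readers unfamiliar with the compile-and-load step, here is a minimal
sketch of it using only the JDK's javax.tools API. It is not the code from
the module: the class names (InMemoryCompile, StringSource, ByteCode) are
made up for illustration, the Antlr generation step is left out, and it
handles a single source string rather than a whole in-memory filesystem.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

final class InMemoryCompile {

    /** Source held as a string rather than a file on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Class file output captured in a byte array. */
    static final class ByteCode extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        ByteCode(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + ".class"), Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    /** Compiles one class in memory and loads it in a fresh classloader. */
    static Class<?> compileAndLoad(String className, String code) throws Exception {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler(); // null when no compiler is present
        if (javac == null) {
            throw new IllegalStateException("No system Java compiler available");
        }
        final Map<String, ByteCode> output = new HashMap<>();
        StandardJavaFileManager std = javac.getStandardFileManager(null, null, null);
        JavaFileManager mem = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                    String name, JavaFileObject.Kind kind, FileObject sibling) {
                ByteCode bc = new ByteCode(name); // keep class bytes in memory
                output.put(name, bc);
                return bc;
            }
        };
        boolean ok = javac.getTask(null, mem, null, null, null,
                Collections.singletonList(new StringSource(className, code))).call();
        mem.close();
        if (!ok) {
            throw new IllegalStateException("Compilation failed for " + className);
        }
        // A new loader per compile keeps each generation of classes isolated.
        ClassLoader isolated = new ClassLoader(InMemoryCompile.class.getClassLoader()) {
            @Override
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                ByteCode bc = output.get(name);
                if (bc == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] b = bc.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        return isolated.loadClass(className);
    }
}
```

Calling compileAndLoad("Hello", source) and then invoking a static method
reflectively is enough to try it out; dropping the returned Class and its
loader lets the old generation be garbage collected.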

NetBeans has some very nice infrastructure for declarative registration of
languages, most of which is not terribly useful for this. Fortunately,
registering a MimeDataProvider and a no-arg-constructor MIMEResolver solves
most of that.

However, getting the Lexer integration solid - i.e. a lexer that has to
work even with a completely hosed grammar driving it - seems to be my
Waterloo.  Part of the problem is that the initialization order is
necessarily backwards:  Ordinarily in NetBeans, you register a language,
the LanguageHierarchy knows the token types it has, the editor
infrastructure pokes at that at its leisure, and when it's ready, asks for
a lexer to chew on some text.  But in this case, the language hierarchy
doesn't know anything about the language until it has generated, compiled
and invoked it - i.e. *during lexer initialization* is the first chance the
language hierarchy has to actually get the set of token ids for the
language (to a degree I can hack around this with per-mime-type
ThreadLocal<String>, for cases where I can wrap the entry point that
triggers lexer invocation).

So, some issues I'm wondering if anyone has guidance on (is Mila Metelka
still around?  He would know this stuff inside and out):

 - LexerInput - for this case, I need to, on the first invocation of the
lexer's nextToken(), or in its constructor, extract the entire text to be
parsed, feed it through a generated Antlr lexer and build a list of tokens
- nextToken() will simply return them:
   - It is non-obvious from the code and Javadoc whether you should call
readText() without first making calls to read() to sequentially read
characters - it appears to work, most of the time (in which case, what is
calling read?) and is what examples generally do.
   - Sometimes you get a LexerInput which has already had some, but not
all, characters read from it.  It is not at all clear what causes that (or
whether rewinding is appropriate).
   - If you got text back from readText() during lexer initialization,
parsed it and generated a pre-cooked list of tokens to return from your
lexer, you still need to call read() to scan forward to the token you're
returning (even though if readText() returned the complete text, the
LexerInput presumably is already at EOF)
   - LexerInput behaves differently when invoked from inside the closure of
LanguageHierarchy.createLexer() versus from within a call to nextToken() -
sometimes it will return 0 length text (and be backed by a
TextLexerInputOperation that has null backing text, a readEndOffset of 0,
yet will return 65535, i.e. Character.MAX_VALUE, from calls to read()) when
created against a document that definitely does have contents - I suspect
some kind of race condition
 - WrapTokenIdCache - in the case that the set of tokens for the language
has changed (which can happen while the lexer is being constructed), it is
easy to get an AIOOBE because it is caching a set of token ids that is no
longer correct.  I do have the LanguageHierarchy fire PROP_LANGUAGE when
this is updated and throw away the existing LanguageHierarchy instance, but
that does not abort use of the lexer whose constructor is currently running
which was created by the old one (I can probably work around this by using
the stale token IDs, though it's likely to screw with highlighting for the
first reparse after every edit to the grammar).
    - In particular, this problem makes it impossible to return a "dummy
parse" of an open file during IDE startup with fake token ids for EOF and
"everything else", to avoid generating and compiling a ton of stuff before
the main window opens - whatever WrapTokenIdCache maps to, it seems to
persist past the lifetime of the LanguageHierarchy it was created for
(maybe mapped to mime type string?)
 - When you get passed an empty LexerInput and an exception is thrown as
the result of nextToken(), it appears that createLexer() is subsequently
invoked over and over, each time with a LexerInput which contains one more
character of the file
 - What is a lexer *actually* supposed to return for tokens when it is
passed zero length text?  Returning an empty token is an error.  So is
returning a zero length token. ???
 - Occasionally, completely nonsensical errors in parsing that I can't find
any explanation for, that don't directly involve my code - see stack traces
below
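
To make the "pre-cooked token list" pattern from the LexerInput bullets
concrete, here is a minimal sketch of it. CharInput is a hypothetical
stand-in that models only read(), backup() and readLength(); it is not the
NetBeans LexerInput and does not reproduce any of the odd states described
above. ReplayLexer and the tokenizer function are likewise made up for
illustration. The sketch assumes that returning null is the way to signal
"no more tokens", and it never returns an empty token:

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in modelling only read(), backup() and readLength(). */
final class CharInput {
    static final int EOF = -1;
    private final CharSequence text;
    private int tokenStart; // offset where the current token begins
    private int pos;        // offset of the next character to read

    CharInput(CharSequence text) {
        this.text = text;
    }

    int read() {
        return pos < text.length() ? text.charAt(pos++) : EOF;
    }

    void backup(int count) {
        pos = Math.max(tokenStart, pos - count);
    }

    int readLength() {
        return pos - tokenStart;
    }

    /** Ends the current token at the read position and returns its text. */
    String finishToken() {
        String s = text.subSequence(tokenStart, pos).toString();
        tokenStart = pos;
        return s;
    }
}

/** Replays token lengths computed up front by some other lexer. */
final class ReplayLexer {
    private final CharInput input;
    private final Function<String, List<Integer>> tokenizer;
    private List<Integer> lengths; // null until the first nextToken() call
    private int index;

    ReplayLexer(CharInput input, Function<String, List<Integer>> tokenizer) {
        this.input = input;
        this.tokenizer = tokenizer;
    }

    /** Returns the next token's text, or null when there are no more tokens. */
    String nextToken() {
        if (lengths == null) {
            // Pull the whole text once, then rewind to where we started.
            StringBuilder all = new StringBuilder();
            for (int c = input.read(); c != CharInput.EOF; c = input.read()) {
                all.append((char) c);
            }
            input.backup(input.readLength());
            lengths = tokenizer.apply(all.toString());
        }
        while (index < lengths.size()) {
            int len = lengths.get(index++);
            // Scan forward over the characters of the token being returned.
            for (int i = 0; i < len; i++) {
                if (input.read() == CharInput.EOF) {
                    break; // stale lengths: stop at the end of the text
                }
            }
            if (input.readLength() > 0) {
                return input.finishToken();
            }
            // Zero-length entry: skip it rather than return an empty token.
        }
        return null;
    }
}
```

With the text "ab cd" and lengths 2, 0, 1, 2 this yields "ab", " ", "cd"
and then null; with empty text it yields null on the first call.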

I have a horrible, hacky lexer implementation that accidentally works most
of the time.  Factoring the same code into something human-readable,
weirdly, turns it into something that explodes on use - and the only
difference seems to be timing and quantity of logging statements and
possibly a smidgen of call ordering.

Any suggestions or hints on better ways to do any of this are welcome.

The raunchy-but-works-most-of-the-time lexer implementation is here:
https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/features-tdb/1.2.1/ANTLR4PLGNB82/src/org/nemesis/antlr/v4/netbeans/v8/grammar/file/preview/AdhocLexer.java

(the original author, I think, must have converted a bunch of svn branch
folders to git by just committing them - the layout is kind of a mess, and
everything would be easier if I mavenized it)

-Tim


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception occurred during token hierarchy updating. Token hierarchy will be rebuilt from scratch.
java.lang.IndexOutOfBoundsException: startOffset=1073741823, endOffset=80, inputSourceText.length()=80
    at org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
    at org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
    at org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:254)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
    at org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
    at org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
    at org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)
    at org.netbeans.lib.lexer.inc.DocumentInput.insertUpdate(DocumentInput.java:117)


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception occurred during token hierarchy updating. Token hierarchy will be rebuilt from scratch.
java.lang.IndexOutOfBoundsException: startOffset=-2147483569, endOffset=79, inputSourceText.length()=79
    at org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
    at org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
    at org.netbeans.lib.lexer.inc.IncTokenList.replaceTokens(IncTokenList.java:354)
    at org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:258)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
    at org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
    at org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
    at org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
    at org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)



-- 
http://timboudreau.com
