RE: Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

Eirik Bakke Fri, 30 Nov 2018 17:14:52 -0800

Just in case it's useful: A while back I wrote a very carefully implemented 
adapter from NetBeans' org.netbeans.spi.lexer.LexerInput class to ANTLR's 
org.antlr.v4.runtime.CharStream class:


https://gist.github.com/eirikbakke/51cf4c9375880acd4741/c83dd7e64b91674c6c2bf9d8473c7249a6d66ceb

The equivalent class in the repo you point to seems to be this one:

https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/master/1.2.1/ANTLR4PLGNB82/src/org/nemesis/antlr/v4/netbeans/v8/grammar/code/coloring/ANTLRv4CharStream.java

I remember jumping through various hoops to deal with EOF correctly... you 
could always try to replace the existing CharStream implementation with mine 
and see what changes.

-- Eirik

-----Original Message-----
From: Tim Boudreau <[email protected]> 
Sent: Friday, November 30, 2018 6:29 PM
To: [email protected]; [email protected]
Subject: Lexer and LexerInput woes w/ dynamically created LanguageHierarchy

As I mentioned briefly a while back, I decided to do some patching of the Antlr 
module on Github.  I'm hoping the outcome of this is both

 1. Better Antlr support - in particular
   1a. add a bunch of missing features, like code formatting and semantic 
highlighting
   1b. the ability to associate a file extension with a grammar and actually 
edit files with syntax and error highlighting (a preview window lets you tie 
colorings to specific token types and rules) - 
https://timboudreau.com/files/screen/11-30-2018_18-17-53.png
   1c. Much improved cycle time between making an edit to your grammar and 
seeing if you broke something - as in, instantaneous
 2. Some modules that make integrating languages based on Antlr grammars 
really easy - there is a lot of identical adapter boilerplate everyone needs 
for that, and some impedance mismatches everyone has to discover the hard way, 
such as:
   2a. If your grammar doesn't consume EOF, it may not tokenize the entire 
file, and that will make your netbeans lexer explode horribly in the middle of 
painting the main window, making the IDE unusable
   2b. Antlr EOF tokens may actually contain some text, which, if you don't 
consume it, also detonates a bomb in the middle of painting
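
A minimal sketch of one way to guard against 2a and 2b, assuming the Antlr 
tokens have already been reduced to start/end offsets. TokenCoverage, Span and 
the "_unlexed" type name are hypothetical names for illustration, not anything 
from the plugin or from Antlr. The idea is just to make sure the spans handed 
to the NetBeans side cover every character exactly once and are never empty:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCoverage {
    /** A token span: [start, end) offsets into the text, plus a type name. */
    public static final class Span {
        public final int start;
        public final int end;
        public final String type;

        public Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /**
     * Returns spans that cover [0, textLength) contiguously. Gaps between the
     * incoming spans, and any tail the lexer never reached, become "_unlexed"
     * spans. Zero-length spans and spans lying wholly inside already-covered
     * text are dropped; partially overlapping spans are trimmed.
     */
    public static List<Span> cover(List<Span> lexed, int textLength) {
        List<Span> out = new ArrayList<>();
        int pos = 0; // everything before pos is already covered
        for (Span s : lexed) {
            int start = Math.max(s.start, pos);
            int end = Math.min(s.end, textLength);
            if (end <= start) {
                continue; // empty, already covered, or past the end of the text
            }
            if (start > pos) {
                out.add(new Span(pos, start, "_unlexed")); // fill the gap
            }
            out.add(new Span(start, end, s.type));
            pos = end;
        }
        if (pos < textLength) {
            out.add(new Span(pos, textLength, "_unlexed")); // unlexed tail
        }
        return out;
    }
}
```

With this, an EOF token that carries text is kept as an ordinary span, and one 
that carries none is dropped instead of being passed on as a zero-length token.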

In particular, doing the dynamic syntax highlighting means programmatically 
registering new languages that appear, disappear, and have their set of tokens 
change on the fly, persisting the mime-type:grammar mappings across restarts, 
*and* doing something reasonable in the case that the grammar was deleted or is 
in an unusable state.  Not to mention generating Java source files with Antlr 
into an in-memory filesystem, running javac against all that and invoking the 
result in an isolated classloader, extracting a complete lex and parse, and 
feeding that into all of that language machinery (believe it or not, on my 
laptop, all of that can run in 82 milliseconds for a fairly complex grammar - 
you really can run antlr generation, compile, load and invoke a parser on every 
keystroke - though it was a lot of work getting there).
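
For readers who have not seen the compile-in-memory trick, here is a minimal 
sketch using only the JDK's javax.tools API. It is not the plugin's code: 
InMemoryCompiler and its nested classes are hypothetical names, it handles a 
single source file, and it leaves out the Antlr generation step entirely. It 
compiles a source string, keeps the class bytes in memory, and loads them 
through a throwaway classloader:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

public class InMemoryCompiler {
    /** A source file whose contents live in a String. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** A class file whose bytes are captured in memory instead of on disk. */
    static final class ClassBytes extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        ClassBytes(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + ".class"), Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    /** Compiles one class from source text and loads it in a fresh classloader. */
    public static Class<?> compileAndLoad(String className, String source) throws Exception {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system compiler available; a JDK is required");
        }
        final Map<String, ClassBytes> outputs = new HashMap<>();
        StandardJavaFileManager std = javac.getStandardFileManager(null, null, null);
        JavaFileManager fm = new ForwardingJavaFileManager<StandardJavaFileManager>(std) {
            @Override
            public JavaFileObject getJavaFileForOutput(Location location, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                ClassBytes out = new ClassBytes(name); // redirect javac's output to memory
                outputs.put(name, out);
                return out;
            }
        };
        boolean ok = javac.getTask(null, fm, null, null, null,
                Collections.singletonList(new StringSource(className, source))).call();
        fm.close();
        if (!ok) {
            throw new IllegalArgumentException("Compilation failed for " + className);
        }
        // Generated classes are defined only in this loader, so dropping the
        // loader (and every reference to its classes) lets them be collected.
        ClassLoader loader = new ClassLoader(InMemoryCompiler.class.getClassLoader()) {
            @Override
            protected Class<?> findClass(String name) throws ClassNotFoundException {
                ClassBytes cb = outputs.get(name);
                if (cb == null) {
                    throw new ClassNotFoundException(name);
                }
                byte[] b = cb.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        return loader.loadClass(className);
    }
}
```

The result can then be driven reflectively, e.g. 
compileAndLoad("Hello", src).getMethod("answer").invoke(null) for a class with 
a public static answer() method.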

NetBeans has some very nice infrastructure for declarative registration of 
languages, most of which is not terribly useful for this. Fortunately, 
registering a MimeDataProvider and a no-arg-constructor MIMEResolver solves 
most of that.

However, getting the Lexer integration solid - i.e. a lexer that has to work 
even with a completely hosed grammar driving it - seems to be my Waterloo.  
Part of the problem is that the initialization order is necessarily backwards:  
Ordinarily in NetBeans, you register a language, the LanguageHierarchy knows 
the token types it has, the editor infrastructure pokes at that at its leisure, 
and when it's ready, asks for a lexer to chew on some text.  But in this case, 
the language hierarchy doesn't know anything about the language until it has 
generated, compiled and invoked it - i.e. *during lexer initialization* is the 
first chance the language hierarchy has to actually get the set of token ids 
for the language (to a degree I can hack around this with per-mime-type 
ThreadLocal<String>, for cases where I can wrap the entry point that triggers 
lexer invocation).

So, some issues I'm wondering if anyone has guidance on (is Mila Metelka still 
around?  He would know this stuff inside and out):

 - LexerInput - for this case, I need to, on the first invocation of the 
lexer's nextToken(), or in its constructor, extract the entire text to be 
parsed, feed it through a generated Antlr lexer and build a list of tokens - 
nextToken() will simply return them:
   - It is non-obvious from the code and Javadoc whether you should call 
readText() without first making calls to read() to sequentially read tokens - 
it appears to work, most of the time (in which case, what is calling read?) 
and is what examples generally do.
   - Sometimes you get a LexerInput which has already had some, but not all, 
characters read from it.  It is not at all clear what causes that (or whether 
rewinding is appropriate).
   - If you got text back from readText() during lexer initialization, parsed 
it and generated a pre-cooked list of tokens to return from your lexer, you 
still need to call read() to scan forward to the token you're returning (even 
though if readText() returned the complete text, the LexerInput presumably is 
already at EOF)
   - LexerInput behaves differently when invoked from inside the closure of
LanguageHierarchy.createLexer() versus from within a call to nextToken() - 
sometimes it will return 0 length text (and be backed by a 
TextLexerInputOperation that has null backing text, a readEndOffset of 0, yet 
will return 65535 - Character.MAX_VALUE from calls to read()) when created 
against a document that definitely does have contents - I suspect some kind of 
race condition
 - WrapTokenIdCache - in the case that the set of tokens for the language has 
changed (which can happen while the lexer is being constructed), it is easy to 
get an AIOOBE because it is caching a set of token ids that is no longer 
correct.  I do have the LanguageHierarchy fire PROP_LANGUAGE when this is 
updated and throw away the existing LanguageHierarchy instance, but that does 
not abort use of the lexer whose constructor is currently running which was 
created by the old one (I can probably work around this by using the stale 
token IDs, though it's likely to screw with highlighting for the first reparse 
after every edit to the grammar).
    - In particular, this problem makes it impossible to return a "dummy parse" 
of an open file during IDE startup with fake token ids for EOF and "everything 
else", to avoid generating and compiling a ton of stuff before the main window 
opens - whatever WrapTokenIdCache maps to, it seems to persist past the 
lifetime of the LanguageHierarchy it was created for (maybe mapped to mime type 
string?)
 - When you get passed an empty LexerInput and an exception is thrown as the 
result of nextToken(), it appears that createLexer() is subsequently invoked 
over and over, each time with a LexerInput which contains one more character of 
the file
 - What is a lexer *actually* supposed to return for tokens when it is passed 
zero length text?  Returning an empty token is an error.  So is returning a 
zero length token. ???
 - Occasionally, completely nonsensical errors in parsing that I can't find any 
explanation for, that don't directly involve my code - see stack traces below
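
To make the "pre-cooked token list" idea from the LexerInput items concrete, 
here is a minimal sketch. StringInput is a hypothetical stand-in, not the real 
org.netbeans.spi.lexer.LexerInput: it only models an int-returning read() that 
yields -1 at the end, a readLength(), and a takeToken() standing in for token 
creation consuming whatever has been read. The replay loop reads exactly as 
many characters as each precomputed token is long, never emits a zero-length 
token, and sweeps up anything the token list missed:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplayLexerSketch {
    /** Hypothetical stand-in for a lexer input; NOT the NetBeans class. */
    public static final class StringInput {
        public static final int EOF = -1;
        private final String text;
        private int pos;        // next character to read
        private int tokenStart; // where the current token began

        public StringInput(String text) {
            this.text = text;
        }

        /** Returns the next character as an int, or EOF at the end. */
        public int read() {
            return pos < text.length() ? text.charAt(pos++) : EOF;
        }

        /** Number of characters read since the last token was taken. */
        public int readLength() {
            return pos - tokenStart;
        }

        /** Stands in for token creation, which consumes the characters read. */
        public String takeToken() {
            String t = text.substring(tokenStart, pos);
            tokenStart = pos;
            return t;
        }
    }

    /** Replays precomputed token lengths against the input. */
    public static List<String> replay(StringInput in, int[] lengths) {
        List<String> out = new ArrayList<>();
        for (int len : lengths) {
            for (int i = 0; i < len; i++) {
                // Compare as int: casting first turns -1 into 65535.
                if (in.read() == StringInput.EOF) {
                    throw new IllegalStateException("token list is longer than the input");
                }
            }
            if (in.readLength() > 0) {
                out.add(in.takeToken()); // skip zero-length entries entirely
            }
        }
        while (in.read() != StringInput.EOF) {
            // drain whatever the precomputed tokens did not cover
        }
        if (in.readLength() > 0) {
            out.add(in.takeToken());
        }
        return out; // empty for empty input: no token at all, not an empty one
    }
}
```

Two hedged notes: (char) -1 is 65535, i.e. Character.MAX_VALUE, so it may be 
worth ruling out a cast somewhere before the EOF comparison as the source of 
that value; and for zero-length text the sketch produces no tokens, which, as 
far as I can tell from the Lexer.nextToken() Javadoc, corresponds to returning 
null.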

I have a horrible, hacky lexer implementation that accidentally works most of 
the time.  Factoring the same code into something human-readable, weirdly, 
turns it into something that explodes on use - and the only difference seems to 
be timing and quantity of logging statements and possibly a smidgen of call 
ordering.

Any suggestions or hints on better ways to do any of this are welcome.

The raunchy-but-works-most-of-the-time lexer implementation is here:
https://github.com/timboudreau/ANTLR4-Plugins-for-NetBeans/blob/features-tdb/1.2.1/ANTLR4PLGNB82/src/org/nemesis/antlr/v4/netbeans/v8/grammar/file/preview/AdhocLexer.java

(the original author, I think, must have converted a bunch of svn branch 
folders to git by just committing them - the layout is kind of a mess, and 
everything would be easier if I mavenized it)

-Tim


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception 
occurred during token hierarchy updating. Token hierarchy will be rebuilt from 
scratch.
java.lang.IndexOutOfBoundsException: startOffset=1073741823, endOffset=80,
inputSourceText.length()=80
at
org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
at
org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
at
org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:254)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
at
org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
at
org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
at
org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)
at
org.netbeans.lib.lexer.inc.DocumentInput.insertUpdate(DocumentInput.java:117)


INFO [org.netbeans.lib.lexer.TokenHierarchyOperation]: Runtime exception 
occurred during token hierarchy updating. Token hierarchy will be rebuilt from 
scratch.
java.lang.IndexOutOfBoundsException: startOffset=-2147483569, endOffset=79,
inputSourceText.length()=79
at
org.netbeans.lib.lexer.TextLexerInputOperation.<init>(TextLexerInputOperation.java:58)
at
org.netbeans.lib.lexer.inc.IncTokenList.createLexerInputOperation(IncTokenList.java:276)
at
org.netbeans.lib.lexer.inc.IncTokenList.replaceTokens(IncTokenList.java:354)
at
org.netbeans.lib.lexer.inc.TokenListUpdater.updateRegular(TokenListUpdater.java:258)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate$UpdateItem.update(TokenHierarchyUpdate.java:325)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.processLevelInfos(TokenHierarchyUpdate.java:200)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.updateImpl(TokenHierarchyUpdate.java:171)
at
org.netbeans.lib.lexer.inc.TokenHierarchyUpdate.update(TokenHierarchyUpdate.java:109)
at
org.netbeans.lib.lexer.TokenHierarchyOperation.textModified(TokenHierarchyOperation.java:585)
at
org.netbeans.spi.lexer.TokenHierarchyControl.textModified(TokenHierarchyControl.java:71)
at
org.netbeans.lib.lexer.inc.DocumentInput.textModified(DocumentInput.java:128)

-Tim


--
http://timboudreau.com
