Re: LexerInput.read() returns null characters after unicode characters.

Tim Boudreau Fri, 28 Dec 2018 16:07:58 -0800

On Thu, Dec 27, 2018 at 5:18 PM [email protected] <
[email protected]> wrote:


> First of all, thank you for the detailed answer.
>
> > LexerInput returns a primitive int. It cannot return null.
>
> Yes sir, so I get an integer value of zero. EOF, as you know, would be -1. I
> believe it is returning a null character '\0'.


Most likely if you are seeing a 0, it is because there is a 0.

Any chance you're reading UTF-32 as UTF-16, or something like that? That
would get you unexpected zeros that are actually the other half of a
partially read character.
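
A minimal sketch of how that can happen, using only String and Charset from
the JDK. The class and method names are made up for illustration, and it
assumes your JDK ships the UTF-32BE charset (OpenJDK builds do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encode text with one charset, then decode the same bytes with another.
    static String recode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Count the NUL characters a lexer would see in the decoded text.
    static int countNuls(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\0') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Assumption: UTF-32BE is available in this JDK.
        Charset utf32 = Charset.forName("UTF-32BE");
        // 'a' and e-acute are each 4 bytes in UTF-32BE: 00 00 00 61, 00 00 00 E9.
        String seen = recode("a\u00e9", utf32, StandardCharsets.UTF_16BE);
        // Read as UTF-16BE, each 4-byte unit becomes a NUL plus the real character.
        System.out.println("chars: " + seen.length() + ", NULs: " + countNuls(seen));
    }
}
```

Two characters go in, four come out, and every other one is a 0.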


> > If you are using ANTLR, does your grammar read the entire file? You need a
> > rule that includes EOF explicitly, or it is easy to have a grammar which
> > looks like it works most of the time, but for some files will hand you an
> > EOF token without giving you tokens for the entire file - it does what you
> > tell it to, so if you didn't tell it that the content to parse ends only
> > when the end of the file is encountered, then once it has satisfied the
> > rules you gave it, it is "done" as far as it is concerned.
>
> This is a hand-written lexer. I humbly submit that ANTLR is beyond my
> comprehension. I don't think it is even ANTLR; I think it may be the
> prospect of having to deal with generated code. That same reason has kept
> me away from CoffeeScript, Angular and a few others. I caught this issue
> while writing unit tests for the lexer. Seeing that the coverage is at 80%
> at present, I should say I haven't encountered any unpredictable EOFs so
> far. Since I do the integer comparison manually using ==, it is hard to
> miss EOF characters.


In practice, the generated ANTLR code is pretty easy to deal with, but I've
hand-written lexers too. I wouldn't compare it with CoffeeScript and
similar, where your entire program is made pretty opaque - with ANTLR you
just get some straightforward AST classes and visitor interfaces.


> > So, when in that state, read the remaining characters (if any) into a
> > StringBuilder, log them to stdout, see what they are and modify your
> > grammar or whatever does the lexing to ensure they really get processed.
>
> I will definitely try this. My suspicion is there are some invisible
> characters I am not seeing. Maybe printing them to the console will help.
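
Something along these lines would make invisible characters show up. It is
only a sketch: java.io.Reader stands in for LexerInput here (both return -1
at EOF from read()), and the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class RemainingInputDump {

    // Read everything left in the input into a StringBuilder, rendering
    // anything outside printable ASCII as a four-digit hex escape.
    static String drainVisible(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) { // -1 is EOF; 0 is a real character
                if (c < 0x20 || c > 0x7e) {
                    sb.append(String.format("\\u%04x", c));
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A NUL and an accented character hidden between ordinary letters.
        System.out.println(drainVisible(new StringReader("ab\0\u00e9c")));
    }
}
```

With a real LexerInput the loop would compare against LexerInput.EOF
instead of the literal -1.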


Check what character encoding the bytes are being read with. If you're not
specifying it, you get whatever the system default is, which is always
wrong sometimes. If the lexer input is really handing you zeros, that's
probably the culprit.
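
As a rough illustration of naming the charset instead of inheriting the
default (class and method names are invented for the example, and the temp
file is just a stand-in for your source file):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {

    // Write text to a temporary file using the given charset.
    static Path writeTemp(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("lexer-input", ".txt");
            Files.write(tmp, text.getBytes(charset));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decode a file with a charset the caller names, never the platform default.
    static String readAll(Path file, Charset charset) {
        try {
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path tmp = writeTemp("caf\u00e9", StandardCharsets.UTF_8);
        System.out.println("default here: " + Charset.defaultCharset());
        // Right charset: 4 characters. Wrong charset: the 2-byte e-acute
        // turns into 2 separate characters, so 5.
        System.out.println(readAll(tmp, StandardCharsets.UTF_8).length());
        System.out.println(readAll(tmp, StandardCharsets.ISO_8859_1).length());
    }
}
```

A single-byte mismatch like this one gives you garbage characters rather
than zeros; the zeros tend to show up when one of the two encodings is a
wide one such as UTF-16 or UTF-32.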

-Tim





--
http://timboudreau.com
