On Tue, Oct 29, 2019 at 10:52 AM Алексей Алефиров <[email protected]>
wrote:

> Hi,
>
> For a project I'm working on, I need to get tokens from a source file along
> with their start and end locations. Using `JsonParser` and its methods
> `getTokenLocation` and `getCurrentLocation`, with their `getLineNr` and
> `getColumnNr`, seemed like a perfect solution. Unfortunately, it turned
> out that for a field name, `getCurrentLocation` is either the position
> where the field value token (the next token) ends or the position before
> the next newline, whichever comes first. I would expect it to be right
> after the closing double-quote.
>

Yes, this is due to an implementation detail: an optimization made at some
point (2.0 maybe) changed handling a bit so that in addition to FIELD_NAME
tokenization, the start of the following token is inspected. This leads to
the current location being a bit ahead of what might otherwise be expected.



>
> {
>  "fieldName" : "fieldValue"
> }
>
>
>
> Philosophical/Design question: is this okay?
>
> Practical question: for my purposes, is it appropriate to use my
> original solution (with `getTokenLocation` and `getCurrentLocation`) and,
> for FIELD_NAME, override it with `getTokenLocation.getColumnNr` +
> `getTextLength` + 2 (for the opening and closing double-quotes)?
>

Ok, so. Starting with the difference between "current" and "token" location:
the former is meant to be helpful for error messages, ideally indicating the
specific character in the input stream where something problematic was found
wrt tokenization. It may be in the middle of a token, so it typically won't
be super helpful for many automated use cases (like trying to outline a
token or make changes).
It is not designed or meant to give information on token boundaries: its
value will be affected by "lazy parsing" of tokens (for JSON Strings, for
example, the location will be right after the opening double-quote once
JsonToken.VALUE_STRING is returned, but will move if the actual contents are
requested).
So it cannot be used to indicate token boundaries reliably.

Token location, on the other hand, should point to the very first character
that is part of the token that was returned, excluding any preceding
white space and/or separators (and, in non-compliant modes, comments).
If this location is incorrect, that would be a bug, and a new issue should
be filed along with a reproduction.

The challenge in your case, then (assuming the token location is accurate),
would be finding the token end location. I think you are correct in that for
everything except JsonToken.FIELD_NAME, the current location will point to
the character right after the token, as long as the value has been accessed.
That is:

* For string values, one of the String accessors is called (getText() or end
offset)
* For numeric values, a matching accessor (or getNumber(), or even getText())
* All other tokens (start/end markers, null/true/false) are fully tokenized
right away.
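
To make the above concrete, here is a rough sketch (not tested against every
Jackson version) that records the start offset of each token and, for
everything except FIELD_NAME, the current location after the value has been
accessed. It assumes jackson-core 2.x on the classpath and String input, so
that getCharOffset() is meaningful; the class and method names (TokenSpans,
describe) are made up for illustration.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jackson.
final class TokenSpans {
    private TokenSpans() {}

    /** One entry per token: token type, start offset and (where derivable) end offset. */
    static List<String> describe(String json) throws IOException {
        List<String> out = new ArrayList<>();
        try (JsonParser p = new JsonFactory().createParser(json)) {
            JsonToken t;
            while ((t = p.nextToken()) != null) {
                // Token location: first character of the token itself.
                long start = p.getTokenLocation().getCharOffset();
                if (t == JsonToken.FIELD_NAME) {
                    // Current location may already be past the name here,
                    // so no end offset is derived from it.
                    out.add(t + " start=" + start + " end=?");
                    continue;
                }
                if (t.isScalarValue()) {
                    // Force lazily parsed contents to be fully read first.
                    p.getText();
                }
                long end = p.getCurrentLocation().getCharOffset();
                out.add(t + " start=" + start + " end=" + end);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String json = "{\n \"fieldName\" : \"fieldValue\"\n}";
        for (String line : describe(json)) {
            System.out.println(line);
        }
    }
}
```

The FIELD_NAME end is deliberately left open here, for the reasons below.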

The FIELD_NAME case is trickier, however. The length of the field name is
not sufficient since there may be escaping (backslash). If you have all the
content, you could backtrack from the start of the following (value) token,
although comments might be problematic. Or you could traverse from the token
start towards the end and find the closing double-quote (while observing
backslashes).
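
The forward traversal could look something like the sketch below. It is plain
Java with no Jackson calls, and it assumes the whole document is available as
a String, that the name is double-quoted (standard JSON), and that the start
offset is the char offset of the opening double-quote (for example from the
token location). The names FieldNameScanner and endOfQuotedName are
hypothetical.

```java
// Hypothetical helper for illustration; not part of Jackson.
final class FieldNameScanner {
    private FieldNameScanner() {}

    /**
     * Scans forward from the opening double-quote of a field name.
     *
     * @param json      raw JSON text
     * @param nameStart offset of the opening double-quote
     * @return offset just past the closing double-quote, or -1 if the
     *         name is not terminated
     */
    static int endOfQuotedName(CharSequence json, int nameStart) {
        if (nameStart < 0 || nameStart >= json.length()
                || json.charAt(nameStart) != '"') {
            throw new IllegalArgumentException(
                    "nameStart must point at an opening double-quote");
        }
        int i = nameStart + 1;
        while (i < json.length()) {
            char c = json.charAt(i);
            if (c == '\\') {
                i += 2; // skip the backslash and the character it escapes
            } else if (c == '"') {
                return i + 1; // first offset after the closing quote
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Raw text: { "a\"b" : 1 }  (the name contains an escaped quote)
        String json = "{ \"a\\\"b\" : 1 }";
        int start = json.indexOf('"');
        int end = endOfQuotedName(json, start);
        System.out.println(json.substring(start, end));
    }
}
```

In the example input the decoded name is 3 characters long but takes 6
characters in the raw text including quotes, which is why "text length + 2"
is not enough when escapes are present.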

-+ Tatu +-
