Re: [jackson-user] decoding JSON payloads with pathological text encodings

Tatu Saloranta Thu, 26 Apr 2018 15:16:43 -0700

On Wed, Apr 25, 2018 at 8:50 PM,  <[email protected]> wrote:
> On Wednesday, April 25, 2018 at 7:43:44 PM UTC-7, Tatu Saloranta wrote:
>>
>> At the point where deserializers handle things, decoding has already
>> been done, and
>> information potentially lost and/or corrupted. But if we go down to a
>> lower level, the decoder (`JsonParser`)
>> is responsible for tokenization, and is in a better position.
>>
>> I would probably approach this from the perspective of using another
>> library to detect encoding
>> and construct `InputStreamReader` for that encoding (library may offer
>> that integration out of the box too),
>> and then use resulting reader for creating parser:
>>
>>    JsonParser p = jsonFactory.createParser(reader);
>>
>> which may then be given as input source to `ObjectMapper` (or
>> `ObjectReader`).
>>
>> Jackson does not really have to know about potential complexity of
>> detecting encoding, and
>> attempting to fix possible Unicode errors.
>>
>> -+ Tatu +-
>
>
> Yes, this would certainly be a preferable solution, if I actually always
> knew what encoding to use for the entire JSON document, but sadly it can
> vary per-String-valued-field. This means that, within a single document,
> there is a possibility that every String could have some different encoding.
>
> So, instead of trying to guess the encoding on the entire raw JSON, I need
> to hook in and try to guess the encoding on each String-valued field when
> constructing the String value for the field itself.
>
> So I am trying to understand, what is the right place to intercept the
> creation of the String for every String-valued field? Then I can call the
> encoding guesser, and construct the String or CharSequence for the
> String-valued field myself, where I can do some tricks to un-mangle the
> bytes.


Deserializers ask `JsonParser` for the text, either via `getText()` or one of
its variants (such as `nextTextValue()`).
Decoding is handled by the parser, possibly eagerly (when `nextToken()` is
called), possibly lazily (implementation dependent).
Converting from a byte stream to tokens is what the parser does, and not
something deserializers have
direct effect on.
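
To illustrate why the decoded String is too late a place to repair things,
here is a small sketch using only the JDK (no Jackson involved). The class
name `DecodingLoss` and the helper `roundTrips` are made up for this example.
It shows that bytes decoded with the wrong charset may not be recoverable
from the resulting String, while a single-byte charset like ISO-8859-1
happens to preserve them:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DecodingLoss {
    // Decode raw bytes with the given charset, encode the String back,
    // and report whether the original bytes survived the round trip.
    static boolean roundTrips(byte[] raw, Charset cs) {
        String decoded = new String(raw, cs);
        return Arrays.equals(raw, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9 (e-acute in ISO-8859-1); a lone 0xE9 is
        // not a valid UTF-8 sequence.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding as UTF-8 substitutes U+FFFD for the malformed byte,
        // so the original byte value is gone from the String.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        System.out.println(asUtf8.indexOf('\uFFFD') >= 0);                 // true
        System.out.println(roundTrips(raw, StandardCharsets.UTF_8));      // false

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        System.out.println(roundTrips(raw, StandardCharsets.ISO_8859_1)); // true
    }
}
```

This is only meant to show the information loss, not a way to do per-field
detection inside Jackson.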

So you would need to reimplement the parser to make it flexible enough,
and figure out how to pass
information on alternate encoding(s) to it somehow.

-+ Tatu +-

