On Wed, Apr 25, 2018 at 6:51 PM, <[email protected]> wrote:
> Hello,
>
> I wanted to ask about a bizarre situation I've run into lately, decoding a
> uniquely weird kind of JSON. I have JSON documents coming from a web
> crawling service, which fetches webpages from all over the world. These
> pages can be formatted in all sorts of legacy text encodings (which I wish
> had never existed in the first place), such as Latin1, LatinX,
> Windows-12XX, and in my current case, EUC-JP and Shift-JIS.
>
> The web crawler generates JSON out of this, which comes into my system
> containing lots of hostile and more-or-less illegal input, such as
> corrupted, unpaired, or otherwise invalid surrogates and other such byte
> sequences in the UTF-8.
Right. This does happen, alas. :-/

> Technically, Jackson can deserialize this "just fine", except not really,
> because now you have a whole ton of Java String instances in this tree
> with bogus / unknown / invalid / illegal bytes inside, and some tools
> downstream of me, trying to use my APIs, are exploding when they try to
> deal with these insane bytes, which I need to clean up first. I could try
> to write something that walks the tree and un-corrupts all the Strings,
> but it's very hard to get at the original raw bytes from inside these
> damaged Strings and fix them the way they should be fixed.
>
> The good news is that Mozilla and some open-source hackers have made a
> library for dealing with these mangled Strings:
> https://github.com/albfernandez/juniversalchardet . However, every String
> in a single JSON input from the crawler can potentially have a different
> encoding. So, instead of trying to guess the encoding of the entire raw
> JSON, I need to try to guess the encoding of each String before
> deserializing.
>
> So, I wanted to ask whether the system will let me create a custom
> StdDeserializer that takes over the deserialization of String, even though
> it's a kind-of-magic built-in Java type and not a regular POJO, so I can
> pass each String through the encoding detector and un-corrupt it. That
> way, when Jackson assembles the whole structure, all of the corrupt
> Strings would have been repaired as far as possible and re-encoded into
> proper UTF-8, the way they always should have been.

At the point where deserializers handle content, character decoding has
already been done, and information has potentially been lost and/or
corrupted. But if we go down to a lower level, the decoder (`JsonParser`)
is responsible for tokenization, and is in a better position.
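For reference, the custom-String-deserializer approach the question asks about can be wired up via a `SimpleModule`; here is a minimal sketch. Keep in mind the caveat above: by the time this deserializer runs, the parser has already decoded bytes into a Java String, so only char-level repair is possible. The class names (`CleaningStringDeserializer`, `Item`, `cleanName`) and the replacement-character repair step are illustrative assumptions, not part of any Jackson API.

```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.deser.std.StdDeserializer;
import com.fasterxml.jackson.databind.module.SimpleModule;

public class StringFixerDemo {
    // Intercepts deserialization of every String-typed property.
    static class CleaningStringDeserializer extends StdDeserializer<String> {
        CleaningStringDeserializer() { super(String.class); }

        @Override
        public String deserialize(JsonParser p, DeserializationContext ctxt)
                throws IOException {
            // The parser has already produced a Java String here; the raw
            // input bytes are no longer available at this level.
            String raw = p.getValueAsString();
            // Hypothetical repair: swap out U+FFFD left by bad decoding.
            return raw.replace('\uFFFD', '?');
        }
    }

    public static class Item {
        public String name;
    }

    static String cleanName(String json) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        SimpleModule module = new SimpleModule();
        module.addDeserializer(String.class, new CleaningStringDeserializer());
        mapper.registerModule(module);
        return mapper.readValue(json, Item.class).name;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(cleanName("{\"name\":\"abc\uFFFDdef\"}"));
    }
}
```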
I would probably approach this from the perspective of using another
library to detect the encoding and construct an `InputStreamReader` for
that encoding (the library may offer that integration out of the box too),
and then use the resulting reader to create a parser:

    JsonParser p = jsonFactory.createParser(reader);

which may then be given as the input source to `ObjectMapper` (or
`ObjectReader`). Jackson does not really have to know about the potential
complexity of detecting the encoding and attempting to fix possible
Unicode errors.

-+ Tatu +-

>
> Thanks,
> Matthew.
>
> --
> You received this message because you are subscribed to the Google Groups
> "jackson-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
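Putting that suggestion together with juniversalchardet might look roughly like the sketch below: detect the charset from the raw bytes, build an `InputStreamReader` with it, and hand the reader to Jackson. The class name `DetectingJsonReader`, the helper method names, and the UTF-8 fallback when detection fails are my assumptions, not an official integration; detection reliability also varies with input length.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectingJsonReader {
    // Guess the charset of the raw bytes; fall back to UTF-8 (an assumed
    // default, not something the library mandates) when detection fails.
    static Charset detectCharset(byte[] bytes) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        return (name != null) ? Charset.forName(name) : StandardCharsets.UTF_8;
    }

    // Decode with the detected charset, then let Jackson parse chars only;
    // Jackson never has to know how the bytes were decoded.
    static JsonNode readTree(ObjectMapper mapper, byte[] rawJson) throws IOException {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(rawJson), detectCharset(rawJson));
        return mapper.readTree(reader);
    }

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        byte[] raw = "{\"greeting\":\"hello\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTree(mapper, raw).get("greeting").asText());
    }
}
```

Note that the detector here runs over the whole document; per-String detection, as the question asks for, would instead apply a similar repair step inside a custom deserializer, after decoding has already happened.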
