On Wed, Apr 25, 2018 at 6:51 PM, <[email protected]> wrote:
> Hello,
>
> I wanted to ask about a bizarre situation I've run into lately, decoding a
> uniquely weird kind of JSON. I have JSON documents coming from a web
> crawling service, which fetches webpages from all over the world. These
> pages can be formatted in all sorts of legacy text encodings (which I wish
> had never existed in the first place), such as Latin1, LatinX,
> Windows-12XX, and in my current case, EUC-JP and Shift-JIS.
>
> The web crawler generates JSON out of this, which comes into my system
> containing lots of hostile and more-or-less illegal input, such as
> corrupted, unpaired, or otherwise invalid surrogates and other such byte
> sequences in the UTF-8.
Right. This does happen, alas. :-/

> Technically, Jackson can deserialize this "just fine", except not really,
> because now you have a whole ton of Java String instances in this tree
> with bogus / unknown / invalid / illegal bytes inside, and some tools
> downstream of me, trying to use my APIs, are exploding when they try to
> deal with these insane bytes, which I need to clean up first. I could try
> to write something that walks the tree and un-corrupts all the Strings,
> but it's very hard to get at the original raw bytes from inside these
> damaged Strings and fix them the way they should be fixed.
>
> The good news is that Mozilla and some open-source hackers have made a
> library for dealing with these mangled Strings:
> https://github.com/albfernandez/juniversalchardet . However, every String
> in a single JSON input from the crawler can potentially have a different
> encoding. So, instead of trying to guess the encoding of the entire raw
> JSON, I need to try to guess the encoding of each String before
> deserializing.
>
> So, I wanted to ask whether the system will let me create a custom
> StdDeserializer that takes over the deserialization of String, even though
> it's a kind-of-magic built-in Java type and not a regular POJO, so I can
> pass each String through the encoding detector and un-corrupt it. That
> way, when Jackson assembles the whole structure, all of the corrupt
> Strings would have been repaired as far as possible and re-encoded into
> proper UTF-8, the way they always should have been.

At the point where deserializers handle content, character decoding has
already been done, and information has potentially been lost and/or
corrupted. But if we go down to a lower level, the decoder (`JsonParser`)
is responsible for tokenization, and is in a better position.
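For reference, the custom-String-deserializer approach the question asks about can be wired up via a `SimpleModule`; here is a minimal sketch. Keep in mind the caveat above: by the time this deserializer runs, the parser has already decoded bytes into a Java String, so only char-level repair is possible. The class names (`CleaningStringDeserializer`, `Item`, `cleanName`) and the replacement-character repair step are illustrative assumptions, not part of any Jackson API.

```java
import java.io.IOException;

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.deser.std.StdDeserializer;
import com.fasterxml.jackson.databind.module.SimpleModule;

public class StringFixerDemo {
    // Intercepts deserialization of every String-typed property.
    static class CleaningStringDeserializer extends StdDeserializer<String> {
        CleaningStringDeserializer() { super(String.class); }

        @Override
        public String deserialize(JsonParser p, DeserializationContext ctxt)
                throws IOException {
            // The parser has already produced a Java String here; the raw
            // input bytes are no longer available at this level.
            String raw = p.getValueAsString();
            // Hypothetical repair: swap out U+FFFD left by bad decoding.
            return raw.replace('\uFFFD', '?');
        }
    }

    public static class Item {
        public String name;
    }

    static String cleanName(String json) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        SimpleModule module = new SimpleModule();
        module.addDeserializer(String.class, new CleaningStringDeserializer());
        mapper.registerModule(module);
        return mapper.readValue(json, Item.class).name;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(cleanName("{\"name\":\"abc\uFFFDdef\"}"));
    }
}
```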
I would probably approach this from the perspective of using another
library to detect the encoding and construct an `InputStreamReader` for
that encoding (the library may offer that integration out of the box too),
and then use the resulting reader to create a parser:

    JsonParser p = jsonFactory.createParser(reader);

which may then be given as the input source to `ObjectMapper` (or
`ObjectReader`). Jackson does not really have to know about the potential
complexity of detecting the encoding and attempting to fix possible
Unicode errors.

-+ Tatu +-

>
> Thanks,
> Matthew.
>
> --
> You received this message because you are subscribed to the Google Groups
> "jackson-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
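Putting that suggestion together with juniversalchardet might look roughly like the sketch below: detect the charset from the raw bytes, build an `InputStreamReader` with it, and hand the reader to Jackson. The class name `DetectingJsonReader`, the helper method names, and the UTF-8 fallback when detection fails are my assumptions, not an official integration; detection reliability also varies with input length.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectingJsonReader {
    // Guess the charset of the raw bytes; fall back to UTF-8 (an assumed
    // default, not something the library mandates) when detection fails.
    static Charset detectCharset(byte[] bytes) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(bytes, 0, bytes.length);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        return (name != null) ? Charset.forName(name) : StandardCharsets.UTF_8;
    }

    // Decode with the detected charset, then let Jackson parse chars only;
    // Jackson never has to know how the bytes were decoded.
    static JsonNode readTree(ObjectMapper mapper, byte[] rawJson) throws IOException {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(rawJson), detectCharset(rawJson));
        return mapper.readTree(reader);
    }

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        byte[] raw = "{\"greeting\":\"hello\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(readTree(mapper, raw).get("greeting").asText());
    }
}
```

Note that the detector here runs over the whole document; per-String detection, as the question asks for, would instead apply a similar repair step inside a custom deserializer, after decoding has already happened.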
