gianm edited a comment on pull request #10383:
URL: https://github.com/apache/druid/pull/10383#issuecomment-691239531


   > I don't know whether Jackson provides any way to skip the ill-formed text, 
but I guess there's no such API for a parser.
   > So what's your suggestion on this problem? @gianm @jihoonson
   
   Hmm, this is definitely a problem. When we're reading JSON from a file, we 
should be skipping the "lines" that aren't parseable, and marking them as 
unparseable. I guess that ObjectMapper API was too good to be true 🙂
   
   Maybe we can do one of the following two things.
   
   Option 1: After an error we can manually skip to the next line that starts 
with `{` and resume parsing there. Because the JsonParser has a buffer, we need 
to make sure we don't miss buffered but unparsed content. We might be able to 
do that with the `JsonParser.releaseBuffered` method, assuming it works 
properly after an error.
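
   A minimal sketch of the skipping step in Option 1, under simplifying 
assumptions: the input is already an in-memory string and the error offset is 
known. The class and method names are hypothetical. It does not address the 
buffering concern above; with a real JsonParser, the buffered but unparsed 
content would have to be recovered first.

```java
// Hypothetical helper illustrating Option 1's "skip ahead" step; not Druid code.
final class SkipToNextObjectSketch {

    // Returns the index of the first line after errorOffset that begins with '{',
    // or -1 if there is no such line. The line containing the error is skipped.
    static int nextObjectLineStart(String text, int errorOffset) {
        int newline = text.indexOf('\n', errorOffset);
        while (newline >= 0) {
            int lineStart = newline + 1;
            if (lineStart < text.length() && text.charAt(lineStart) == '{') {
                return lineStart;
            }
            newline = text.indexOf('\n', lineStart);
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "{\"a\":1}\ngarbage\n{\"a\":2}\n";
        // Pretend the parser failed at offset 8, the start of "garbage".
        int resumeAt = nextObjectLineStart(text, 8);
        System.out.println(text.substring(resumeAt).trim());
    }
}
```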
   
   Option 2: Introduce a `lineSplittable` parameter for the JsonReader like you 
originally suggested, but it would behave differently in two ways. First: it 
wouldn't be available on the JsonInputFormat. Instead, it would be 
automatically set to `true` for batch ingestion and `false` for streaming 
ingestion. Second: `false` doesn't mean "not line splittable", it just means 
"we should auto detect". In the `true` case (batch ingestion), the JsonReader 
would split on lines and parse each line individually. Batch ingestion won't 
support pretty-printed input, but it never did in the past, and it doesn't seem 
likely that it's a common requirement anyway. In the `false` case (streaming 
ingestion), the JsonReader would use the ObjectMapper.readValues API, and it 
would support all these different kinds of payloads. But if there's an error, 
it would reject the entire payload. That all-or-nothing behavior seems OK for a 
streaming ingestion use case.
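
   To make the difference between the two modes concrete, here is a rough 
sketch of the behavior described in Option 2. It is not Druid's JsonReader: the 
class, the `Result` type, and the toy parsers are all hypothetical stand-ins, 
used so the example runs without Jackson. The toy multi-value parser only 
imitates the all-or-nothing failure mode, not the ability to read 
pretty-printed input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the two modes in Option 2; not Druid's actual JsonReader.
final class LineSplittableSketch {

    // Parsed rows plus a count of inputs that could not be parsed.
    static final class Result<T> {
        final List<T> rows = new ArrayList<>();
        int unparseable = 0;
    }

    // Stand-in for a real single-object JSON parser (assumption): accepts any
    // line shaped like {...} and throws on everything else.
    static String toyParse(String line) {
        String s = line.trim();
        if (s.length() < 2 || !s.startsWith("{") || !s.endsWith("}")) {
            throw new IllegalArgumentException("unparseable: " + s);
        }
        return s;
    }

    // Stand-in for a multi-value reader such as ObjectMapper.readValues
    // (assumption): it fails as a whole if any part of the payload is bad.
    static List<String> toyParseAll(String payload) {
        List<String> out = new ArrayList<>();
        for (String line : payload.split("\\R")) {
            if (!line.trim().isEmpty()) {
                out.add(toyParse(line));
            }
        }
        return out;
    }

    // lineSplittable = true (batch): parse each line on its own, so a bad line
    // is counted as unparseable and the remaining lines are still read.
    static <T> Result<T> readLineByLine(String payload, Function<String, T> parseOne) {
        Result<T> result = new Result<>();
        for (String line : payload.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            try {
                result.rows.add(parseOne.apply(line));
            } catch (RuntimeException e) {
                result.unparseable++;
            }
        }
        return result;
    }

    // lineSplittable = false (streaming): hand the whole payload to the
    // multi-value parser; any error rejects the entire payload.
    static <T> Result<T> readWholePayload(String payload, Function<String, List<T>> parseAll) {
        Result<T> result = new Result<>();
        try {
            result.rows.addAll(parseAll.apply(payload));
        } catch (RuntimeException e) {
            result.unparseable = 1; // the payload is rejected as a single unit
        }
        return result;
    }

    public static void main(String[] args) {
        String payload = "{\"a\":1}\nnot json\n{\"a\":2}";
        Result<String> batch = readLineByLine(payload, LineSplittableSketch::toyParse);
        Result<String> stream = readWholePayload(payload, LineSplittableSketch::toyParseAll);
        System.out.println("batch: " + batch.rows.size() + " rows, " + batch.unparseable + " unparseable");
        System.out.println("stream: " + stream.rows.size() + " rows, " + stream.unparseable + " unparseable");
    }
}
```

   With one bad line in the middle, the line-by-line mode keeps the two good 
rows and counts one unparseable line, while the whole-payload mode returns no 
rows at all.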
   
   IMO Option 2 sounds nicer, since it doesn't involve doing weird stuff with 
the ObjectReader APIs to try to skip unparseable data.




