gianm edited a comment on pull request #10383:
URL: https://github.com/apache/druid/pull/10383#issuecomment-691239531
> I don't know whether Jackson provides any way to skip the ill-formed text,
> but I guess there's no such API for a parser.
> So what's your suggestion on this problem? @gianm @jihoonson
Hmm, this is definitely a problem. When we're reading JSON from a file, we
should be skipping the "lines" that aren't parseable, and marking them as
unparseable. I guess that ObjectMapper API was too good to be true 🙂
Maybe we can do one of the following two things.
Option 1: After an error we can manually skip to the line that starts with
`{` and begin parsing there. Because the JsonParser has a buffer, we need to
make sure we don't miss buffered but unparsed content. We might be able to do
that with the `JsonParser.releaseBuffered` method, assuming it works properly
after an error.
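A rough sketch of the resync step in Option 1, under a big simplifying assumption: the whole unconsumed input is already in memory as a `String`, so the buffering concern above does not arise. `ResyncSketch` and `nextObjectStart` are hypothetical names for illustration, not Druid or Jackson APIs. With a real JsonParser, the buffered-but-unparsed bytes would first have to be recovered, possibly via `releaseBuffered`.

```java
// Hypothetical helper illustrating "skip to the next line that starts with '{'".
// Assumes the remaining input is fully available as a String.
final class ResyncSketch {
    /**
     * Returns the index of the first line that begins after errorOffset and
     * whose first character is '{', or -1 if there is no such line.
     */
    static int nextObjectStart(String text, int errorOffset) {
        // Jump to the end of the line that contains the error.
        int newline = text.indexOf('\n', errorOffset);
        while (newline >= 0 && newline + 1 < text.length()) {
            int lineStart = newline + 1;
            if (text.charAt(lineStart) == '{') {
                return lineStart;
            }
            // Not an object start; move on to the next line.
            newline = text.indexOf('\n', lineStart);
        }
        return -1;
    }
}
```

Parsing would then restart from the returned offset, and everything between the error and that offset would be counted as unparseable.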
Option 2: Introduce a `lineSplittable` parameter for the JsonReader like you
originally suggested, but it would behave differently in two ways. First: it
wouldn't be available on the JsonInputFormat. Instead, it would be
automatically set to `true` for batch ingestion and `false` for streaming
ingestion. Second: `false` doesn't mean "not line splittable", it just means
"we should auto detect". In the `true` case (batch ingestion), the JsonReader
will split on lines and parse each line individually. Batch ingestion won't
support pretty-printed input, but it never did in the past, and it doesn't seem
likely that it's a common requirement anyway. In the `false` case (streaming
ingestion), the JsonReader would use the ObjectMapper.readValues API, and it
would support all these different kinds of payloads. But if there's an error,
it would reject the entire payload. That all-or-nothing behavior seems OK for a
streaming ingestion use case.
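Roughly, the two modes in Option 2 could look like the sketch below. It is only an illustration: `LineSplitSketch`, `Result`, and the `parseAll` function are hypothetical stand-ins, with `parseAll` playing the role of something like `ObjectMapper.readValues` (it returns every value found in the text it is given, or throws). The stand-in uses unchecked exceptions, whereas Jackson's parse errors are checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the two modes; not the actual JsonReader.
final class LineSplitSketch {
    /** Parsed rows plus the raw chunks (lines or whole payloads) that failed. */
    static final class Result<T> {
        final List<T> rows = new ArrayList<>();
        final List<String> unparseable = new ArrayList<>();
    }

    static <T> Result<T> parse(
        String payload,
        boolean lineSplittable,
        Function<String, List<T>> parseAll
    ) {
        Result<T> result = new Result<>();
        if (lineSplittable) {
            // Batch mode: each line is parsed on its own, so one bad line
            // is recorded as unparseable without affecting its neighbors.
            for (String line : payload.split("\\R")) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                parseChunk(line, parseAll, result);
            }
        } else {
            // Streaming mode: the whole payload is one unit. Any error
            // rejects the entire payload (all-or-nothing).
            parseChunk(payload, parseAll, result);
        }
        return result;
    }

    private static <T> void parseChunk(
        String chunk,
        Function<String, List<T>> parseAll,
        Result<T> result
    ) {
        try {
            // Rows are added only if the whole chunk parsed successfully.
            result.rows.addAll(parseAll.apply(chunk));
        } catch (RuntimeException e) {
            result.unparseable.add(chunk);
        }
    }
}
```

In batch mode a pretty-printed object spanning several lines would show up as several unparseable lines, which matches the trade-off described above.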
IMO Option 2 sounds nicer, since it doesn't involve doing weird stuff with
the ObjectReader APIs to try to skip unparseable data.