alamb opened a new issue, #9329:
URL: https://github.com/apache/arrow-rs/issues/9329

   I have a big-picture question:
   
   What trade-offs are we willing to make on validation of JSON values we will 
ultimately discard?
   
   At one extreme, we could fully parse and validate everything and just choose 
not to append the skipped bits to the tape afterward.
   * CON: Strongly limits performance gain of skipping, because parsing and 
validation are the lion's share of work.
   
   At the other extreme, we completely ignore the bytes corresponding to 
skipped values, doing only the bare minimum needed to be relatively confident 
we correctly identified the byte range to skip.
   * CON: Accepts blatantly invalid JSON as long as the bytes satisfy whatever 
region identification heuristics we come up with.
   * CON: Risk of identifying the wrong region and skipping bytes that should 
not have been skipped.
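
   To make the lenient extreme concrete, here is a minimal sketch of a skipper 
that only tracks string state and nesting depth. The function name 
`skip_value_lenient` and its signature are hypothetical, chosen for 
illustration; this is not the PR's code and not an arrow-rs API.

```rust
/// Hypothetical helper (illustration only, not part of arrow-rs).
/// Returns the index one past the end of the JSON value that begins at or
/// after `start`, or `None` if the input ends before the value is closed.
/// It tracks only string state and nesting depth; it never validates contents.
fn skip_value_lenient(input: &[u8], start: usize) -> Option<usize> {
    let mut i = start;
    // Skip whitespace before the value.
    while i < input.len() && input[i].is_ascii_whitespace() {
        i += 1;
    }
    let first = *input.get(i)?;
    if !matches!(first, b'{' | b'[' | b'"') {
        // Scalar (number, true, false, null, or garbage): scan to a delimiter.
        while i < input.len()
            && !matches!(input[i], b',' | b'}' | b']')
            && !input[i].is_ascii_whitespace()
        {
            i += 1;
        }
        return Some(i);
    }
    let mut depth = 0usize;
    let mut in_string = false;
    while i < input.len() {
        let b = input[i];
        if in_string {
            match b {
                b'\\' => i += 1, // skip the escaped byte without inspecting it
                b'"' => {
                    in_string = false;
                    if depth == 0 {
                        return Some(i + 1); // top-level string value ended
                    }
                }
                _ => {}
            }
        } else {
            match b {
                b'"' => in_string = true,
                b'{' | b'[' => depth += 1,
                // Any closer matches any opener: `[` may be closed by `}`.
                b'}' | b']' => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {}
            }
        }
        i += 1;
    }
    None // input ended inside the value
}
```

   On well-formed input this finds the right range, but it also accepts 
`[1 2 nope}` as a complete value (returning `Some(10)`), which is the first 
CON above. Rejecting that input would require parsing the skipped bytes, 
which moves back toward the strict extreme.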
   
   I think this PR currently leans toward the lenient-for-max-performance end 
of the spectrum. That's not necessarily bad, but the PR doesn't really talk 
about the trade-off. For example, if we decide we want to be maximally lenient 
in order to skip as quickly as possible, this PR may not be aggressive enough 
(dunno, haven't explored that direction yet). On the other hand, if we favor 
correctness even for skipped values, then this PR is probably too lenient (a 
motivating factor behind some of my previous comments, which I wasn't fully 
self-aware of at the time).
   
   Do we know what we want?
   
   _Originally posted by @scovich in 
https://github.com/apache/arrow-rs/issues/9097#issuecomment-3818890681_
               


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

