cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171166647

   Datasets that come from JSON-producing APIs often have unpredictable blob 
sizes, so it's hard to make objective statements about how often large rows 
occur.
   
   Even if we had a count of N datasets with "large" rows, who's to say whether 
that's frequent or not?
   
   The main point is, in the short term, to have a default `block_size` that's 
big enough to accommodate "unreasonably" large rows without forcing users to 
fiddle with it, and in the medium to long term to implement a solution using 
block resizing, or perhaps to explore a streaming JSON parser that could allow 
a constant block size.
   
   We're currently working with 2020 US election data pulled (at the time of 
the election) from the New York Times. Each row is about 10 MB of JSON. There's 
a _ton_ of nesting.
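   
   As a concrete illustration of the current workaround (a sketch, assuming the 
Python bindings; the file name is hypothetical), bumping `block_size` in 
`pyarrow.json.ReadOptions` comfortably past the largest row keeps each JSON 
object within a single block:
   
   ```python
   import pyarrow.json as paj
   
   # Rows in this dataset are ~10 MB each, so pick a block size comfortably
   # larger than the biggest row (32 MiB here) so no object straddles blocks.
   read_options = paj.ReadOptions(block_size=32 * 1024 * 1024)
   
   # "election.jsonl" is a hypothetical newline-delimited JSON file.
   table = paj.read_json("election.jsonl", read_options=read_options)
   ```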

