cpcloud commented on PR #13442: URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171166647
Datasets that come from JSON-producing APIs often have unpredictable blob sizes, so it's hard to make objective statements about how frequently large rows occur. Even if we had a count of N datasets with "large" rows, who's to say whether that's frequent or not?

The main point is, in the short term, to have a default `block_size` that's big enough to accommodate "unreasonably" large rows without forcing users to fiddle with it, and in the medium to long term to implement a solution using block resizing, or perhaps explore a streaming JSON parser that might allow a constant block size.

For a concrete example: we're currently working with 2020 US election data pulled (at the time of the election) from the New York Times. Each row is about 10 MB of JSON, with a _ton_ of nesting.
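In the meantime, the workaround is to raise `block_size` by hand. Here's a minimal sketch using `pyarrow.json`; the 64 MiB value and the file name are illustrative placeholders, not recommendations:

```python
import pyarrow.json as paj

# Raise block_size so a single ~10 MB row fits entirely in one parse block.
# 64 MiB here is an arbitrary illustrative value.
read_options = paj.ReadOptions(block_size=64 << 20)

# Hypothetical line-delimited JSON file with large, deeply nested rows.
table = paj.read_json("election_results.jsonl", read_options=read_options)
```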