cpcloud commented on PR #13442:
URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171183147

   > > Each row is about 10 MB of JSON.
   > 
   > So 16MB is just barely adequate and may be too small for other similar 
datasets?
   
   I guess? 16 MB is only about 60% of headroom over a 10 MB row; without 
something concrete, it's unclear whether that's barely adequate for similar 
datasets.
   
   I don't think we're going to address this in general without implementing 
JSON reading in a way where block size is not a user-facing concern.
   
   > 
   > Keep in mind that the block size is not merely used for type inference, 
it's used as a unit of work for batching and parallelization. A large value 
could be detrimental to performance.
   
   It seems reasonable to trade off performance for being able to do anything 
at all. If I have to pass in the block size anyway just to get working code, 
I'm not yet thinking about performance.
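   
   For illustration, a minimal sketch of the workaround under discussion, 
assuming the pyarrow JSON reader and a hypothetical newline-delimited file 
`rows.jsonl`: pass an explicit `block_size` via `ReadOptions` that comfortably 
exceeds the largest row.
   
   ```python
   import pyarrow.json as pj
   
   # Rows of ~10 MB of JSON overflow a too-small block size, so pass an
   # explicit block_size that comfortably exceeds the largest row.
   # "rows.jsonl" is a hypothetical newline-delimited JSON file.
   read_options = pj.ReadOptions(block_size=32 << 20)  # 32 MiB per block
   table = pj.read_json("rows.jsonl", read_options=read_options)
   ```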

