cpcloud commented on PR #13442: URL: https://github.com/apache/arrow/pull/13442#issuecomment-1171166647
Datasets that come from JSON-producing APIs often have unpredictable blob sizes, so it's hard to make objective statements about how frequently large rows occur. Even if we had a count of N datasets with "large" rows, who's to say whether that's frequent or not?

The main point is, in the short term, to have a default `block_size` that's big enough to accommodate "unreasonably" large rows without forcing users to fiddle with it, and in the medium to long term to implement a solution using block resizing, or perhaps explore a streaming JSON parser that might allow a constant block size.

For a concrete example: we're currently working with 2020 US election data pulled (at the time of the election) from the New York Times. Each row is about 10 MB of JSON, with a _ton_ of nesting.
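In the meantime, the workaround is to raise `block_size` by hand. Here's a minimal sketch using `pyarrow.json`; the 64 MiB value and the file name are illustrative placeholders, not recommendations:

```python
import pyarrow.json as paj

# Raise block_size so a single ~10 MB row fits entirely in one parse block.
# 64 MiB here is an arbitrary illustrative value.
read_options = paj.ReadOptions(block_size=64 << 20)

# Hypothetical line-delimited JSON file with large, deeply nested rows.
table = paj.read_json("election_results.jsonl", read_options=read_options)
```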