westonpace commented on issue #5770:
URL: https://github.com/apache/arrow-rs/issues/5770#issuecomment-2113963336

   Also, I did do some experimentation here (what feels like a long time ago now, but was at least a year ago). Two things I observed:
   
    * The in-memory footprint of the Parquet metadata seemed way too high: orders of magnitude greater than the on-disk size of the metadata. I assume this was specific to the C++ implementation and there was just something I was missing, but it was pretty significant even with relatively reasonably-sized files (e.g. 50MiB+ per file). A rough way to get the on-disk half of that measurement is sketched after this list.
    * The parquet-c++ library (and, to a lesser degree, arrow and arrow datasets) was not written with large numbers of columns in mind. So even if the format itself could work, the implementation ends up being O(N^2) in a lot of places where it could presumably do a lot better (a toy illustration of the pattern follows below).
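
   For the first point, the on-disk side of the comparison is easy to measure without any library at all: per the Parquet format spec, a file ends with the Thrift-serialized footer metadata, a 4-byte little-endian length of that metadata, and the 4-byte magic `PAR1`. Here is a minimal Rust sketch (standard library only; `data.parquet` is a placeholder path) that reads that length. Measuring the decoded in-memory size is the harder half and would need instrumentation in whichever reader you're testing; I'm not assuming any particular helper for that here.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Returns the serialized (on-disk) size of a Parquet file's footer
/// metadata. Per the format spec, a Parquet file ends with a 4-byte
/// little-endian metadata length followed by the 4-byte magic "PAR1".
fn on_disk_metadata_size(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    file.read_exact(&mut tail)?;
    assert_eq!(&tail[4..], b"PAR1", "not a Parquet file");
    Ok(u32::from_le_bytes(tail[..4].try_into().unwrap()) as u64)
}

fn main() -> std::io::Result<()> {
    // "data.parquet" is a placeholder; point this at a real file.
    let size = on_disk_metadata_size("data.parquet")?;
    println!("on-disk footer metadata: {size} bytes");
    Ok(())
}
```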
   
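   For the second point, here is a toy illustration (not parquet-c++'s actual code) of how the quadratic behavior typically creeps in with wide schemas: resolving each of N projected columns with a linear scan over N schema fields is O(N^2) overall, while building a name index first brings it back to O(N). Any place that does per-column work by re-walking the full schema (or the full list of column chunks) has this shape.

```rust
use std::collections::HashMap;

/// O(N^2): each of the N projected columns does an O(N) scan of the schema.
fn project_linear(schema: &[String], projection: &[String]) -> Vec<usize> {
    projection
        .iter()
        .map(|name| {
            schema
                .iter()
                .position(|field| field == name)
                .expect("column not in schema")
        })
        .collect()
}

/// O(N): one pass to build a name -> index map, then O(1) per lookup.
fn project_indexed(schema: &[String], projection: &[String]) -> Vec<usize> {
    let index: HashMap<&str, usize> = schema
        .iter()
        .enumerate()
        .map(|(i, field)| (field.as_str(), i))
        .collect();
    projection
        .iter()
        .map(|name| index[name.as_str()])
        .collect()
}

fn main() {
    let schema: Vec<String> = (0..4).map(|i| format!("col{i}")).collect();
    let projection = vec!["col2".to_string(), "col0".to_string()];
    assert_eq!(project_linear(&schema, &projection), vec![2, 0]);
    assert_eq!(project_indexed(&schema, &projection), vec![2, 0]);
}
```
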
   I realize this second point is kind of what you are trying to show. I'd be interested in measurements of the first point, at least enough to prove I was missing something (or that this problem is unique to parquet-c++).

