corwinjoy commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1899358756

   @emkornfield @mapleFU 
   I 100% agree that these giant tables with huge numbers of rows and columns 
are an anti-pattern. Yet, here we are. It all started out well with 
reasonably sized tables, but over the years we have found many 
informative columns for our models and collected more data, so that now we are 
up to tens of thousands of columns and millions of rows. Eventually, we hope to 
add some kind of indexing layer (e.g. Apache Iceberg, a database, etc.), but that 
will require a significant investment in design and maintenance, and even 
refactoring our code away from Arrow. In the meantime, we'd really like to make 
Arrow performance a lot better for this case, and I think we are not the only 
users whose data has grown over the years. As for the random sampling, I think 
this is a pretty common use case: many deep learning models sample random 
batches for training. Also, ensemble methods such as random forests 
randomly sample rows and columns to create "independent" models.
   
   So, back to the high-level design for faster reads. There are two parts:
   1. Read a "minimal" set of metadata. This acts as a kind of "prototype" for 
the individual rowgroups. Yes, this is slightly risky since I think it is 
technically possible for a parquet file to change up the set of columns inside 
the file, but I'd rather just throw an error for this case. (Although I'm not 
sure how to detect it / have not built a test case for it.)
   2. Leverage the 
[PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) 
feature to point to the correct rowgroup data. This is designed to make "point 
lookups I/O efficient" so we should be able to leverage it and this PR shows 
one attempt to do so.
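
To make the point-lookup idea in part 2 concrete, here is a minimal pure-Python sketch. It is not the code from this PR and does not use the Arrow or Parquet APIs. It only assumes the shape of the `PageLocation` entries in the Parquet `OffsetIndex` (`offset`, `compressed_page_size`, `first_row_index`); the `find_page` helper, the `num_rows` argument, and the sample numbers are hypothetical and for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLocation:
    """Same fields as a PageLocation entry in the Parquet OffsetIndex."""
    offset: int                # byte offset of the page within the file
    compressed_page_size: int  # number of bytes to read for this page
    first_row_index: int       # first row in the page, relative to the row group


def find_page(locations, row, num_rows):
    """Return the page holding `row` (hypothetical helper, not an Arrow API).

    `locations` must be sorted by first_row_index, which is how the
    OffsetIndex stores them. `num_rows` is the row count of the row group.
    """
    if not locations or not 0 <= row < num_rows:
        raise IndexError(f"row {row} is outside the row group")
    starts = [loc.first_row_index for loc in locations]
    # The last page whose first row is <= `row` is the one that contains it.
    return locations[bisect_right(starts, row) - 1]


# Made-up offset index for one column chunk with 3 pages and 3000 rows.
pages = [
    PageLocation(offset=4, compressed_page_size=100, first_row_index=0),
    PageLocation(offset=104, compressed_page_size=120, first_row_index=1000),
    PageLocation(offset=224, compressed_page_size=90, first_row_index=2500),
]

page = find_page(pages, row=1700, num_rows=3000)
print(page.offset, page.compressed_page_size)
```

With the offset and size in hand, a reader only needs to fetch that one byte range for the requested row instead of scanning the whole column chunk, which is what should make random row sampling cheaper.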

