corwinjoy commented on issue #39676: URL: https://github.com/apache/arrow/issues/39676#issuecomment-1899358756
@emkornfield @mapleFU I 100% agree that these giant tables with huge numbers of rows and columns are an anti-pattern. Yet, here we are. It all started out so well with reasonably sized tables, but over the years we found many informative columns for our models and collected more data, so that we are now up to tens of thousands of columns and millions of rows. Eventually we hope to add some kind of indexing layer (e.g. Apache Iceberg, a database, etc.), but that will require a significant investment in design and maintenance, and even refactoring our code away from Arrow. In the meantime, we'd really like to make Arrow performance a lot better for this case, and I suspect we are not the only users whose data has grown over the years.

As for random sampling, I think this is a pretty common use case: many deep learning models sample random batches for training, and ensemble methods such as random forests randomly sample rows and columns to create "independent" models. (A sketch of this access pattern follows below.)

So, back to the high-level design for faster reads, there are two parts:

1. Read a "minimal" set of metadata that acts as a kind of "prototype" for the individual row groups. Yes, this is slightly risky, since I think it is technically possible for a Parquet file to change the set of columns between row groups, but I'd rather just throw an error for that case. (Although I'm not sure how to detect it / have not built a test case for it.)
2. Leverage the [PageIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) feature to point to the correct row-group data. It is designed to make "point lookups I/O efficient", so we should be able to leverage it, and this PR shows one attempt to do so. (See the second sketch below.)
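To make the sampling pattern concrete, here is a minimal pyarrow sketch of what I mean: parse the footer metadata once, sample row groups and columns, and fail loudly if a row group's column layout differs from the first one (the risky case from point 1). The helper name and the consistency check are my own illustration, not code from this PR; note that today this still pays the full footer-parsing cost, which is exactly what the "prototype" idea is meant to reduce.

```python
import random

import pyarrow.parquet as pq


def sample_row_groups(path, n_row_groups=4, n_columns=16, seed=0):
    """Read a random subset of row groups and columns from a Parquet file."""
    rng = random.Random(seed)
    pf = pq.ParquetFile(path)  # parses the (potentially huge) footer once
    meta = pf.metadata

    # Treat row group 0 as the "prototype": every other row group is
    # expected to carry the same column chunks, in the same order.
    proto = [meta.row_group(0).column(j).path_in_schema
             for j in range(meta.row_group(0).num_columns)]

    groups = rng.sample(range(meta.num_row_groups),
                        min(n_row_groups, meta.num_row_groups))
    columns = rng.sample(proto, min(n_columns, len(proto)))

    tables = []
    for g in groups:
        rg = meta.row_group(g)
        actual = [rg.column(j).path_in_schema for j in range(rg.num_columns)]
        if actual != proto:
            # The risky case described above: column layout changed mid-file.
            raise ValueError(f"row group {g} does not match the prototype")
        tables.append(pf.read_row_group(g, columns=columns))
    return tables
```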
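And for point 2, the OffsetIndex half of the PageIndex reduces a point lookup to a binary search: per the spec, each PageLocation records (offset, compressed_page_size, first_row_index), sorted by first_row_index. As far as I know pyarrow does not expose page-index reading from Python, so the PageLocation values below are stand-ins for what a reader would deserialize from the footer; only the lookup logic is the point.

```python
import bisect
from dataclasses import dataclass


@dataclass
class PageLocation:
    # Mirrors the PageLocation struct in the parquet-format spec.
    offset: int                # file offset of the data page
    compressed_page_size: int  # page header + compressed page bytes
    first_row_index: int       # index of the page's first row in the row group


def locate_page(offset_index, target_row):
    """Find the data page containing target_row within one column chunk.

    offset_index is the column chunk's OffsetIndex: PageLocations sorted
    by first_row_index, as the spec guarantees.
    """
    firsts = [p.first_row_index for p in offset_index]
    i = bisect.bisect_right(firsts, target_row) - 1
    if i < 0:
        raise IndexError("target_row precedes the first page")
    return offset_index[i]


# Hypothetical offset index for one column chunk with three data pages.
pages = [
    PageLocation(offset=4,    compressed_page_size=1200, first_row_index=0),
    PageLocation(offset=1204, compressed_page_size=1100, first_row_index=500),
    PageLocation(offset=2304, compressed_page_size=900,  first_row_index=1000),
]
loc = locate_page(pages, 742)  # -> the page starting at row 500
```

The payoff is that a reader can then issue a single ranged read of `loc.compressed_page_size` bytes at `loc.offset` instead of scanning the whole column chunk, which is what the spec means by making point lookups I/O efficient.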
