We have some biological datasets with 10,000s of columns and 100,000s of rows. Currently, we store the data in Parquet files and use pyarrow+pandas to read and filter it. Users specify queries to filter the rows and indicate which columns they want to select.
This works well in most cases, but sometimes our users want to select *all* columns (and only some of the rows). Currently we read all the data into a pandas DataFrame and then filter the rows, which is memory intensive when all columns are selected because the full dataset has to be read. We are looking for a way to filter the rows before pulling data for all of the columns. Could we do something like the following?

1. Read only the columns referenced in the filtering criteria and identify the row indices that match.
2. Read just the rows at those indices, across all columns.

(A rough sketch of what we have in mind is included at the end of this message.)

Perhaps (probably?) we are thinking about this all wrong, or perhaps Parquet is the wrong tool for the job. If you have any pointers or tips, we would greatly appreciate them. Thanks for your time!
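For concreteness, here is a rough sketch of the two-step read we have in mind, assuming a single Parquet file read with pyarrow. The file path, column name, and threshold are placeholders:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

path = "dataset.parquet"   # placeholder path
filter_column = "gene_a"   # placeholder column used in the filter criteria
threshold = 0.5            # placeholder filter value

# Step 1: read only the filter column and build a boolean mask of matching rows.
filter_values = pq.read_table(path, columns=[filter_column])[filter_column]
mask = pc.greater(filter_values, threshold)

# Step 2: read the data one row group at a time, keeping only the matching rows,
# so the full unfiltered table is never held in memory at once.
pf = pq.ParquetFile(path)
pieces = []
offset = 0
for i in range(pf.num_row_groups):
    row_group = pf.read_row_group(i)
    pieces.append(row_group.filter(mask.slice(offset, row_group.num_rows)))
    offset += row_group.num_rows

df = pa.concat_tables(pieces).to_pandas()
```

The hope is that peak memory would then be roughly the matching rows across all columns plus one row group, rather than the entire unfiltered dataset. Whether this is the right way to use the pyarrow APIs is exactly what we are unsure about.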
