We have some biological datasets with 10,000s of columns and 100,000s of rows. Currently, we store the data in Parquet files and use pyarrow+pandas to read and filter it. Users specify queries to filter the rows and indicate which columns they want to select.
This works well in most cases, but sometimes our users want to select *all* columns (and only some of the rows). Currently we read all the data into a pandas DataFrame and then filter the rows, which is memory intensive when all columns are selected because the full dataset has to be read. We are looking for a way to filter the rows before pulling data for all of the columns. Could we do something like the following?

1. Read only the columns referenced in the filtering criteria and identify the row indices that match.
2. Read just the rows at those indices, across all columns.

(A rough sketch of what we have in mind is included at the end of this message.)

Perhaps (probably?) we are thinking about this all wrong, or perhaps Parquet is the wrong tool for the job. If you have any pointers or tips, we would greatly appreciate them. Thanks for your time!
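For concreteness, here is a rough sketch of the two-step read we have in mind, assuming a single Parquet file read with pyarrow. The file path, column name, and threshold are placeholders:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

path = "dataset.parquet"   # placeholder path
filter_column = "gene_a"   # placeholder column used in the filter criteria
threshold = 0.5            # placeholder filter value

# Step 1: read only the filter column and build a boolean mask of matching rows.
filter_values = pq.read_table(path, columns=[filter_column])[filter_column]
mask = pc.greater(filter_values, threshold)

# Step 2: read the data one row group at a time, keeping only the matching rows,
# so the full unfiltered table is never held in memory at once.
pf = pq.ParquetFile(path)
pieces = []
offset = 0
for i in range(pf.num_row_groups):
    row_group = pf.read_row_group(i)
    pieces.append(row_group.filter(mask.slice(offset, row_group.num_rows)))
    offset += row_group.num_rows

df = pa.concat_tables(pieces).to_pandas()
```

The hope is that peak memory would then be roughly the matching rows across all columns plus one row group, rather than the entire unfiltered dataset. Whether this is the right way to use the pyarrow APIs is exactly what we are unsure about.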
