[GitHub] [arrow] aiqc opened a new issue #9932: Read parquet via row indexes to support chunking?

GitBox Wed, 07 Apr 2021 09:10:13 -0700


aiqc opened a new issue #9932:
URL: https://github.com/apache/arrow/issues/9932



   Hi there. Is there a way to read only a range of rows (selected by row 
index) from a Parquet file? I don't see an option for it in 
`pyarrow.parquet.read_table`, and there are related questions about it:
   
   - https://stackoverflow.com/questions/64050609/pyarrow-read-parquet-via-column-index-or-order
   - https://stackoverflow.com/questions/62252259/pandas-read-write-parquet-data-using-column-index
   
   Right now I read the whole file into memory and then keep only the rows I 
need, which won't be feasible on larger datasets.
   ```
   import pandas as pd
   # Load every row, then keep only the sampled positions.
   df = pd.read_parquet(my_stream)
   df = df.iloc[samples_indices]
   ```
   
   I feel like I could do this with Spark, but don't want to add that 
dependency.
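
One possible workaround, sketched here as an assumption rather than a built-in row-index option: Parquet files are split into row groups, and `pyarrow.parquet.ParquetFile` can read one row group at a time. The helper names `group_rows_by_row_group` and `read_rows` below are hypothetical and not part of pyarrow. The first function only does the bookkeeping of mapping global row positions to per-row-group offsets; the second uses that plan so that only the row groups containing requested rows are loaded.

```python
from bisect import bisect_right
from itertools import accumulate


def group_rows_by_row_group(row_indices, row_group_sizes):
    """Map global row positions to {row_group_number: [offsets within that group]}."""
    # ends[i] is the global position one past the last row of row group i.
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    plan = {}
    for idx in sorted(row_indices):
        if not 0 <= idx < total:
            raise IndexError(f"row {idx} is outside the file ({total} rows)")
        group = bisect_right(ends, idx)
        start = ends[group] - row_group_sizes[group]
        plan.setdefault(group, []).append(idx - start)
    return plan


def read_rows(path, row_indices):
    """Hypothetical helper: read only the row groups that hold the requested rows."""
    # Imported lazily so the planning function above has no pyarrow dependency.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pf = pq.ParquetFile(path)
    sizes = [pf.metadata.row_group(i).num_rows for i in range(pf.num_row_groups)]
    plan = group_rows_by_row_group(row_indices, sizes)
    pieces = [pf.read_row_group(g).take(pa.array(offsets)) for g, offsets in plan.items()]
    if not pieces:
        return pf.schema_arrow.empty_table()
    return pa.concat_tables(pieces)
```

With this sketch, `read_rows(path, samples_indices).to_pandas()` would stand in for the two pandas lines above. Rows come back in ascending index order rather than the order requested, and the saving depends on how the file was written: if everything sits in a single row group, the whole file is still read.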


