westonpace commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-984228962


   The `read_parquet` operation needs to know more about the partitioning.  At the moment it sees files like `{output_path}/1234/chunk_0_0.parquet` and has no way to tell whether `1234` is meant to be a partition value (and, if so, what the column should be named).  So instead it just does a recursive search for all the files and pretends the inner directories don't exist.
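   
   You can see the difference by declaring the partitioning when opening the dataset directly (a minimal sketch; `output_path` here stands for the same directory your write call targets):
   
   ```python
   import pyarrow.dataset as ds

   # Without a declared partitioning, pyarrow recurses through the
   # directories and the resulting schema has no partition column.
   plain = ds.dataset(output_path, format="parquet")
   print(plain.schema)  # no "code" field

   # Naming the field tells pyarrow to treat the first directory
   # level (e.g. 1234/) as values of a column named "code".
   partitioned = ds.dataset(output_path, format="parquet", partitioning=["code"])
   print(partitioned.schema)  # now includes a "code" field
   ```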
   
   You have two options.  First, you can specify the partitioning on the read:
   
   `pd.read_parquet(path, partitioning=["code"], filters=[('code', '=', '1234')])`
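   
   If you also need control over the partition column's type (directory names are just strings on disk), you can pass a full partitioning object instead of a bare field name.  A sketch, assuming pandas forwards the keyword to pyarrow as in the call above and that `code` should be an integer:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as ds

   # An explicit schema pins the partition column's type rather than
   # letting pyarrow infer it from the directory names.
   part = ds.partitioning(pa.schema([("code", pa.int32())]))
   df = pd.read_parquet(path, partitioning=part, filters=[("code", "=", 1234)])
   ```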
   
   Or, if you don't want to keep track of the partitioning yourself, you can use the `hive` partitioning flavor when you write:
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset  # needed so that pa.dataset resolves

   pa.dataset.write_dataset(
       table,
       output_path,
       basename_template=f"chunk_{y}_{{i}}",
       format="parquet",
       partitioning=["code"],                # the partition column(s)
       partitioning_flavor="hive",           # write code=1234/... style paths
       existing_data_behavior="overwrite_or_ignore",
   )
   ```
   
   This will create paths like `{output_path}/code=1234/chunk_0_0.parquet`.  The `code=1234` segment is unambiguous to pyarrow's inference: it will treat it as a partitioning directory for a column named `code`.  So then you can use the `read_parquet` call you have as-is.
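   
   Once the hive-style paths are on disk they are self-describing, so the read side needs no extra hints.  A sketch, assuming the same `output_path` (the filter value is an integer here because hive inference will typically type `code=1234` as int32):
   
   ```python
   import pandas as pd

   # The code=1234 directory names carry both the column name and the
   # value, so pyarrow discovers the partitioning on its own.
   df = pd.read_parquet(output_path, filters=[("code", "=", 1234)])
   ```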

