[GitHub] [arrow] jorisvandenbossche commented on issue #11027: PyArrow Parquet column partitioning

GitBox Tue, 31 Aug 2021 05:06:35 -0700


jorisvandenbossche commented on issue #11027:
URL: https://github.com/apache/arrow/issues/11027#issuecomment-909174498



   Ah, ARROW-12644 indeed only implemented the _decoding_ when reading, not the 
equivalent _encoding_ when writing. But if we can read such datasets, we 
should probably also make it possible to write them (I will open a JIRA about that).
   
   @wanx4910 To show that we can read values with an encoded `/` (illustrating 
what @westonpace mentioned above), I created a small dataset with two 
directories whose names are URL-encoded values (using a slash-separated date 
format such as 2012/01/01):
   
   ```
   In [44]: !ls test_decoding.parquet/
   2012%2F01%2F01       2012%2F01%2F02
   
   In [45]: dataset = ds.dataset("test_decoding.parquet/", partitioning=["date"], format="parquet")
   
   In [46]: dataset
   Out[46]: <pyarrow._dataset.FileSystemDataset at 0x7f110c345770>
   
   In [47]: dataset.to_table().to_pandas()
   Out[47]: 
      b        date
   0  1  2012/01/01
   1  2  2012/01/02
   ```
   
   So when reading, we can properly decode such values. 
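
For reference, the directory names above are plain percent-encoded (URL-encoded) strings. A minimal sketch of producing and reversing that encoding with the Python standard library could look like the following. The helper names `encode_partition_value` and `decode_partition_value` are hypothetical and not part of pyarrow; this only illustrates the string transformation, not how pyarrow implements it internally.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    # Hypothetical helper: percent-encode every reserved character,
    # including "/" (safe="" disables quote()'s default exemption for "/").
    return quote(value, safe="")


def decode_partition_value(segment: str) -> str:
    # Hypothetical helper: reverse the percent-encoding of a directory name.
    return unquote(segment)


dates = ["2012/01/01", "2012/01/02"]
encoded = [encode_partition_value(d) for d in dates]
print(encoded)  # ['2012%2F01%2F01', '2012%2F01%2F02']
print([decode_partition_value(e) for e in encoded])  # ['2012/01/01', '2012/01/02']
```

Directories named this way can be created by hand, filled with Parquet files, and then read back with `ds.dataset(..., partitioning=["date"])` as in the session above.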
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
