StuartHadfield opened a new issue, #33930:
URL: https://github.com/apache/arrow/issues/33930
### Describe the usage question you have. Please include as many useful details as possible.
Suppose I have a table that includes a `date` column, and I want to
partition in the form:
`year=2011/month=10/day=26/part-0.parquet`
When writing a dataset, how do I accomplish this? Is the only option to
preprocess the table prior to writing?
For example:
```py
import pyarrow as pa
from pyarrow import dataset as ds
from datetime import date

schema = pa.schema([('foo', pa.string()), ('date', pa.date32())])

my_batch = pa.RecordBatch.from_pylist(
    [
        {'foo': 'bar', 'date': date(2022, 1, 1)},
    ],
    schema=schema,
)

ds.write_dataset(
    my_batch,
    base_dir='./',
    format='parquet',
    # Presumably I can split this here? Maybe preprocess the date column,
    # or pass a schema of some sort?
    partitioning=['date'],
    partitioning_flavor='hive',
)
```
Naturally, right now I'll get partitions like
`date=2022-01-01/part-0.parquet`, which isn't what I want.
If the answer is just to preprocess my source data, that's okay. I'm just
finding the docs on partitioning a little confusing. Thanks!
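
For context, here is roughly the preprocessing I imagine would be needed, assuming `pyarrow.compute.year`/`month`/`day` are the right tools for deriving the partition columns (the column names `year`, `month`, `day` are just my guess at a naming convention):

```py
import pyarrow as pa
import pyarrow.compute as pc
from pyarrow import dataset as ds
from datetime import date

schema = pa.schema([('foo', pa.string()), ('date', pa.date32())])
table = pa.Table.from_pylist(
    [{'foo': 'bar', 'date': date(2022, 1, 1)}],
    schema=schema,
)

# Derive year/month/day columns from the date column before writing.
table = table.append_column('year', pc.year(table['date']))
table = table.append_column('month', pc.month(table['date']))
table = table.append_column('day', pc.day(table['date']))

ds.write_dataset(
    table,
    base_dir='./',
    format='parquet',
    # Partition on the derived columns; the hive flavor produces key=value
    # directories, e.g. year=2022/month=1/day=1/part-0.parquet
    partitioning=['year', 'month', 'day'],
    partitioning_flavor='hive',
)
```

Is something along those lines the intended approach, or is there a way to express this directly through the `partitioning` argument?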
### Component(s)
Python