StuartHadfield opened a new issue, #33930:
URL: https://github.com/apache/arrow/issues/33930
### Describe the usage question you have. Please include as many useful details as possible.
Suppose I have a table that includes a `date` column, and I want to
partition in the form:
`year=2011/month=10/day=26/part-0.parquet`
When writing a dataset, how do I accomplish this? Is the only option to
preprocess the table prior to writing?
For example:
```py
import pyarrow as pa
from pyarrow import dataset as ds
from datetime import date

schema = pa.schema([('foo', pa.string()), ('date', pa.date32())])

my_batch = pa.RecordBatch.from_pylist(
    [
        {'foo': 'bar', 'date': date(2022, 1, 1)},
    ],
    schema=schema,
)

ds.write_dataset(
    my_batch,
    base_dir='./',
    format='parquet',
    # Presumably I can split this here? Maybe preprocess the date column,
    # or pass a schema of some sort?
    partitioning=['date'],
    partitioning_flavor='hive',
)
```
Naturally, right now I'll get partitions like
`date=2022-01-01/part-0.parquet`, which isn't what I want.
If the answer is just to preprocess my source data, that's okay. I'm just
finding the docs on partitioning a little confusing. Thanks!
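
For context, here is roughly the preprocessing I imagine would be needed, assuming `pyarrow.compute.year`/`month`/`day` are the right tools for deriving the partition columns (the column names `year`, `month`, `day` are just my guess at a naming convention):

```py
import pyarrow as pa
import pyarrow.compute as pc
from pyarrow import dataset as ds
from datetime import date

schema = pa.schema([('foo', pa.string()), ('date', pa.date32())])
table = pa.Table.from_pylist(
    [{'foo': 'bar', 'date': date(2022, 1, 1)}],
    schema=schema,
)

# Derive year/month/day columns from the date column before writing.
table = table.append_column('year', pc.year(table['date']))
table = table.append_column('month', pc.month(table['date']))
table = table.append_column('day', pc.day(table['date']))

ds.write_dataset(
    table,
    base_dir='./',
    format='parquet',
    # Partition on the derived columns; the hive flavor produces key=value
    # directories, e.g. year=2022/month=1/day=1/part-0.parquet
    partitioning=['year', 'month', 'day'],
    partitioning_flavor='hive',
)
```

Is something along those lines the intended approach, or is there a way to express this directly through the `partitioning` argument?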
### Component(s)
Python