eyevz opened a new issue, #14426:
URL: https://github.com/apache/arrow/issues/14426
I would like to create a dataset over a number of CSV files, specify the
schema both for the files and for the partitioning, and have the dataset
infer the partition dictionary.
Is there anything obviously wrong with what I'm doing below?
This approach _almost_ works:
```python
import pyarrow as pa
import pyarrow.dataset as ds

# Partition field is dictionary-encoded; its values should be inferred.
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
With the above, `dataset.partitioning.dictionaries` is appropriately
populated. However, I'm not happy with the schema that gets inferred for the
CSV files. If I specify the dataset schema as below, it breaks the partition
dictionary inference:
```python
ds_schema = pa.schema([
    pa.field('csv_field', pa.int8()),
    pa.field('partition_label', pa.string()),
])
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    schema=ds_schema,
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
At this point the schema of the dataset is what I want, but
`dataset.partitioning.dictionaries` is `[None]`.
If I instead declare `partition_label` as a dictionary field in the dataset
schema, as below...
```python
ds_schema = pa.schema([
    pa.field('csv_field', pa.int8()),
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    schema=ds_schema,
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
... then I get an `ArrowInvalid` error indicating that I have not provided a
dictionary for field `partition_label`.
Any suggestions for how I can specify the schema of a partitioned dataset
over a large number of CSV files and have the dataset infer the partition
dictionary?
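
One possible fallback, if inference cannot be combined with an explicit schema, would be to collect the partition values from the file paths by hand. The sketch below is only an illustration under assumptions: it assumes directory-style partitioning in which the first path segment below the base directory is the `partition_label` value, and the helper name `collect_partition_values` and the example paths are hypothetical, not part of pyarrow.

```python
from pathlib import PurePosixPath


def collect_partition_values(paths, base_dir, depth=0):
    """Return the sorted unique path segments found `depth` levels below base_dir."""
    base = PurePosixPath(base_dir)
    values = set()
    for path in paths:
        # relative_to raises ValueError if a file lies outside base_dir.
        segments = PurePosixPath(path).relative_to(base).parts
        # The last segment is the file name, so it is never a partition value.
        if depth < len(segments) - 1:
            values.add(segments[depth])
    return sorted(values)


# Hypothetical file listing, standing in for list_of_csv_files.
example_files = [
    "/data/root/a/part-0.csv",
    "/data/root/b/part-0.csv",
    "/data/root/a/part-1.csv",
]
labels = collect_partition_values(example_files, "/data/root")
```

The resulting list could then be supplied explicitly instead of `'infer'`, for example as `dictionaries={'partition_label': pa.array(labels)}` in the `ds.partitioning(...)` call. Whether that avoids the `ArrowInvalid` error in the third variant is untested here.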