[GitHub] [arrow] eyevz opened a new issue, #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset

GitBox Fri, 14 Oct 2022 17:44:46 -0700


eyevz opened a new issue, #14426:
URL: https://github.com/apache/arrow/issues/14426


   I would like to create a dataset over a number of CSV files, specify the 
schema for both the files and the partitioning, and have the dataset infer the 
partition dictionary.
   
   Is there anything obviously wrong with what I'm doing below?
   
   This approach _almost_ works:
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   With the above, `dataset.partitioning.dictionaries` is appropriately 
populated. However, I'm not happy with the inferred schema of the CSV files.
   
   If I specify the dataset schema as below, it breaks the partition 
dictionary inference:
   ```python
   ds_schema = pa.schema([
       pa.field('csv_field', pa.int8()),
       pa.field('partition_label', pa.string()),
   ])
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       schema=ds_schema,
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   At this point the schema for the dataset is what I want, but 
`dataset.partitioning.dictionaries` is `[None]`.
   
   If I attempt to specify that `partition_label` is a dictionary field in the 
dataset schema, as in the below...
   ```python
   ds_schema = pa.schema([
       pa.field('csv_field', pa.int8()),
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       schema=ds_schema,
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   ... then I get an `ArrowInvalid` error indicating that I have not provided a 
dictionary for field `partition_label`.
   
   Any suggestions for how I can specify the schema of a partitioned dataset 
over a large number of CSV files and have the dataset infer the partition 
dictionary?


