tmontes commented on issue #44352:
URL: https://github.com/apache/arrow/issues/44352#issuecomment-2402144910
WORKING VARIATION:
```
import pathlib
import tempfile

import pandas as pd
import pyarrow.dataset as ds

YEAR_COLUMN = '_year'
FILE_COLUMN = '_file'

with tempfile.TemporaryDirectory() as td:
    dataset_path = pathlib.Path(td) / 'dataset'

    # Create a Parquet dataset partitioned by YEAR_COLUMN / FILE_COLUMN.
    pd.DataFrame([
        {'data': 0, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 1, YEAR_COLUMN: 2020, FILE_COLUMN: 'a'},
        {'data': 2, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 4, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 5, YEAR_COLUMN: 2020, FILE_COLUMN: 'b'},
        {'data': 6, YEAR_COLUMN: 2021, FILE_COLUMN: 'b'},
        {'data': 7, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
        {'data': 8, YEAR_COLUMN: 2021, FILE_COLUMN: 'c'},
    ]).to_parquet(
        dataset_path,
        partition_cols=[YEAR_COLUMN, FILE_COLUMN],
        index=False,
    )

    # Get the dataset row count for a given FILE_COLUMN value: 'a' in this case.
    dataset = ds.dataset(
        dataset_path,
        partitioning=ds.partitioning(flavor='hive'),
        # The default ignore_prefixes is ('.', '_'), which would skip the
        # '_year=...' / '_file=...' partition directories; overriding it to
        # ['.'] keeps the underscore-prefixed directories visible.
        ignore_prefixes=['.'],
    )
    row_count_for_file_a = sum(
        batch.num_rows
        for batch in dataset.to_batches(
            columns=[YEAR_COLUMN],
            filter=(ds.field(FILE_COLUMN) == 'a'),
        )
    )
    assert row_count_for_file_a == 2
```
CLOSING
PS: Thanks and sorry for the noise! :-)