[jira] [Comment Edited] (ARROW-9573) [Python] Parquet doesn't load when partitioned column starts with '_'

Joris Van den Bossche (Jira) Mon, 03 Aug 2020 05:08:50 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169966#comment-17169966
 ]


Joris Van den Bossche edited comment on ARROW-9573 at 8/3/20, 12:07 PM:
------------------------------------------------------------------------

bq. we *do* ignore directories beginning with "." or "_" in legacy 
{{pq.read_table}} but evidently not when the directory parses as a hive 
partition expression

Indeed, that's done here:  
https://github.com/apache/arrow/blob/9fd11c4e64d05ccb3a11ae891af7e57c815b9379/python/pyarrow/parquet.py#L1022-L1024.
 So a "private" directory is only skipped if it has no "=" in its name. That 
logic is hive-specific and thus indeed seems difficult to generalize in the 
datasets API.
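
To make the rule concrete, here is a minimal sketch of the check as described in this comment. It is not copied from {{parquet.py}}; the helper name {{is_private_directory}} and the example paths are assumptions made for illustration only.

```python
import posixpath


def is_private_directory(path):
    """Hypothetical helper: True if the legacy reader would skip this directory.

    A directory counts as "private" when its last path segment starts with
    '_' or '.', unless the segment contains '=' (i.e. it looks like a hive
    partition such as '_COL_1=1').
    """
    tail = posixpath.basename(path.rstrip('/'))
    return tail.startswith(('_', '.')) and '=' not in tail


# Illustrative paths only; none of these come from a real dataset.
for name in ['data/_temporary', 'data/.hidden', 'data/_COL_1=1', 'data/COL_2=3']:
    print(name, is_private_directory(name))
```

The "=" test is what keeps a partition directory such as {{_COL_1=1}} visible, and it only makes sense when directory names encode {{key=value}} pairs.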

bq. One partial solution I can think of is to add the {{ignore_prefixes}} 
option to {{read_table}} 

That could indeed be a way to give the user some more control (and might be 
useful to expose anyway). 
That still won't solve a default roundtrip of course, and also won't fix the 
case where the user specifies the partitioning names explicitly. Both would 
still require a user action (to specify this keyword), but I am not sure it is 
possible to solve this without such user action.
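
As a rough illustration of what such a keyword would control, the sketch below filters a list of relative paths by prefix. It is a pure-Python stand-in, not the datasets implementation: the function {{select_files}}, its default of {{('.', '_')}}, and the assumption that every path segment is checked are all hypothetical.

```python
def select_files(paths, ignore_prefixes=('.', '_')):
    """Hypothetical discovery filter.

    Keeps a path only if none of its segments starts with one of the
    ignored prefixes.
    """
    prefixes = tuple(ignore_prefixes)
    kept = []
    for path in paths:
        segments = path.split('/')
        if not any(segment.startswith(prefixes) for segment in segments):
            kept.append(path)
    return kept


# Layout modelled on the dataset in the issue description; file names are made up.
paths = [
    '_COL_1=1/COL_2=3/part-0.parquet',
    '_COL_1=2/COL_2=4/part-0.parquet',
    '_SUCCESS',
]

# With the assumed defaults, the '_COL_1=...' directories are dropped too.
default_selection = select_files(paths)

# Ignoring only '.' keeps the partition directories, but also the marker file.
relaxed_selection = select_files(paths, ignore_prefixes=['.'])

print(default_selection)
print(relaxed_selection)
```

Under these assumptions the relaxed call also lets {{_SUCCESS}} through, which is why the choice of prefixes has to be left to the user.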





> [Python] Parquet doesn't load when partitioned column starts with '_'
> ---------------------------------------------------------------------
>
>                 Key: ARROW-9573
>                 URL: https://issues.apache.org/jira/browse/ARROW-9573
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Tonnam Balankura
>            Assignee: Ben Kietzman
>            Priority: Major
>
> When loading a parquet dataset with a partitioned column that starts with an 
> underscore '_', nothing is loaded. No exceptions are raised either. Loading 
> this dataset worked for me in pyarrow 0.17.1, but no longer works in 
> pyarrow 1.0.0.
> On the other hand, loading a parquet dataset with a partitioned column 
> starting with '_' is possible by using the `use_legacy_dataset` option. Also, 
> when the column that starts with an underscore is not a partitioned column, 
> loading the dataset seems to work as expected.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import pandas as pd
> >>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
> >>> table1 = pa.Table.from_pandas(df1)
> >>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
> >>> df_pq1 = pq.read_table('test_parquet1')
> >>> df_pq1
> pyarrow.Table
> >>> len(df_pq1)
> 0
> >>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
> >>> df_pq1_legacy
> pyarrow.Table
> COL_3: int64
> _COL_1: dictionary<values=int64, indices=int32, ordered=0>
> COL_2: dictionary<values=int64, indices=int32, ordered=0>
> >>> len(df_pq1_legacy)
> 2
> >>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
> >>> table2 = pa.Table.from_pandas(df2)
> >>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
> >>> df_pq2 = pq.read_table('test_parquet2')
> >>> df_pq2
> pyarrow.Table
> _COL_3: int64
> COL_1: int32
> COL_2: int32
> >>> len(df_pq2)
> 2
> {code}



