[ https://issues.apache.org/jira/browse/ARROW-9573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169966#comment-17169966 ]
Joris Van den Bossche edited comment on ARROW-9573 at 8/3/20, 12:07 PM:
------------------------------------------------------------------------

bq. we *do* ignore directories beginning with "." or "_" in legacy {{pq.read_table}} but evidently not when the directory parses as a hive partition expression

Indeed, that's done here: https://github.com/apache/arrow/blob/9fd11c4e64d05ccb3a11ae891af7e57c815b9379/python/pyarrow/parquet.py#L1022-L1024. So a "private" directory is only skipped if its name contains no "=". That logic is hive-specific and thus indeed seems difficult to generalize in the datasets API.

bq. One partial solution I can think of is to add the {{ignore_prefixes}} option to {{read_table}}

That could indeed be a way to give the user some more control (and might be useful to expose anyway). It still won't solve a default roundtrip, of course, and it also won't fix the case where the user specifies the partitioning names explicitly. Both cases would still require a user action (specifying this keyword), but I'm not sure it is possible to solve this without such a user action.


> [Python] Parquet doesn't load when partitioned column starts with '_'
> ---------------------------------------------------------------------
>
>                 Key: ARROW-9573
>                 URL: https://issues.apache.org/jira/browse/ARROW-9573
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.0
>            Reporter: Tonnam Balankura
>            Assignee: Ben Kietzman
>            Priority: Major
>
> When loading parquet with a partitioned column that starts with an
> underscore '_', nothing is loaded. No exceptions are raised either. Loading
> this parquet worked for me in pyarrow 0.17.1, but it no longer works in
> pyarrow 1.0.0.
> On the other hand, loading parquet with a partitioned column starting with
> '_' is possible by using the `use_legacy_dataset` option. Also, when the
> column that starts with an underscore is not a partitioned column, loading
> parquet seems to work as expected.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import pandas as pd
> >>> df1 = pd.DataFrame(data={'_COL_1': [1, 2], 'COL_2': [3, 4], 'COL_3': [5, 6]})
> >>> table1 = pa.Table.from_pandas(df1)
> >>> pq.write_to_dataset(table1, partition_cols=['_COL_1', 'COL_2'], root_path='test_parquet1')
> >>> df_pq1 = pq.read_table('test_parquet1')
> >>> df_pq1
> pyarrow.Table
> >>> len(df_pq1)
> 0
> >>> df_pq1_legacy = pq.read_table('test_parquet1', use_legacy_dataset=True)
> pyarrow.Table
> COL_3: int64
> _COL_1: dictionary<values=int64, indices=int32, ordered=0>
> COL_2: dictionary<values=int64, indices=int32, ordered=0>
> >>> len(df_pq1_legacy)
> 2
> >>> df2 = pd.DataFrame(data={'COL_1': [1, 2], 'COL_2': [3, 4], '_COL_3': [5, 6]})
> >>> table2 = pa.Table.from_pandas(df2)
> >>> pq.write_to_dataset(table2, partition_cols=['COL_1', 'COL_2'], root_path='test_parquet2')
> >>> df_pq2 = pq.read_table('test_parquet2')
> >>> df_pq2
> pyarrow.Table
> _COL_3: int64
> COL_1: int32
> COL_2: int32
> >>> len(df_pq2)
> 2
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
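
For illustration only, the legacy skip rule described in the comment (a "private" directory is skipped only if its name has no "=") can be sketched in plain Python next to a prefix-only rule. Both helper names below are hypothetical and are not pyarrow functions. The prefix-only variant is an assumption about how a generic {{ignore_prefixes}}-style option might behave; the option name comes from the quoted comment and the defaults from the "." and "_" prefixes mentioned there.

```python
import os


def is_private_directory(path):
    """Hypothetical restatement of the legacy rule from the comment:
    skip names starting with "_" or ".", unless the name contains "="
    (that is, unless it looks like a hive partition such as "_COL_1=1").
    """
    tail = os.path.basename(os.path.normpath(path))
    return tail.startswith(("_", ".")) and "=" not in tail


def is_ignored(path, ignore_prefixes=(".", "_")):
    """Hypothetical prefix-only rule (an assumption, not pyarrow's code):
    skip any name that starts with one of the given prefixes, whether or
    not it also parses as a partition expression.
    """
    tail = os.path.basename(os.path.normpath(path))
    return tail.startswith(tuple(ignore_prefixes))


# The partition directory from the report, under each rule.
print(is_private_directory("test_parquet1/_COL_1=1"))          # legacy rule
print(is_ignored("test_parquet1/_COL_1=1"))                    # prefix-only rule
print(is_ignored("test_parquet1/_COL_1=1", ignore_prefixes=["."]))
```

Under this sketch the legacy rule keeps {{_COL_1=1}} while the prefix-only rule drops it, which would match the empty table in the report. Narrowing the prefixes to just "." keeps the directory again, which is the kind of explicit user action the comment refers to.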