[ https://issues.apache.org/jira/browse/PARQUET-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated PARQUET-1547:
----------------------------------
    Summary: [C++] Detect parquet-mr style dictionary_page  (was: Detect parquet-mr style dictionary_page)

> [C++] Detect parquet-mr style dictionary_page
> ---------------------------------------------
>
>                 Key: PARQUET-1547
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1547
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: colin fang
>            Priority: Minor
>
> parquet-mr incorrectly writes (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset)
> So whenever parquet-cpp (pyarrow) reads the file, it sets `has_dictionary_page: False` and `dictionary_page_offset: None`
> {code}
> row group 0
> --------------------------------------------------------------------------------
> x: DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
> y: BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
> x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
>
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
> {code}
> {code}
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
>   file_offset: 4
>   file_path:
>   physical_type: DOUBLE
>   num_values: 70000
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
>       has_min_max: True
>       min: 1.0
>       max: 5.0
>       null_count: 10000
>       distinct_count: 0
>       num_values: 60000
>       physical_type: DOUBLE
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
>   has_dictionary_page: False
>   dictionary_page_offset: None
>   data_page_offset: 4
>   total_compressed_size: 1632
>   total_uncompressed_size: 31635
> {code}
> Is parquet-cpp still able to use the dictionary in this case?
> It would be nice if parquet-cpp can recognize the parquet-mr issue and set `has_dictionary_page` to True.
> https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
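
For illustration only, the detection requested in the issue could be sketched roughly as below. This is a heuristic written under stated assumptions, not the parquet-cpp fix: `ChunkOffsets`, `likely_has_dictionary_page` and `first_page_offset` are hypothetical names, and the rule assumed here is that a missing or zero `dictionary_page_offset` combined with a dictionary encoding in `encodings` means the chunk starts with a dictionary page at `data_page_offset`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Encodings that imply a dictionary page was written for the column chunk.
DICTIONARY_ENCODINGS = frozenset({"PLAIN_DICTIONARY", "RLE_DICTIONARY"})


@dataclass(frozen=True)
class ChunkOffsets:
    """Hypothetical stand-in for the column chunk fields quoted in the report."""
    dictionary_page_offset: Optional[int]
    data_page_offset: int
    encodings: Tuple[str, ...]


def likely_has_dictionary_page(chunk: ChunkOffsets) -> bool:
    """Heuristic: trust a non-zero dictionary offset, else look at the encodings."""
    if chunk.dictionary_page_offset:  # set and non-zero: metadata is usable as-is
        return True
    # Assumed parquet-mr style: offset is 0/None but a dictionary encoding is listed.
    return any(enc in DICTIONARY_ENCODINGS for enc in chunk.encodings)


def first_page_offset(chunk: ChunkOffsets) -> int:
    """Offset at which reading of the chunk should start under this heuristic."""
    if chunk.dictionary_page_offset:
        return chunk.dictionary_page_offset
    # In the parquet-mr layout described above, the dictionary page sits here.
    return chunk.data_page_offset


# Values for column x taken from the pyarrow output quoted in the report.
x = ChunkOffsets(None, 4, ("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"))
print(likely_has_dictionary_page(x), first_page_offset(x))  # True 4
```

A chunk that lists only non-dictionary encodings (for example `("PLAIN", "RLE")`) is reported as having no dictionary page, so files written without dictionaries are not affected by the fallback.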