[ 
https://issues.apache.org/jira/browse/PARQUET-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1547:
----------------------------------
    Summary: [C++] Detect parquet-mr style dictionary_page  (was: Detect 
parquet-mr style dictionary_page)

> [C++] Detect parquet-mr style dictionary_page
> ---------------------------------------------
>
>                 Key: PARQUET-1547
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1547
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: colin fang
>            Priority: Minor
>
> parquet-mr incorrectly writes (dictionary_page_offset, 
> first_data_page_offset) as (0, dictionary_page_offset).
> As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports 
> `has_dictionary_page: False` and `dictionary_page_offset: None`.
> {code}
> row group 0 
> --------------------------------------------------------------------------------
> x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 
> ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
> y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 
> ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
>     x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
>     
> ----------------------------------------------------------------------------
>     page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY 
> ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
> {code}
> {code}
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
>   file_offset: 4
>   file_path: 
>   physical_type: DOUBLE
>   num_values: 70000
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
>       has_min_max: True
>       min: 1.0
>       max: 5.0
>       null_count: 10000
>       distinct_count: 0
>       num_values: 60000
>       physical_type: DOUBLE
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
>   has_dictionary_page: False
>   dictionary_page_offset: None
>   data_page_offset: 4
>   total_compressed_size: 1632
>   total_uncompressed_size: 31635
> {code}
> Is parquet-cpp still able to use the dictionary in this case?
> It would be nice if parquet-cpp could recognize the parquet-mr issue and set 
> `has_dictionary_page` to True.
> https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/
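
The detection requested above could look roughly like the following sketch. It is only a heuristic under stated assumptions, not how parquet-cpp actually behaves: it assumes that a missing or zero dictionary_page_offset, combined with a dictionary encoding in the encodings list, means the dictionary page sits at data_page_offset. ChunkMeta and likely_dictionary_offset are hypothetical names; the three field names are taken from the pyarrow dump in the description. The only reliable confirmation would be reading the page header at that offset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Encodings that imply a dictionary page exists somewhere in the column chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


@dataclass
class ChunkMeta:
    """Hypothetical stand-in for the column chunk fields shown in the dump."""
    encodings: Tuple[str, ...]
    dictionary_page_offset: Optional[int]
    data_page_offset: int


def likely_dictionary_offset(meta) -> Optional[int]:
    """Guess where the dictionary page starts; None if there is probably none.

    Works on any object exposing the three attributes used here.
    """
    # A non-zero offset was written properly, so trust it.
    if meta.dictionary_page_offset:
        return meta.dictionary_page_offset
    # Assumption: parquet-mr style metadata, where the dictionary page
    # offset ended up in data_page_offset instead.
    if DICTIONARY_ENCODINGS & set(meta.encodings):
        return meta.data_page_offset
    return None


# Values for column x, copied from the dump in the description.
x = ChunkMeta(("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"), None, 4)
print(likely_dictionary_offset(x))  # 4
```

For column x this yields 4, which matches the FPO:4 value in the parquet-tools output. A chunk whose encodings list contains no dictionary encoding yields None.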



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
