[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822085#comment-16822085 ]
Matthew Rocklin commented on ARROW-4139:
----------------------------------------

I don't have strong thoughts about the API. I mostly care that downstream projects don't have to special-case a variety of types. I expect that special casing will be quite brittle and will break over time. Ideally we wouldn't be special-casing these things in Arrow either. My guess is that there is already code somewhere in Arrow that knows how to convert these values in a consistent way. Hopefully that same code would be engaged for statistics as well; that way we would have confidence that things wouldn't drift in the future.

> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is
> set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: parquet, python
>             Fix For: 0.14.0
>
> When writing Pandas data to Parquet format and reading it back again, I find
> that the statistics of text columns are stored as byte arrays rather than as
> unicode text.
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding
> of how best to manage statistics. (I'd be quite happy to learn that it was
> the latter.)
> Here is a minimal example:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
>
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
>
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
>   file_offset: 63
>   file_path:
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
>       has_min_max: True
>       min: b'a'
>       max: b'a'
>       null_count: 0
>       distinct_count: 0
>       num_values: 1
>       physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
>
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like
> UNICODE, though I don't have enough experience with Parquet data types to
> know if this is a good idea or possible.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
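To illustrate the kind of downstream special casing the comment argues against, here is a minimal sketch of the conversion a consumer currently has to do by hand: decode raw BYTE_ARRAY statistics to unicode when the Parquet ConvertedType says the bytes are UTF-8 text. The function name and the string-valued type tag are hypothetical, not pyarrow API.

{code:python}
def decode_statistic(value, converted_type):
    """Sketch of downstream special casing (hypothetical helper, not
    pyarrow API): return `value` decoded to `str` when the column's
    Parquet ConvertedType is UTF8; pass all other values through."""
    if isinstance(value, bytes) and converted_type == "UTF8":
        return value.decode("utf-8")
    return value

# A UTF8-typed BYTE_ARRAY statistic becomes unicode text:
print(decode_statistic(b"a", "UTF8"))   # -> a
# Statistics of other types pass through unchanged:
print(decode_statistic(42, "INT32"))    # -> 42
{code}

The point of the comment is that this branching should live in Arrow's existing type-conversion code rather than be reimplemented (and allowed to drift) in every downstream project.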