[ https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822085#comment-16822085 ]

Matthew Rocklin commented on ARROW-4139:
----------------------------------------

I don't have strong thoughts about the API.  I mostly care that downstream 
projects don't have to special-case a variety of types.  I expect that such 
special casing would be quite brittle and break over time.  Ideally we wouldn't 
be special casing these things in Arrow either.  

My guess is that there is already code somewhere in Arrow that knows how to 
convert these values in a consistent way.  Hopefully that same code would be 
engaged for statistics as well, so that we could be confident things 
wouldn't drift in the future.
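
For illustration, here is a minimal sketch of the kind of special casing downstream projects are forced into today. The `decode_stat` helper is hypothetical, not an Arrow or PyArrow API; the "UTF8" name follows the Parquet format's ConvertedType naming:

{code:python}
# Hypothetical helper, NOT part of Arrow/PyArrow: decode a raw Parquet
# statistics value (min/max) according to the column's ConvertedType.
def decode_stat(value, converted_type):
    # A BYTE_ARRAY column annotated UTF8 stores its min/max as raw
    # bytes; decode those to unicode text.
    if converted_type == "UTF8" and isinstance(value, bytes):
        return value.decode("utf-8")
    # Any other column's statistics pass through unchanged.
    return value

print(decode_stat(b"a", "UTF8"))   # prints: a  (unicode, not b'a')
{code}

Ideally Arrow would apply exactly this sort of conversion internally, using the same logical-type machinery it already uses for column data, rather than every downstream project re-implementing it.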


> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is 
> set
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-4139
>                 URL: https://issues.apache.org/jira/browse/ARROW-4139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Matthew Rocklin
>            Priority: Minor
>              Labels: parquet, python
>             Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again, I find 
> that the statistics of text columns are stored as byte arrays rather than as 
> unicode text. 
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding 
> of how best to manage statistics.  (I'd be quite happy to learn that it was 
> the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> import pyarrow.parquet as pq
>
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
>
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> md = piece.get_metadata(pq.ParquetFile)  # file-level Parquet metadata
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
>   file_offset: 63
>   file_path: 
>   physical_type: BYTE_ARRAY
>   num_values: 1
>   path_in_schema: x
>   is_stats_set: True
>   statistics:
>     <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
>       has_min_max: True
>       min: b'a'
>       max: b'a'
>       null_count: 0
>       distinct_count: 0
>       num_values: 1
>       physical_type: BYTE_ARRAY
>   compression: SNAPPY
>   encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
>   has_dictionary_page: True
>   dictionary_page_offset: 4
>   data_page_offset: 25
>   total_compressed_size: 59
>   total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
> My guess is that we would want to store a logical type in the statistics like 
> UNICODE, though I don't have enough experience with Parquet data types to 
> know if this is a good idea or possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)