Matthew Rocklin created ARROW-4139:
--------------------------------------
Summary: Parquet Statistics on unicode text files have byte array type
Key: ARROW-4139
URL: https://issues.apache.org/jira/browse/ARROW-4139
Project: Apache Arrow
Issue Type: Bug
Reporter: Matthew Rocklin
When writing Pandas data to Parquet format and reading it back again, I find
that the statistics of text columns are stored as byte arrays rather than as
unicode text.
I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding of
how best to manage statistics. (I'd be quite happy to learn that it was the
latter).
Here is a minimal example:
{code:python}
import pandas as pd
import pyarrow.parquet as pq

# Write a one-row text column to Parquet
df = pd.DataFrame({'x': ['a']})
df.to_parquet('df.parquet')

# Read back the column chunk metadata for column 'x'
pf = pq.ParquetDataset('df.parquet')
piece = pf.pieces[0]
md = piece.get_metadata(pq.ParquetFile)
rg = md.row_group(0)
c = rg.column(0)
>>> c
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
file_offset: 63
file_path:
physical_type: BYTE_ARRAY
num_values: 1
path_in_schema: x
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
has_min_max: True
min: b'a'
max: b'a'
null_count: 0
distinct_count: 0
num_values: 1
physical_type: BYTE_ARRAY
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 25
total_compressed_size: 59
total_uncompressed_size: 55
>>> type(c.statistics.min)
bytes
{code}
My guess is that we would want to store a logical type in the statistics like
UNICODE, though I don't have enough experience with Parquet data types to know
if this is a good idea or possible.
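In the meantime, a caller-side workaround might look like the sketch below. It assumes the caller already knows that the column holds UTF-8 text; `decode_stat` and its `is_text` flag are hypothetical names for illustration and not part of PyArrow.

```python
def decode_stat(value, is_text=True, encoding="utf-8"):
    """Return a str for a byte-valued min/max statistic of a text column.

    Hypothetical helper, not a PyArrow API. Values that are not bytes
    (ints, floats, None) are returned unchanged.
    """
    if is_text and isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The min/max values reported in the metadata dump above
print(decode_stat(b'a'))                  # text column: decoded to str
print(decode_stat(b'a', is_text=False))   # binary column: left as bytes
```

This only papers over the problem on the reading side. It does not tell us whether a given BYTE_ARRAY column is text in the first place, which is why having the type information available in the statistics seems preferable.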