[
https://issues.apache.org/jira/browse/ARROW-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802291#comment-16802291
]
Matthew Rocklin commented on ARROW-4139:
----------------------------------------
Perhaps relatedly, it would be useful for statistics of other column types to
be converted from their physical representation as well. Here is another issue
I ran into with timestamps:
{code}
In [1]: import pandas as pd
In [2]: df = pd.util.testing.makeTimeDataFrame()
In [3]: df.head()
Out[3]:
A B C D
2000-01-03 1.255856 -1.092558 -1.454595 0.898535
2000-01-04 -1.006590 0.640467 -2.249877 0.068293
2000-01-05 -1.525559 0.567070 1.039230 -0.967301
2000-01-06 -0.773395 -1.565619 0.025786 0.106949
2000-01-07 -0.079000 0.367165 1.746211 -0.097441
In [4]: df.to_parquet('foo.parquet')
In [5]: import pyarrow.parquet as pq
In [6]: p = pq.ParquetDataset('foo.parquet')
In [7]: piece = p.pieces[0]
In [8]: md = piece.get_metadata(open_file_func=lambda fn: open(fn, mode='rb'))
In [9]: rg = md.row_group(0)
In [10]: rg.column(4)
Out[10]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x1203d0670>
file_offset: 1913
file_path:
physical_type: INT64
num_values: 30
path_in_schema: __index_level_0__
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x11f821af8>
has_min_max: True
min: 946857600000
max: 950227200000
null_count: 0
distinct_count: 0
num_values: 30
physical_type: INT64
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 1635
data_page_offset: 1842
total_compressed_size: 278
total_uncompressed_size: 325
In [11]: rg.column(4).statistics.min  # I want this to be some sort of timestamp object
Out[11]: 946857600000
{code}
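For now the raw INT64 can be converted by hand. A minimal sketch, assuming the
values are milliseconds since the Unix epoch (as the output above suggests):
{code:python}
import pandas as pd

# min/max for the timestamp column come back as raw INT64 values.
# Assuming millisecond resolution, pandas can reconstruct the timestamp:
raw_min = 946857600000
ts = pd.Timestamp(raw_min, unit='ms')
# ts is Timestamp('2000-01-03 00:00:00'), matching the first index value
{code}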
Somewhat unrelatedly, there is a lot of boilerplate to get down to that
information. If there are nicer ways to get at statistics, I'd be interested in
hearing about them.
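One shorter path I've found is to open the file with {{ParquetFile}} directly,
which skips the dataset/piece indirection. A sketch (the {{column_stats}}
helper below is hypothetical, not part of the pyarrow API):
{code:python}
import pyarrow.parquet as pq

def column_stats(path):
    """Collect (min, max, null_count) per column across all row groups.

    Hypothetical convenience helper; values are still the raw physical
    representation (e.g. INT64 for timestamps, bytes for UTF8 columns).
    """
    md = pq.ParquetFile(path).metadata
    stats = {}
    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for col in range(row_group.num_columns):
            c = row_group.column(col)
            if c.is_stats_set and c.statistics.has_min_max:
                s = c.statistics
                stats.setdefault(c.path_in_schema, []).append(
                    (s.min, s.max, s.null_count))
    return stats
{code}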
> [Python] Cast Parquet column statistics to unicode if UTF8 ConvertedType is
> set
> -------------------------------------------------------------------------------
>
> Key: ARROW-4139
> URL: https://issues.apache.org/jira/browse/ARROW-4139
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Matthew Rocklin
> Priority: Minor
> Labels: parquet, python
> Fix For: 0.14.0
>
>
> When writing Pandas data to Parquet format and reading it back again, I find
> that the statistics of text columns are stored as byte arrays rather than as
> unicode text.
> I'm not sure if this is a bug in Arrow, PyArrow, or just in my understanding
> of how best to manage statistics. (I'd be quite happy to learn that it was
> the latter).
> Here is a minimal example
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'x': ['a']})
> df.to_parquet('df.parquet')
> import pyarrow.parquet as pq
> pf = pq.ParquetDataset('df.parquet')
> piece = pf.pieces[0]
> md = piece.get_metadata(pq.ParquetFile)
> rg = md.row_group(0)
> c = rg.column(0)
> >>> c
> <pyarrow._parquet.ColumnChunkMetaData object at 0x7fd1a377c238>
> file_offset: 63
> file_path:
> physical_type: BYTE_ARRAY
> num_values: 1
> path_in_schema: x
> is_stats_set: True
> statistics:
> <pyarrow._parquet.RowGroupStatistics object at 0x7fd1a37d4418>
> has_min_max: True
> min: b'a'
> max: b'a'
> null_count: 0
> distinct_count: 0
> num_values: 1
> physical_type: BYTE_ARRAY
> compression: SNAPPY
> encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
> has_dictionary_page: True
> dictionary_page_offset: 4
> data_page_offset: 25
> total_compressed_size: 59
> total_uncompressed_size: 55
> >>> type(c.statistics.min)
> bytes
> {code}
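> Until the cast happens in Arrow itself, a workaround sketch: when the
> column's ConvertedType is UTF8, the raw bytes are valid UTF-8 and can be
> decoded by hand (the {{decode_stat}} helper below is hypothetical, not part
> of the pyarrow API):
> {code:python}
> def decode_stat(value):
>     # Statistics for UTF8 BYTE_ARRAY columns are currently surfaced as
>     # Python bytes; decode manually until Arrow performs the cast.
>     return value.decode('utf-8') if isinstance(value, bytes) else value
>
> decode_stat(b'a')  # 'a'
> {code}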
> My guess is that we would want to store a logical type in the statistics like
> UNICODE, though I don't have enough experience with Parquet data types to
> know if this is a good idea or possible.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)