[
https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385404#comment-16385404
]
ASF GitHub Bot commented on ARROW-1982:
---------------------------------------
wesm opened a new pull request #1698: ARROW-1982: [Python] Coerce Parquet
statistics as bytes to more useful Python scalar types
URL: https://github.com/apache/arrow/pull/1698
I also changed BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY statistics to return bytes,
since decoding from binary to UTF-8 unicode didn't seem correct to me as the
default behavior.
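As a rough illustration of why bytes are the safer default: a binary column can
hold arbitrary bytes, and not every byte string is valid UTF-8. The sketch below
uses only the Python standard library; the helper name `try_decode_utf8` is
hypothetical and is not part of pyarrow.

```python
def try_decode_utf8(raw: bytes):
    """Return (text, None) if raw is valid UTF-8, else (None, error)."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


# A min/max value that happens to be text decodes cleanly.
text, text_err = try_decode_utf8(b"abc")

# An arbitrary binary value does not: 0xff is never a valid UTF-8 start byte.
blob, blob_err = try_decode_utf8(b"\xff\xfe\x00")
```

Returning the raw bytes leaves the choice of decoding to the caller, who knows
whether the column actually stores text.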
> [Python] Return parquet statistics min/max as values instead of strings
> -----------------------------------------------------------------------
>
> Key: ARROW-1982
> URL: https://issues.apache.org/jira/browse/ARROW-1982
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Jim Crist
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently `min` and `max` column statistics are returned as formatted strings
> of the _physical type_. This makes using them in Python a bit tricky, as the
> strings need to be parsed as the proper _logical type_. Observe:
> {code}
> In [20]: import pandas as pd
> In [21]: df = pd.DataFrame({'a': [1, 2, 3],
>     ...:                    'b': ['a', 'b', 'c'],
>     ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
>     ...:
> In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
> In [23]: from pyarrow import parquet as pq
> In [24]: f = pq.ParquetFile('temp.parquet')
> In [25]: rg = f.metadata.row_group(0)
> In [26]: rg.column(0).statistics.min # string instead of integer
> Out[26]: '1'
> In [27]: rg.column(1).statistics.min # weird space added after value due to formatter
> Out[27]: 'a '
> In [28]: rg.column(2).statistics.min # formatted as physical type (int) instead of logical (datetime)
> Out[28]: '662688000000'
> {code}
> Since the type information is known, it should be possible to convert these
> to arrow values instead of strings.
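To make the "tricky" parsing concrete, here is a minimal sketch of what a caller
has to do by hand with the strings shown above. It uses only the standard
library. The helper `parse_stat` and its type labels (`"int64"`, `"string"`,
`"timestamp[ms]"`) are hypothetical names for illustration, not pyarrow or
Parquet identifiers, and the millisecond unit is an assumption inferred from
`'662688000000'` matching 1991-01-01.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def parse_stat(raw: str, logical_type: str):
    """Hypothetical helper: parse a formatted statistics string by logical type."""
    if logical_type == "int64":
        return int(raw)
    if logical_type == "string":
        # Assumption: the single trailing space comes from the formatter.
        # This cannot tell it apart from a value that really ends in a space.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp[ms]":
        # Assumption: the physical integer counts milliseconds since the epoch.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError(f"unsupported logical type: {logical_type!r}")


min_a = parse_stat("1", "int64")
min_b = parse_stat("a ", "string")
min_c = parse_stat("662688000000", "timestamp[ms]")
```

The string case shows why this workaround is fragile: once the value has been
formatted, the original bytes cannot be recovered reliably, which is the
motivation for returning typed values directly.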