[
https://issues.apache.org/jira/browse/ORC-222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Owen O'Malley resolved ORC-222.
-------------------------------
Resolution: Not A Problem
The comparisons are being done over the UTF-8 bytes. Dain was looking at the
old code and missed the change that had fixed this.
> StringStatisticsImpl munges min/max during write
> ------------------------------------------------
>
> Key: ORC-222
> URL: https://issues.apache.org/jira/browse/ORC-222
> Project: ORC
> Issue Type: Bug
> Components: encoding
> Reporter: Dain Sundstrom
> Priority: Critical
>
> String statistics are collected using Text, which compares raw bytes that
> are assumed to be UTF-8. When the input contains invalid UTF-8 sequences
> and these are the min or max value, the writer converts them to a
> java.lang.String, which replaces each invalid UTF-8 sequence with the
> replacement character (U+FFFD). This conversion happens
> here:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L611
> To work around this issue, the writer should use the `setMinimumBytes`
> Protocol Buffers API instead.
> The same issue exists during read, where the bytes are round-tripped through
> java.lang.String. The read code is here:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L528
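
For readers unfamiliar with the behaviour described in the report, the following is a minimal standalone sketch of how decoding invalid UTF-8 into a java.lang.String and re-encoding it changes the bytes. It uses only the JDK and is not taken from the ORC code base; the class name `Utf8RoundTrip` and the helper `roundTrip` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Decode raw bytes as UTF-8 into a String, then encode back to UTF-8.
    // Malformed input is silently replaced with U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', an invalid UTF-8 byte (0xFF), then 'b'.
        byte[] invalid = {(byte) 0x61, (byte) 0xFF, (byte) 0x62};
        byte[] result = roundTrip(invalid);

        System.out.println(Arrays.toString(invalid)); // [97, -1, 98]
        // 0xFF became U+FFFD, which encodes as EF BF BD in UTF-8.
        System.out.println(Arrays.toString(result));  // [97, -17, -65, -67, 98]
    }
}
```

The round-tripped value is longer than the original and sorts differently as raw bytes, which is why a min or max stored this way could disagree with the data actually written.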
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)