[ https://issues.apache.org/jira/browse/ORC-222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved ORC-222.
-------------------------------
    Resolution: Not A Problem

The comparisons are being done over the UTF-8 bytes. Dain was looking at the 
old code and missed the change that had fixed this.
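
For readers unfamiliar with why byte-wise comparison matters, here is a minimal
sketch. It is not the ORC or Text implementation; the class and method names
(Utf8ByteOrder, compareUtf8) are made up for illustration. It compares two
strings by their UTF-8 bytes, treated as unsigned, and shows a case where that
order disagrees with java.lang.String.compareTo, which compares UTF-16 code
units.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ORC: orders strings by unsigned UTF-8 bytes.
class Utf8ByteOrder {
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so bytes >= 0x80 sort after ASCII bytes.
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal up to the shorter length: the shorter value sorts first.
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";           // U+FF5E, UTF-8 bytes EF BD 9E
        String astral = "\uD83D\uDE00";  // U+1F600, UTF-8 bytes F0 9F 98 80
        System.out.println("UTF-8 byte order: " + Integer.signum(compareUtf8(bmp, astral)));
        System.out.println("String.compareTo: " + Integer.signum(bmp.compareTo(astral)));
    }
}
```

The two orders differ here because the supplementary character is stored in a
String as surrogates starting at 0xD83D, which is below 0xFF5E, while its UTF-8
encoding starts with 0xF0, which is above 0xEF.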

> StringStatisticsImpl munges min/max during write
> ------------------------------------------------
>
>                 Key: ORC-222
>                 URL: https://issues.apache.org/jira/browse/ORC-222
>             Project: ORC
>          Issue Type: Bug
>          Components: encoding
>            Reporter: Dain Sundstrom
>            Priority: Critical
>
> String statistics are collected using Text, which compares raw bytes that 
> are assumed to be UTF-8.  When the input contains invalid UTF-8 sequences 
> and one of these values is the min or max, the writer converts it to a 
> java.lang.String, which replaces each invalid UTF-8 sequence with the 
> replacement character (U+FFFD).  This conversion happens here:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L611
> To work around this issue, the writer should use the `setMinimumBytes` 
> Protocol Buffers API instead.
> The same issue exists during read, where the bytes are round-tripped through 
> java.lang.String.  The read code is here:
> https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L528
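
The round trip described in the report can be reproduced with the JDK alone.
The sketch below is not taken from ColumnStatisticsImpl; the class and method
names (InvalidUtf8RoundTrip, roundTrip) are hypothetical. It decodes a byte
array containing an invalid UTF-8 byte into a java.lang.String and encodes it
again, which is enough to show that the original bytes are not preserved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of ORC.
class InvalidUtf8RoundTrip {
    // Decode as UTF-8 and re-encode, as happens when a value passes
    // through java.lang.String.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {'a', (byte) 0xFF, 'b'};
        byte[] result = roundTrip(invalid);
        System.out.println("before: " + Arrays.toString(invalid));
        System.out.println("after:  " + Arrays.toString(result));
        System.out.println("preserved: " + Arrays.equals(invalid, result));
    }
}
```

The decoder substitutes U+FFFD for the invalid byte, and U+FFFD encodes to the
three bytes EF BF BD, so the value that comes back is longer than, and
different from, the value that went in.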



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)