[ 
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14535060#comment-14535060
 ] 

Ashish K Singh commented on PARQUET-251:
----------------------------------------

[~alexlevenson] I agree with you that we will have to watch out for a 
performance regression. However, having ImmutableBinary and MutableBinary 
would still require clients to choose the right version, and that remains 
true even if we just have getBytes() return a copy. 

In its current state, getBytes() behaves differently across implementations: 
some return a copy, while others return the backing array. I believe it would 
be best to fix this and give getBytes() a single, well-defined contract. Once 
we know for sure that getBytes() always returns a copy, we should fix its 
usages. For example, AvroIndexedRecordConverter.FieldBytesConverter calls 
getBytes() and then creates a new buffer from those bytes, so a redundant 
copy is definitely happening there. However, I think addressing that as part 
of this JIRA would be scope creep. We should create a separate JIRA with a 
clear intent. Would you agree?
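To make the trade-off concrete, here is a minimal sketch (not actual parquet-mr code; SafeBinary is a hypothetical stand-in for parquet.io.api.Binary) of the copy-on-read contract being proposed, including the redundant second copy that a caller like FieldBytesConverter would incur:

```java
import java.util.Arrays;

public class GetBytesSketch {
    // Hypothetical stand-in for a Binary whose getBytes() always copies.
    static final class SafeBinary {
        private final byte[] backing;
        SafeBinary(byte[] bytes) { this.backing = bytes; }
        // Always returns a fresh copy, so callers never observe later
        // mutation of the backing array.
        byte[] getBytes() { return Arrays.copyOf(backing, backing.length); }
    }

    public static void main(String[] args) {
        byte[] reused = {1, 2, 3};
        SafeBinary binary = new SafeBinary(reused);

        byte[] snapshot = binary.getBytes();   // first copy (the contract)
        // A caller that re-wraps the result pays for a redundant second copy:
        byte[] rewrapped = Arrays.copyOf(snapshot, snapshot.length);

        reused[0] = 9; // mutate the reused buffer after the fact
        // The snapshot is unaffected: getBytes() detached it from the backing array.
        System.out.println(Arrays.toString(snapshot));  // [1, 2, 3]
        System.out.println(Arrays.toString(rewrapped)); // [1, 2, 3]
    }
}
```

The performance concern above is exactly this: every getBytes() call pays one array copy, and call sites that already copy would then copy twice until they are cleaned up.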

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Assignee: Ashish K Singh
>            Priority: Blocker
>
> I think it is common practice, when writing table data to a Parquet file, 
> to reuse the same object across rows; if a column is a byte[] of fixed 
> length, the byte[] is reused as well. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the 
> row groups created by a single task have the same max & min binary value, 
> namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min as 
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for my 
> byte[], the real content of max & min always points to the reused byte[], 
> and therefore to the latest row's content.
> Does Parquet declare anywhere that the user shouldn't reuse a byte[] for 
> the Binary type? If it doesn't, I think it's a bug, and it can be 
> reproduced via [Spark SQL's RowWriteSupport 
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
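The aliasing failure described in the issue can be sketched as follows (hypothetical stand-ins: WrappingBinary for a Binary that merely aliases the caller's array, Stats for a statistics object that keeps references rather than copies, as BinaryStatistics does here):

```java
import java.util.Arrays;

public class ReuseBugSketch {
    // Hypothetical stand-in for a Binary that wraps (aliases) the caller's array.
    static final class WrappingBinary {
        final byte[] backing;
        WrappingBinary(byte[] bytes) { this.backing = bytes; }
    }

    // Hypothetical stand-in for BinaryStatistics: stores a reference, not a copy.
    static final class Stats {
        WrappingBinary max;
        void update(WrappingBinary b) {
            if (max == null || compare(b.backing, max.backing) > 0) max = b;
        }
        // Unsigned-agnostic lexicographic compare, enough for this sketch.
        static int compare(byte[] a, byte[] b) {
            for (int i = 0; i < Math.min(a.length, b.length); i++) {
                int c = Byte.compare(a[i], b[i]);
                if (c != 0) return c;
            }
            return Integer.compare(a.length, b.length);
        }
    }

    public static void main(String[] args) {
        byte[] reused = new byte[1];
        Stats stats = new Stats();
        for (byte row = 5; row >= 1; row--) { // values arrive in descending order
            reused[0] = row;                  // the writer reuses the same byte[]
            stats.update(new WrappingBinary(reused));
        }
        // The true max is 5, but max aliases the reused array, so it reports
        // whatever the LAST row wrote:
        System.out.println(Arrays.toString(stats.max.backing)); // [1]
    }
}
```

Because every stored Binary points at the same reused array, min and max both degrade to the last row's content, matching the symptom reported above.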



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
