[https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550807#comment-14550807]

Ryan Blue commented on PARQUET-251:
-----------------------------------

I agree with your breakdown of the dependencies, but I don't think 
Mutable/Immutable is the right distinction here. Calling these classes 
immutable violates the contract of 
[{{Immutable}}|https://jsr-305.googlecode.com/svn/trunk/javadoc/javax/annotation/concurrent/Immutable.html]:
 Immutable means ". . . that methods do not publish references to any internal 
state which is mutable by implementation even if not by design".

We certainly intend for those bytes not to be changed, but there is no way to 
enforce that. With the proposed {{ImmutableBinary}} API, it would be easy for 
someone to accidentally mutate the backing bytes through entirely reasonable 
operations: {{copy}} appears to return a copy of the bytes, so the result 
seems safe to reuse and update. I think a {{copy}} method should reliably 
copy, or return an object that is immutable.
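To make that hazard concrete, here is a minimal sketch. The {{SketchBinary}} class and its methods are hypothetical, not Parquet's actual API; it only illustrates how a "copy" that shares its backing array can be silently mutated by an entirely reasonable caller:

```java
import java.nio.charset.StandardCharsets;

// Demo: a caller reuses its buffer after taking what looks like a copy.
class CopySemanticsDemo {
    public static void main(String[] args) {
        byte[] buffer = "row1".getBytes(StandardCharsets.UTF_8);
        SketchBinary original = new SketchBinary(buffer);
        SketchBinary shallow = original.shallowCopy(); // shares the buffer
        SketchBinary deep = original.deepCopy();       // duplicates it

        // The caller "reasonably" reuses the buffer for the next row.
        buffer[3] = (byte) '2';

        System.out.println(shallow.asString()); // row2 -- silently mutated
        System.out.println(deep.asString());    // row1 -- safe
    }
}

// Hypothetical Binary-like wrapper that keeps a reference to the
// caller's array rather than copying it.
class SketchBinary {
    private final byte[] bytes; // shared reference, not a copy

    SketchBinary(byte[] bytes) { this.bytes = bytes; }

    // Looks like a copy, but still aliases the caller-visible array.
    SketchBinary shallowCopy() { return new SketchBinary(bytes); }

    // A reliable copy: duplicates the backing array.
    SketchBinary deepCopy() { return new SketchBinary(bytes.clone()); }

    String asString() { return new String(bytes, StandardCharsets.UTF_8); }
}
```

Only the deep copy stays correct once the producer reuses the buffer, which is why {{copy}} should either copy the bytes or hand back a genuinely immutable object.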

The API only needs to reflect the consumer's intent, because the consumer 
doesn't necessarily know where the binary came from. I agree it makes sense to 
have a flag or a separate class to signal the producer's intent when it 
creates the Binary.

I think what Ashish has currently works. We have to assume that the caller 
isn't thinking about the bytes changing in the background, so {{getBytes}} 
making a copy is definitely a requirement. I like {{getBytesUnsafe}} because it 
makes the caller either avoid it or be careful. And I think that is in line 
with the typical use of "unsafe": be careful because you can do something dumb. 
If it were completely unsafe, we wouldn't expose the method.
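A sketch of that safe/unsafe split (the {{SafeBinary}} class here is hypothetical; the real {{parquet.io.api.Binary}} signatures may differ):

```java
import java.util.Arrays;

// Demo: the defensive copy survives buffer reuse; the unsafe alias does not.
class UnsafeAccessDemo {
    public static void main(String[] args) {
        byte[] buffer = {1, 2, 3};
        SafeBinary binary = new SafeBinary(buffer);

        byte[] safe = binary.getBytes();         // defensive copy
        byte[] unsafe = binary.getBytesUnsafe(); // aliased view

        buffer[0] = 9; // producer reuses the buffer for the next value

        System.out.println(safe[0]);   // 1 -- the copy is unaffected
        System.out.println(unsafe[0]); // 9 -- the alias changed underfoot
    }
}

// Hypothetical sketch of the accessor pair discussed above.
class SafeBinary {
    private final byte[] backing;

    SafeBinary(byte[] backing) { this.backing = backing; }

    // Copies, so callers may hold or mutate the result freely.
    byte[] getBytes() { return Arrays.copyOf(backing, backing.length); }

    // No copy: fast, but callers must neither mutate the result nor
    // assume it is stable while the producer still owns the buffer.
    byte[] getBytesUnsafe() { return backing; }
}
```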

> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Assignee: Ashish K Singh
>            Priority: Blocker
>
> I think it is common practice, when inserting table data as a Parquet file, 
> to reuse the same object among rows; if a column is a byte[] of fixed 
> length, the byte[] is also reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
> groups created by a single task have the same max & min binary value, 
> namely the last row's binary content.
> The reason is that BinaryStatistics keeps max & min as parquet.io.api.Binary 
> references; since I use ByteArrayBackedBinary for my byte[], the content of 
> max & min always points to the reused byte[], and therefore to the latest 
> row's content.
> Does Parquet declare somewhere that the user shouldn't reuse byte[] for the 
> Binary type? If it doesn't, I think this is a bug, and it can be reproduced 
> via [Spark SQL's 
> RowWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
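The failure mode in the report above can be reproduced in miniature. The {{SketchStats}} class below is hypothetical, standing in for the reference-holding behavior described for BinaryStatistics; it stores the incoming byte[] reference instead of a copy, so min and max both end up reporting the last row written:

```java
// Demo: reusing one buffer across rows corrupts reference-holding stats.
class StatsBugDemo {
    public static void main(String[] args) {
        SketchStats stats = new SketchStats();
        byte[] reused = new byte[1]; // one buffer reused across rows

        for (byte row : new byte[] {5, 1, 9, 3}) {
            reused[0] = row;      // overwrite in place, as a writer might
            stats.update(reused);
        }

        // min and max both alias the reused buffer, so each reports the
        // last row's content (3) instead of the true min (1) and max (9).
        System.out.println(stats.min[0] + " " + stats.max[0]); // 3 3
    }
}

// Hypothetical stats tracker that keeps references, not copies -- the bug.
class SketchStats {
    byte[] min, max;

    void update(byte[] value) {
        if (min == null || compare(value, min) < 0) min = value;
        if (max == null || compare(value, max) > 0) max = value;
    }

    private static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Byte.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Copying the bytes inside {{update}} (e.g. {{value.clone()}}) would make the tracked min/max independent of the writer's buffer reuse.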



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)