[ 
https://issues.apache.org/jira/browse/PARQUET-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549590#comment-14549590
 ] 

Julien Le Dem commented on PARQUET-251:
---------------------------------------

The main goal of the {{Binary}} class is to abstract over the bytes 
representation, whether it's a {{byte[]}}, a subsection of an existing array, 
or something else. The idea is to avoid copying bytes as much as possible when 
they need to be passed through several layers and possibly logically split or 
concatenated before the final reading or writing.
Depending on what we do with the bytes and where they come from, we want 
to either copy them or get the existing array. We certainly want to avoid 
unnecessary copies.

This depends on two things:
 - whether the producer of the {{Binary}} object intends to reuse the 
underlying bytes (if it is backed by a buffer), or they are never mutated 
again (the {{Binary}} object only holds a reference to them)
 - whether the consumer of the {{Binary}} object intends to hold onto it 
(dictionaries, stats) or not (it just writes it to a buffer).

We can capture this at the producer level by differentiating between mutable 
and immutable bytes: {{Binary.fromMutable(...)}} or {{Binary.fromImmutable(...)}}.
Then on the consumer side we use {{getBackingBytes()}} if we don't intend to 
hold onto it outside the current scope, and {{copy()}} if we do. 
Now {{ImmutableBinary}} can implement {{copy()}} as just {{return this}}.
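
A minimal sketch of how that producer/consumer contract could look, using the names proposed above ({{fromMutable}}, {{fromImmutable}}, {{getBackingBytes}}, {{copy}}); this is hypothetical illustration code, not the actual parquet-mr implementation:

```java
import java.util.Arrays;

abstract class Binary {
    // The producer labels the bytes at creation time.
    static Binary fromImmutable(byte[] bytes) { return new ImmutableBinary(bytes); }
    static Binary fromMutable(byte[] bytes) { return new MutableBinary(bytes); }

    // Consumers call getBackingBytes() when the bytes are only used
    // within the current scope (e.g. written straight to a buffer)...
    abstract byte[] getBackingBytes();
    // ...and copy() when they hold onto the value (dictionaries, stats).
    abstract Binary copy();
}

final class ImmutableBinary extends Binary {
    private final byte[] bytes;
    ImmutableBinary(byte[] bytes) { this.bytes = bytes; }
    byte[] getBackingBytes() { return bytes; }
    // Immutable bytes never change, so copy() is free.
    Binary copy() { return this; }
}

final class MutableBinary extends Binary {
    private final byte[] bytes;
    MutableBinary(byte[] bytes) { this.bytes = bytes; }
    byte[] getBackingBytes() { return bytes; }
    // The producer may overwrite the backing array, so take a snapshot.
    Binary copy() { return new ImmutableBinary(Arrays.copyOf(bytes, bytes.length)); }
}
```

With this split, a stats or dictionary consumer that calls {{copy()}} pays nothing when the producer already promised immutability, and pays exactly one defensive copy otherwise.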

I would prefer that we don't call the new method {{getBytesUnsafe}}, as it is 
not unsafe when used under the right conditions (not using the resulting bytes 
outside the current scope). I prefer {{getBackingBytes}}, which better reflects 
the intent. +1 on making {{getBytes()}} a synonym of {{copy()}}



> Binary column statistics error when reuse byte[] among rows
> -----------------------------------------------------------
>
>                 Key: PARQUET-251
>                 URL: https://issues.apache.org/jira/browse/PARQUET-251
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Yijie Shen
>            Assignee: Ashish K Singh
>            Priority: Blocker
>
> I think it is common practice when inserting table data as a parquet file to 
> reuse the same object among rows, and if a column is a byte[] of fixed 
> length, the byte[] is also reused. 
> If I use ByteArrayBackedBinary for my byte[], the bug occurs: all of the row 
> groups created by a single task have the same max & min binary value, 
> namely the last row's binary content.
> The reason is that BinaryStatistics just keeps max & min as 
> parquet.io.api.Binary references; since I use ByteArrayBackedBinary for 
> byte[], the real content of max & min always points to the reused byte[], 
> i.e. the latest row's content.
> Does parquet declare somewhere that the user shouldn't reuse byte[] for 
> Binary type?  If it doesn't, I think it's a bug and can be reproduced by 
> [Spark SQL's RowWriteSupport 
> |https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L353-354]
> The related Spark JIRA ticket: 
> [SPARK-6859|https://issues.apache.org/jira/browse/SPARK-6859]
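
The aliasing described in the quoted issue can be reproduced in miniature. The following is hypothetical illustration code (not the actual parquet-mr statistics implementation): a stats tracker that stores references instead of copies, fed from a single reused buffer.

```java
class ReuseBugDemo {
    // Stats kept as references, not copies -- the root of the bug.
    static byte[] min, max;

    static void update(byte[] value) {
        if (min == null || compare(value, min) < 0) min = value; // BUG: no copy
        if (max == null || compare(value, max) > 0) max = value; // BUG: no copy
    }

    // Unsigned lexicographic comparison of byte arrays.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] reused = new byte[1]; // one buffer reused across "rows"
        for (byte v : new byte[]{3, 1, 7}) {
            reused[0] = v;
            update(reused);
        }
        // min and max both alias the reused buffer, so both report the
        // last row's content (7 7) instead of the true min/max (1 7).
        System.out.println(min[0] + " " + max[0]);
    }
}
```

Because every stored reference points at the same array, whatever value was written last overwrites the "remembered" min and max, which matches the row-group statistics corruption reported above.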



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
