Ryan Blue created PARQUET-62:
--------------------------------

             Summary: DictionaryValuesWriter dictionaries are corrupted by user changes.
                 Key: PARQUET-62
                 URL: https://issues.apache.org/jira/browse/PARQUET-62
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
            Reporter: Ryan Blue
            Assignee: Ryan Blue
            Priority: Blocker


DictionaryValuesWriter passes incoming Binary objects directly to an 
Object2IntMap to accumulate dictionary values, without copying them. If the 
caller reuses the arrays backing those Binary objects, the stored dictionary 
values are corrupted but are still written without an error.

Because Hadoop reuses objects passed to mappers and reducers, this can happen 
easily. For example, Avro reuses the byte arrays backing Utf8 objects, which 
parquet-avro passes wrapped in a Binary object to writeBytes.
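
A minimal sketch of the failure mode described above. It does not use the 
Parquet classes: BytesRef is a hypothetical stand-in for a Binary that wraps a 
caller-owned array, a java.util.LinkedHashMap stands in for the Object2IntMap, 
and the single reused 3-byte buffer is an assumption that imitates a caller 
recycling its byte array between records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReusedBufferDemo {
    /** Hypothetical stand-in for Binary: wraps the caller's array without copying. */
    static final class BytesRef {
        final byte[] bytes;
        BytesRef(byte[] bytes) { this.bytes = bytes; }
        @Override public boolean equals(Object o) {
            return o instanceof BytesRef && Arrays.equals(bytes, ((BytesRef) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
        @Override public String toString() { return new String(bytes, StandardCharsets.UTF_8); }
    }

    /**
     * Adds each value to a dictionary while reusing one buffer for every record.
     * For simplicity, every value must be exactly 3 ASCII characters.
     */
    static List<String> dictionaryValues(String... values) {
        Map<BytesRef, Integer> dictionary = new LinkedHashMap<>();
        byte[] reused = new byte[3]; // one buffer shared by all records
        for (String value : values) {
            byte[] src = value.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, reused, 0, reused.length);
            BytesRef key = new BytesRef(reused); // no copy: the key aliases the buffer
            if (!dictionary.containsKey(key)) {
                dictionary.put(key, dictionary.size());
            }
        }
        // Read the keys back the way a writer would when flushing the dictionary.
        List<String> result = new ArrayList<>();
        for (BytesRef key : dictionary.keySet()) {
            result.add(key.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Both entries alias the same buffer, so both read back as the last value.
        System.out.println(dictionaryValues("foo", "bar")); // prints [bar, bar]
    }
}
```

The map still holds two entries, so nothing fails, but every stored key now 
reflects whatever the buffer held last.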

The fix is to make defensive copies of the values passed to the Dictionary 
writer code. I think this only affects the Binary dictionary classes because 
Strings, floats, longs, etc. are immutable.
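
A rough sketch of the defensive-copy idea, not the actual patch. BytesRef, 
its copy() method, and DictionaryWriter are hypothetical names used only for 
illustration, and a java.util.LinkedHashMap again stands in for the 
Object2IntMap. Copying only when a new entry is stored is a choice made in 
this sketch to avoid copying values that are already in the dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DefensiveCopyDemo {
    /** Hypothetical stand-in for Binary: wraps the caller's array without copying. */
    static final class BytesRef {
        final byte[] bytes;
        BytesRef(byte[] bytes) { this.bytes = bytes; }
        /** Returns a key backed by a private copy of the array. */
        BytesRef copy() { return new BytesRef(Arrays.copyOf(bytes, bytes.length)); }
        @Override public boolean equals(Object o) {
            return o instanceof BytesRef && Arrays.equals(bytes, ((BytesRef) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
        @Override public String toString() { return new String(bytes, StandardCharsets.UTF_8); }
    }

    /** Hypothetical dictionary writer that owns the bytes of every stored key. */
    static final class DictionaryWriter {
        private final Map<BytesRef, Integer> ids = new LinkedHashMap<>();

        /** Returns the dictionary id for the value, adding it if it is new. */
        int write(BytesRef value) {
            Integer id = ids.get(value);
            if (id == null) {
                id = ids.size();
                ids.put(value.copy(), id); // store a copy, never the caller's array
            }
            return id;
        }

        List<String> values() {
            List<String> result = new ArrayList<>();
            for (BytesRef key : ids.keySet()) {
                result.add(key.toString());
            }
            return result;
        }
    }

    /** Writes 3-character ASCII values through one reused buffer. */
    static List<String> dictionaryValues(String... values) {
        DictionaryWriter writer = new DictionaryWriter();
        byte[] reused = new byte[3]; // one buffer shared by all records
        for (String value : values) {
            byte[] src = value.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(src, 0, reused, 0, reused.length);
            writer.write(new BytesRef(reused));
        }
        return writer.values();
    }

    public static void main(String[] args) {
        System.out.println(dictionaryValues("foo", "bar", "foo")); // prints [foo, bar]
    }
}
```

With the copy in place, later writes to the caller's buffer no longer change 
the stored keys, and a repeated value resolves to its existing entry.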



--
This message was sent by Atlassian JIRA
(v6.2#6252)
