[ 
https://issues.apache.org/jira/browse/PARQUET-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555870#comment-15555870
 ] 

Patrick Woody commented on PARQUET-743:
---------------------------------------

Copied from the mailing list:
{code}
Hey all,

I'm running a parquet-mr build off of master and seeing some interesting 
behavior when using a DictionaryFilter to prune row groups. Basically, if I 
have an And or Or filter, the DictionaryPage object gets re-used. This seems to 
be a problem for StreamBytesInput because the stream is exhausted after the 
first toByteArray call. My current workaround is to synchronize and just re-use 
the byte array after the first read, but I'd be curious what people think the 
best approach to solving this is, and whether we should be reusing the 
BytesInput at all.

Best,
Patrick 


---

Looking a bit more, it looks like this is because decompression converts the 
BytesInput to a StreamBytesInput automatically. The current tests run with the 
uncompressed codec, so they don't hit this issue. I've put up a commit here 
that demonstrates the issue and my current workaround: 
https://github.com/palantir/parquet-mr/pull/10/commits/70cc00cba5c294d4c860bd4cd2c48c2d083a5809

Thanks,
{code}
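
A minimal sketch of the failure mode and the workaround described above, using only java.io. The class names (ReusableBytesSketch, StreamBacked, Cached) are hypothetical stand-ins, not parquet-mr's actual BytesInput classes, and the caching wrapper only assumes the general shape of the workaround (synchronize, read once, re-use the byte array) rather than reproducing the linked commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReusableBytesSketch {

    /** Hypothetical stand-in for a stream-backed bytes source: every call drains the stream. */
    static class StreamBacked {
        private final InputStream in;

        StreamBacked(InputStream in) {
            this.in = in;
        }

        byte[] toByteArray() {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                // Reads until end of stream; a second call starts at end of stream.
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Hypothetical wrapper: drains the source once, then hands back the cached bytes. */
    static class Cached {
        private final StreamBacked source;
        private byte[] bytes;

        Cached(StreamBacked source) {
            this.source = source;
        }

        synchronized byte[] toByteArray() {
            if (bytes == null) {
                bytes = source.toByteArray();
            }
            return bytes;
        }
    }

    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4};

        // Two reads of the same stream-backed source, as two sides of an And/Or might do.
        StreamBacked raw = new StreamBacked(new ByteArrayInputStream(page));
        System.out.println(raw.toByteArray().length); // 4
        System.out.println(raw.toByteArray().length); // 0, the stream is already exhausted

        // With the caching wrapper, both reads see the full contents.
        Cached cached = new Cached(new StreamBacked(new ByteArrayInputStream(page)));
        System.out.println(cached.toByteArray().length); // 4
        System.out.println(cached.toByteArray().length); // 4
    }
}
```

The trade-off in this sketch is that the cached array stays in memory for as long as the wrapper is referenced, which may or may not be acceptable for large dictionary pages.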

> DictionaryFilters can re-use StreamBytesInput when compressed
> -------------------------------------------------------------
>
>                 Key: PARQUET-743
>                 URL: https://issues.apache.org/jira/browse/PARQUET-743
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>            Reporter: Patrick Woody
>
> When using an And or Or DictionaryFilter, we re-use the BytesInput across 
> reads. This is problematic when the data is compressed, because decompression 
> converts the BytesInput to a StreamBytesInput, which can only be read once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
