[
https://issues.apache.org/jira/browse/PARQUET-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15555870#comment-15555870
]
Patrick Woody commented on PARQUET-743:
---------------------------------------
Copied from mailing list
{code}
Hey all,
I'm running a parquet-mr build off of master and seeing some interesting
behavior when using a DictionaryFilter to prune row groups. Basically, if I
have an And or Or filter, the DictionaryPage object gets re-used. This seems to
be a problem for StreamBytesInput because the stream is exhausted after the
first toByteArray call. My current workaround is to synchronize and re-use
the byte array after the first read, but I'd be curious what people think
the best approach to solving this is, and whether we should be reusing the
BytesInput at all.
Best,
Patrick
---
Looking a bit more - it looks like this happens because decompression converts
the page bytes to a StreamBytesInput automatically. The current tests run with
the uncompressed codec, so they don't hit this issue. I've put up a commit here
that demonstrates the issue and my current workaround:
https://github.com/palantir/parquet-mr/pull/10/commits/70cc00cba5c294d4c860bd4cd2c48c2d083a5809.
Thanks,
{code}
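The failure mode and the caching workaround described in the comment above can be sketched roughly as follows. This is a minimal illustration using only java.io, not the parquet-mr code or the linked commit: OneShotBytes, CachedBytes, and SingleUseDemo are hypothetical names, with OneShotBytes standing in for a stream-backed input that can be drained only once and CachedBytes standing in for the "synchronize and re-use the byte array" idea.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a stream-backed input: its bytes can be drained only once. */
class OneShotBytes {
    private final InputStream in;

    OneShotBytes(InputStream in) {
        this.in = in;
    }

    /** Reads whatever is left in the stream; a second call finds it exhausted. */
    byte[] toByteArray() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Hypothetical wrapper: drains the source once under a lock, then re-uses the array. */
class CachedBytes {
    private final OneShotBytes source;
    private byte[] cached; // null until the first read

    CachedBytes(OneShotBytes source) {
        this.source = source;
    }

    synchronized byte[] toByteArray() {
        if (cached == null) {
            cached = source.toByteArray();
        }
        return cached;
    }
}

class SingleUseDemo {
    public static void main(String[] args) {
        byte[] page = {1, 2, 3, 4};

        OneShotBytes raw = new OneShotBytes(new ByteArrayInputStream(page));
        System.out.println(raw.toByteArray().length); // prints 4
        System.out.println(raw.toByteArray().length); // prints 0: stream already drained

        CachedBytes safe = new CachedBytes(new OneShotBytes(new ByteArrayInputStream(page)));
        System.out.println(safe.toByteArray().length); // prints 4
        System.out.println(safe.toByteArray().length); // prints 4: served from the cache
    }
}
```

In this sketch the cached array is shared between callers, so it is only safe if readers treat it as read-only; whether caching or avoiding re-use altogether is the better fix is the open question raised above.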
> DictionaryFilters can re-use StreamBytesInput when compressed
> -------------------------------------------------------------
>
> Key: PARQUET-743
> URL: https://issues.apache.org/jira/browse/PARQUET-743
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.9.0
> Reporter: Patrick Woody
>
> When using an And or Or DictionaryFilter, we re-use the BytesInput across
> reads. This is problematic when the data is compressed, because decompression
> converts the BytesInput to a StreamBytesInput, which can only be read once.