This is actually a lot simpler: it turns out that the dictionary filter is broken whenever compression is enabled.
I think Pat’s change sounds like a good fix if we really want to get the release out. Otherwise, we should probably refactor the code so that it does not pass BytesInput around, as the code comment suggests.

- Robert

On 10/4/16, 7:58 PM, "Patrick Woody" <[email protected]> wrote:

Looking a bit more - it looks like this is because decompression converts to a StreamBytesInput automatically. The current tests run with the uncompressed codec, so they don't hit this issue.

I've put up a commit here that demonstrates the issue and my current workaround:
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_parquet-2Dmr_pull_10_commits_70cc00cba5c294d4c860bd4cd2c48c2d083a5809&d=DQIBaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=Gukiqwaa9M7VDsJzd0J3W7mh_DfC1XlLxRRhg4t2Xyc&m=MSfZw5Y5MUMata_hsxWYGnz_CIrLv4WUK6qmRXNBwOk&s=7L03R22O3zRmlOpvjZF-sX0Qny7cJjxPrl3RM-GuMcg&e=

Thanks,
Patrick

On Tue, Oct 4, 2016 at 4:33 PM, Patrick Woody <[email protected]> wrote:

> Hey all,
>
> Running a parquet-mr build off of master and I'm seeing some interesting
> behavior when using a DictionaryFilter to prune row groups. Basically, if I
> have an And or Or filter the DictionaryPage object gets re-used. This seems
> to be a problem for StreamBytesInput because the stream gets exhausted
> after the first toByteArray call. My current workaround is to synchronize
> and just re-use the byte array after the first read, but I'd be curious as
> to what people think the best approach to solving this is and if we should
> be reusing the BytesInput at all.
>
> Best,
> Patrick
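
For anyone following along, here is a minimal sketch of the failure mode and of the "synchronize and re-use the byte array" workaround described above. It uses only java.io; `StreamBacked` and `Cached` are hypothetical stand-ins made up for illustration, not the actual parquet-mr classes or the code in the linked commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for illustration only; these are not parquet-mr classes.
final class OneShotBytesDemo {

    /** Mimics a stream-backed input: the underlying bytes can only be drained once. */
    static final class StreamBacked {
        private final InputStream in;

        StreamBacked(InputStream in) {
            this.in = in;
        }

        byte[] toByteArray() throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            // Copy whatever is left in the stream; after the first call nothing is left.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Workaround sketch: read the source once under a lock, then hand out the cached array. */
    static final class Cached {
        private final StreamBacked source;
        private byte[] cached;

        Cached(StreamBacked source) {
            this.source = source;
        }

        synchronized byte[] toByteArray() throws IOException {
            if (cached == null) {
                cached = source.toByteArray();
            }
            return cached;
        }
    }

    /** Helper that wraps a string's UTF-8 bytes in a one-shot stream. */
    static StreamBacked streamOf(String s) {
        return new StreamBacked(new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws IOException {
        StreamBacked raw = streamOf("dictionary");
        System.out.println(raw.toByteArray().length); // first read drains the stream
        System.out.println(raw.toByteArray().length); // second read finds nothing left

        Cached cached = new Cached(streamOf("dictionary"));
        System.out.println(cached.toByteArray().length); // reads the stream once
        System.out.println(cached.toByteArray().length); // served from the cached array
    }
}
```

Note that the cached variant returns the same array instance on every call, so callers would have to treat it as read-only. That sharing is part of why a refactor that avoids passing the input object around may be the cleaner long-term fix.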
