Hi Ryan,

Apologies for the delay! I've filed it at https://issues.apache.org/jira/browse/PARQUET-743 with the information from the thread.

Thanks,
Patrick

On Wed, Oct 5, 2016 at 4:39 PM, Ryan Blue <[email protected]> wrote:

> Patrick,
>
> Can you please open an issue for this? I think we should fix this before
> the 1.9.0 release. Thanks!
>
> rb
>
> On Tue, Oct 4, 2016 at 11:58 AM, Patrick Woody <[email protected]> wrote:
>
> > Looking a bit more - it looks like this is because decompression converts
> > to a StreamBytesInput automatically. The current tests run with the
> > uncompressed codec, so it doesn't hit this issue. I've put up a commit
> > here that demonstrates the issue and my current workaround:
> > https://github.com/palantir/parquet-mr/pull/10/commits/70cc00cba5c294d4c860bd4cd2c48c2d083a5809
> >
> > Thanks,
> > Patrick
> >
> > On Tue, Oct 4, 2016 at 4:33 PM, Patrick Woody <[email protected]> wrote:
> >
> > > Hey all,
> > >
> > > Running a parquet-mr build off of master and I'm seeing some interesting
> > > behavior when using a DictionaryFilter to prune row groups. Basically, if
> > > I have an And or Or filter the DictionaryPage object gets re-used. This
> > > seems to be a problem for StreamBytesInput because the stream gets
> > > exhausted after the first toByteArray call. My current workaround is to
> > > synchronize and just re-use the byte array after the first read, but I'd
> > > be curious as to what people think the best approach to solving this is
> > > and if we should be reusing the BytesInput at all.
> > >
> > > Best,
> > > Patrick
>
> --
> Ryan Blue
> Software Engineer
> Netflix
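
For anyone picking this up from the JIRA, here is a minimal sketch of the failure mode and of the workaround described in the thread, written against plain java.io rather than the parquet-mr classes. OneShotBytes and CachedBytes are hypothetical names made up for illustration; they only mimic the behavior reported for a stream-backed BytesInput and are not the actual parquet-mr implementation or the code in the linked commit.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReusablePageBytes {

    /** Hypothetical stand-in for a stream-backed holder: reading drains the stream. */
    static final class OneShotBytes {
        private final InputStream in;

        OneShotBytes(InputStream in) {
            this.in = in;
        }

        byte[] toByteArray() {
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                // After the first full read the stream is at EOF, so a second
                // call copies nothing and returns an empty array.
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Sketch of the workaround: read once under a lock, then re-use the array. */
    static final class CachedBytes {
        private final OneShotBytes source;
        private byte[] cached;

        CachedBytes(OneShotBytes source) {
            this.source = source;
        }

        synchronized byte[] toByteArray() {
            if (cached == null) {
                cached = source.toByteArray();
            }
            // The same array is handed to every caller; callers must not mutate it.
            return cached;
        }
    }

    public static void main(String[] args) {
        byte[] page = "dictionary-page".getBytes(StandardCharsets.UTF_8);

        // Two reads of the same stream-backed object, as happens when both
        // sides of an And/Or filter inspect the same dictionary page.
        OneShotBytes raw = new OneShotBytes(new ByteArrayInputStream(page));
        System.out.println("first read:  " + raw.toByteArray().length + " bytes");
        System.out.println("second read: " + raw.toByteArray().length + " bytes");

        // With the cache in front, both reads see the full page.
        CachedBytes safe = new CachedBytes(new OneShotBytes(new ByteArrayInputStream(page)));
        System.out.println("cached first:  " + safe.toByteArray().length + " bytes");
        System.out.println("cached second: " + safe.toByteArray().length + " bytes");
    }
}
```

Caching trades memory for safety: the page bytes stay on the heap for as long as the wrapper is reachable. Whether that is preferable to not sharing the object between filter branches at all is the open question from the thread.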
