Jason,

You were partly correct. We are not dropping records; however, we are corrupting dictionary-encoded binary columns. I mistakenly thought we were returning different records, but we are actually trimming some of the binary columns (or returning unreadable characters in them). I was able to reproduce this with the lineitem data set. I will raise a JIRA, and I think this should be treated as critical. Thoughts?
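To make the distinction between "reordered", "different records" and "corrupted binary columns" concrete, here is a rough sketch of the kind of comparison involved. It is not Drill code: it assumes the rows from the old and the new library have both been read in full into Python lists of tuples, and that the positions of the binary columns are known. The function name `classify_difference`, the `binary_cols` parameter and the sample rows are made up for illustration.

```python
from collections import Counter


def classify_difference(old_rows, new_rows, binary_cols):
    """Compare rows read by the old and new reader.

    old_rows / new_rows: lists of tuples with the same column layout.
    binary_cols: set of column positions holding binary values (assumed known).
    """
    if old_rows == new_rows:
        return "identical"

    # Same rows, just in a different order.
    if Counter(old_rows) == Counter(new_rows):
        return "reordered"

    # Ignore the binary columns and compare again: if everything else
    # matches row by row, only the binary values were altered.
    def mask(row):
        return tuple(v for i, v in enumerate(row) if i not in binary_cols)

    if [mask(r) for r in old_rows] == [mask(r) for r in new_rows]:
        return "binary columns differ"

    return "different records"


# Hypothetical sample: column 1 is a binary column that comes back trimmed.
old = [(1, b"DELIVER IN PERSON"), (2, b"MAIL")]
new = [(1, b"DELIVER"), (2, b"MAIL")]
print(classify_difference(old, new, binary_cols={1}))
```

A check like this only says something when both result sets cover the whole file; with `limit 5` alone, a trimmed binary value is easy to mistake for a different record.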
- Rahul

On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <[email protected]> wrote:

> Jason,
>
> I missed that. Let me check whether we are dropping any records. I would
> be surprised if our regression tests missed that :)
>
> - Rahul
>
> On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <[email protected]> wrote:
>
>> Rahul,
>>
>> Thanks for working on a reproduction of the issue. You didn't actually
>> answer my first question, are you getting the same data out of the file,
>> just in a different order? It seems much more likely that we are dropping
>> some records at the beginning than reordering them somehow, although I
>> would have expected an error like this to be caught by the unit or
>> regression tests.
>>
>> Thanks,
>> Jason
>>
>> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <[email protected]> wrote:
>>
>> > Thanks for your replies. The file is private and I will try to construct a
>> > file without sensitive data which can expose this behavior.
>> >
>> > - Rahul
>> >
>> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <[email protected]> wrote:
>> >
>> > > Is this a large or private parquet file? Can you share it to allow me to
>> > > debug the read path for it?
>> > >
>> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <[email protected]> wrote:
>> > >
>> > > > The changes to parquet were not supposed to be functional at all. We had
>> > > > been maintaining our fork of parquet-mr to have a ByteBuffer based read and
>> > > > write path to reduce heap memory usage. The work done was just getting
>> > > > these changes merged back into parquet-mr and making corresponding changes
>> > > > in Drill to accommodate any interface modifications introduced since we
>> > > > last rebased (there were mostly just package renames). There were a lot of
>> > > > comments on the PR, and a decent amount of refactoring that was done to
>> > > > consolidate and otherwise clean up the code, but there shouldn't have been
>> > > > any changes to the behavior of the reader or writer.
>> > > >
>> > > > Are you getting all of the same data out if you read the whole file, just
>> > > > in a different order?
>> > > >
>> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <[email protected]> wrote:
>> > > >
>> > > >> parquet-meta command suggests that there is only one row group
>> > > >>
>> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <[email protected]> wrote:
>> > > >>
>> > > >> > How many row groups?
>> > > >> >
>> > > >> > --
>> > > >> > Jacques Nadeau
>> > > >> > CTO and Co-Founder, Dremio
>> > > >> >
>> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <[email protected]> wrote:
>> > > >> >
>> > > >> > > Drillers,
>> > > >> > >
>> > > >> > > With the new parquet library update, can someone throw some light on the
>> > > >> > > order in which the records are read from a single parquet file?
>> > > >> > >
>> > > >> > > With the older library, when I run the below query on a single parquet
>> > > >> > > file, I used to get a set of records. Now after the parquet library update,
>> > > >> > > I am seeing a different set of records. Just wanted to understand what
>> > > >> > > specifically has changed.
>> > > >> > >
>> > > >> > > select * from `file.parquet` limit 5;
>> > > >> > >
>> > > >> > > - Rahul
