Re: Order of records read in a parquet file

rahul challapalli Fri, 06 Nov 2015 17:32:33 -0800

Jason,

You were partly correct. We are not dropping records; however, we are
corrupting dictionary-encoded binary columns. I mistakenly thought we were
returning different records, but we are actually trimming (or returning
unreadable characters in) some binary columns. I was able to reproduce this
with the lineitem data set. I will raise a JIRA, and I think it should be
treated as critical. Thoughts?


- Rahul

On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
[email protected]> wrote:

> Jason,
>
> I missed that. Let me check whether we are dropping any records. I would
> be surprised if our regression tests missed that :)
>
> - Rahul
>
> On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <[email protected]>
> wrote:
>
>> Rahul,
>>
>> Thanks for working on a reproduction of the issue. You didn't actually
>> answer my first question: are you getting the same data out of the file,
>> just in a different order? It seems much more likely that we are dropping
>> some records at the beginning than reordering them somehow, although I
>> would have expected an error like this to be caught by the unit or
>> regression tests.
>>
>> Thanks,
>> Jason
>>
>> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
>> [email protected]> wrote:
>>
>> > Thanks for your replies. The file is private and I will try to construct
>> > a file without sensitive data which can expose this behavior.
>> >
>> > - Rahul
>> >
>> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <[email protected]>
>> > wrote:
>> >
>> > > Is this a large or private parquet file? Can you share it to allow me
>> > > to debug the read path for it?
>> > >
>> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <[email protected]>
>> > > wrote:
>> > >
>> > > > The changes to parquet were not supposed to be functional at all. We
>> > > > had been maintaining our fork of parquet-mr to have a ByteBuffer
>> > > > based read and write path to reduce heap memory usage. The work done
>> > > > was just getting these changes merged back into parquet-mr and making
>> > > > corresponding changes in Drill to accommodate any interface
>> > > > modifications introduced since we last rebased (there were mostly
>> > > > just package renames). There were a lot of comments on the PR, and a
>> > > > decent amount of refactoring that was done to consolidate and
>> > > > otherwise clean up the code, but there shouldn't have been any
>> > > > changes to the behavior of the reader or writer.
>> > > >
>> > > > Are you getting all of the same data out if you read the whole file,
>> > > > just in a different order?
>> > > >
>> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
>> > > > [email protected]> wrote:
>> > > >
>> > > >> parquet-meta command suggests that there is only one row group
>> > > >>
>> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <[email protected]>
>> > > >> wrote:
>> > > >>
>> > > >> > How many row groups?
>> > > >> >
>> > > >> > --
>> > > >> > Jacques Nadeau
>> > > >> > CTO and Co-Founder, Dremio
>> > > >> >
>> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
>> > > >> > [email protected]> wrote:
>> > > >> >
>> > > >> > > Drillers,
>> > > >> > >
>> > > >> > > With the new parquet library update, can someone throw some light
>> > > >> > > on the order in which the records are read from a single parquet
>> > > >> > > file?
>> > > >> > >
>> > > >> > > With the older library, when I run the below query on a single
>> > > >> > > parquet file, I used to get a set of records. Now after the parquet
>> > > >> > > library update, I am seeing a different set of records. Just wanted
>> > > >> > > to understand what specifically has changed.
>> > > >> > >
>> > > >> > > select * from `file.parquet` limit 5;
>> > > >> > >
>> > > >> > > - Rahul
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>
