My question was the other way around: if the reader is corrupting things, I'd like to do a CTAS from parquet => JSON and see whether the JSON output is also corrupted. Jason is taking a look now.
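In Drill SQL, that check might look roughly like the sketch below; the dfs.tmp workspace and the /data/file.parquet path are placeholders rather than the actual file under test:

-- write the parquet data back out as JSON, then inspect the JSON copy
alter session set `store.format` = 'json';
create table dfs.tmp.`file_json` as select * from dfs.`/data/file.parquet`;
select * from dfs.tmp.`file_json` limit 5;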
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Fri, Nov 6, 2015 at 6:08 PM, rahul challapalli <[email protected]> wrote:

> I did try your suggestion and sqlline displayed the columns from the json
> file just fine. Raised the below jira to track this issue
> https://issues.apache.org/jira/browse/DRILL-4048
>
> On Fri, Nov 6, 2015 at 5:52 PM, Jacques Nadeau <[email protected]> wrote:
>
> > I wouldn't jump to that conclusion. Sqlline uses toString. If we changed
> > the toString behavior, it could be a problem. Maybe do a ctas to a json
> > file to confirm.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Nov 6, 2015 at 5:40 PM, rahul challapalli <[email protected]> wrote:
> >
> > > From a previous build, I got the data for these columns just fine from
> > > sqlline. So I think we can eliminate any display issues unless I am
> > > missing something?
> > >
> > > - Rahul
> > >
> > > On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau <[email protected]> wrote:
> > >
> > > > Can you confirm if this is a display bug in sqlline or jdbc to string
> > > > versus an actual data return?
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <[email protected]> wrote:
> > > >
> > > > > Jason,
> > > > >
> > > > > You were partly correct. We are not dropping records however we are
> > > > > corrupting dictionary encoded binary columns. I got confused that we
> > > > > are returning different records, but we are trimming (or returning
> > > > > unreadable chars) some columns which are binary. I was able to
> > > > > reproduce with the lineitem data set. I will raise a jira and I
> > > > > think this should be treated critical. Thoughts?
> > > > >
> > > > > - Rahul
> > > > >
> > > > > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <[email protected]> wrote:
> > > > >
> > > > > > Jason,
> > > > > >
> > > > > > I missed that. Let me check whether we are dropping any records.
> > > > > > I would be surprised if our regression tests missed that :)
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <[email protected]> wrote:
> > > > > >
> > > > > > > Rahul,
> > > > > > >
> > > > > > > Thanks for working on a reproduction of the issue. You didn't
> > > > > > > actually answer my first question, are you getting the same data
> > > > > > > out of the file, just in a different order? It seems much more
> > > > > > > likely that we are dropping some records at the beginning than
> > > > > > > reordering them somehow, although I would have expected an error
> > > > > > > like this to be caught by the unit or regression tests.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <[email protected]> wrote:
> > > > > > >
> > > > > > > > Thanks for your replies. The file is private and I will try to
> > > > > > > > construct a file without sensitive data which can expose this
> > > > > > > > behavior.
> > > > > > > >
> > > > > > > > - Rahul
> > > > > > > >
> > > > > > > > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Is this a large or private parquet file? Can you share it to
> > > > > > > > > allow me to debug the read path for it?
> > > > > > > > >
> > > > > > > > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > The changes to parquet were not supposed to be functional
> > > > > > > > > > at all. We had been maintaining our fork of parquet-mr to
> > > > > > > > > > have a ByteBuffer based read and write path to reduce heap
> > > > > > > > > > memory usage. The work done was just getting these changes
> > > > > > > > > > merged back into parquet-mr and making corresponding
> > > > > > > > > > changes in Drill to accommodate any interface modifications
> > > > > > > > > > introduced since we last rebased (there were mostly just
> > > > > > > > > > package renames). There were a lot of comments on the PR,
> > > > > > > > > > and a decent amount of refactoring that was done to
> > > > > > > > > > consolidate and otherwise clean up the code, but there
> > > > > > > > > > shouldn't have been any changes to the behavior of the
> > > > > > > > > > reader or writer.
> > > > > > > > > >
> > > > > > > > > > Are you getting all of the same data out if you read the
> > > > > > > > > > whole file, just in a different order?
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > parquet-meta command suggests that there is only one row
> > > > > > > > > > > group
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > How many row groups?
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Jacques Nadeau
> > > > > > > > > > > > CTO and Co-Founder, Dremio
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Drillers,
> > > > > > > > > > > > >
> > > > > > > > > > > > > With the new parquet library update, can someone
> > > > > > > > > > > > > throw some light on the order in which the records
> > > > > > > > > > > > > are read from a single parquet file?
> > > > > > > > > > > > >
> > > > > > > > > > > > > With the older library, when I run the below query on
> > > > > > > > > > > > > a single parquet file, I used to get a set of
> > > > > > > > > > > > > records. Now after the parquet library update, I am
> > > > > > > > > > > > > seeing a different set of records. Just wanted to
> > > > > > > > > > > > > understand what specifically has changed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > select * from `file.parquet` limit 5;
> > > > > > > > > > > > >
> > > > > > > > > > > > > - Rahul
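For the dictionary-encoded binary columns discussed earlier in the thread, one quick readability check from sqlline is to decode the bytes explicitly rather than relying on their toString output. The sketch below assumes a lineitem parquet file at a placeholder path and the standard TPC-H l_orderkey and l_comment columns; it is an illustration, not the reproduction query from the thread:

-- decode the binary column so trimmed or unreadable values show up as text;
-- the order by makes the limit deterministic, since row order is otherwise not guaranteed
select l_orderkey, convert_from(l_comment, 'UTF8') as l_comment_text
from dfs.`/data/lineitem.parquet`
order by l_orderkey
limit 5;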
