As I said on the JIRA, I fixed the reading issue for the file you posted.
I'm working on a unit test to catch these things sooner in the future.

On Fri, Nov 6, 2015 at 6:16 PM, Jacques Nadeau <[email protected]> wrote:

> My question was the other way around. If the reader is corrupting things,
> I'd like to do a CTAS from parquet => json and check whether the json is
> corrupted. Jason is taking a look now.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Nov 6, 2015 at 6:08 PM, rahul challapalli <
> [email protected]> wrote:
>
> > I did try your suggestion and sqlline displayed the columns from the json
> > file just fine. Raised the below jira to track this issue
> > https://issues.apache.org/jira/browse/DRILL-4048
> >
> > On Fri, Nov 6, 2015 at 5:52 PM, Jacques Nadeau <[email protected]> wrote:
> >
> > > I wouldn't jump to that conclusion. Sqlline uses toString. If we changed
> > > the toString behavior, it could be a problem. Maybe do a ctas to a json
> > > file to confirm.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Nov 6, 2015 at 5:40 PM, rahul challapalli <
> > > [email protected]> wrote:
> > >
> > > > From a previous build, I got the data for these columns just fine from
> > > > sqlline. So I think we can eliminate any display issues unless I am
> > > > missing something?
> > > >
> > > > - Rahul
> > > >
> > > > On Fri, Nov 6, 2015 at 5:34 PM, Jacques Nadeau <[email protected]> wrote:
> > > >
> > > > > Can you confirm if this is a display bug in sqlline or jdbc to string
> > > > > versus an actual data return?
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Fri, Nov 6, 2015 at 5:31 PM, rahul challapalli <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Jason,
> > > > > >
> > > > > > You were partly correct. We are not dropping records; however, we
> > > > > > are corrupting dictionary-encoded binary columns. I mistakenly
> > > > > > thought we were returning different records, but we are actually
> > > > > > trimming (or returning unreadable chars for) some binary columns.
> > > > > > I was able to reproduce with the lineitem data set. I will raise a
> > > > > > jira, and I think this should be treated as critical. Thoughts?
> > > > > >
> > > > > > - Rahul
> > > > > >
> > > > > > On Fri, Nov 6, 2015 at 4:30 PM, rahul challapalli <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Jason,
> > > > > > >
> > > > > > > I missed that. Let me check whether we are dropping any records.
> > > > > > > I would be surprised if our regression tests missed that :)
> > > > > > >
> > > > > > > - Rahul
> > > > > > >
> > > > > > > On Fri, Nov 6, 2015 at 4:19 PM, Jason Altekruse <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > >> Rahul,
> > > > > > >>
> > > > > > > >> Thanks for working on a reproduction of the issue. You didn't
> > > > > > > >> actually answer my first question: are you getting the same
> > > > > > > >> data out of the file, just in a different order? It seems much
> > > > > > > >> more likely that we are dropping some records at the beginning
> > > > > > > >> than reordering them somehow, although I would have expected
> > > > > > > >> an error like this to be caught by the unit or regression
> > > > > > > >> tests.
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Jason
> > > > > > >>
> > > > > > >> On Fri, Nov 6, 2015 at 4:13 PM, rahul challapalli <
> > > > > > >> [email protected]> wrote:
> > > > > > >>
> > > > > > > >> > Thanks for your replies. The file is private and I will try
> > > > > > > >> > to construct a file without sensitive data which can expose
> > > > > > > >> > this behavior.
> > > > > > >> >
> > > > > > >> > - Rahul
> > > > > > >> >
> > > > > > > >> > On Fri, Nov 6, 2015 at 3:45 PM, Jason Altekruse <
> > > > > > > >> > [email protected]> wrote:
> > > > > > >> >
> > > > > > > >> > > Is this a large or private parquet file? Can you share it
> > > > > > > >> > > to allow me to debug the read path for it?
> > > > > > >> > >
> > > > > > > >> > > On Fri, Nov 6, 2015 at 3:37 PM, Jason Altekruse <
> > > > > > > >> > > [email protected]> wrote:
> > > > > > >> > >
> > > > > > > >> > > > The changes to parquet were not supposed to be functional
> > > > > > > >> > > > at all. We had been maintaining our fork of parquet-mr to
> > > > > > > >> > > > have a ByteBuffer-based read and write path to reduce heap
> > > > > > > >> > > > memory usage. The work done was just getting these changes
> > > > > > > >> > > > merged back into parquet-mr and making corresponding
> > > > > > > >> > > > changes in Drill to accommodate any interface modifications
> > > > > > > >> > > > introduced since we last rebased (these were mostly just
> > > > > > > >> > > > package renames). There were a lot of comments on the PR,
> > > > > > > >> > > > and a decent amount of refactoring that was done to
> > > > > > > >> > > > consolidate and otherwise clean up the code, but there
> > > > > > > >> > > > shouldn't have been any changes to the behavior of the
> > > > > > > >> > > > reader or writer.
> > > > > > >> > > >
> > > > > > > >> > > > Are you getting all of the same data out if you read the
> > > > > > > >> > > > whole file, just in a different order?
> > > > > > >> > > >
> > > > > > >> > > > On Fri, Nov 6, 2015 at 3:31 PM, rahul challapalli <
> > > > > > >> > > > [email protected]> wrote:
> > > > > > >> > > >
> > > > > > > >> > > >> parquet-meta command suggests that there is only one
> > > > > > > >> > > >> row group
> > > > group
> > > > > > >> > > >>
> > > > > > > >> > > >> On Fri, Nov 6, 2015 at 3:23 PM, Jacques Nadeau <
> > > > > > > >> > > >> [email protected]> wrote:
> > > > > > >> > > >>
> > > > > > >> > > >> > How many row groups?
> > > > > > >> > > >> >
> > > > > > >> > > >> > --
> > > > > > >> > > >> > Jacques Nadeau
> > > > > > >> > > >> > CTO and Co-Founder, Dremio
> > > > > > >> > > >> >
> > > > > > >> > > >> > On Fri, Nov 6, 2015 at 3:14 PM, rahul challapalli <
> > > > > > >> > > >> > [email protected]> wrote:
> > > > > > >> > > >> >
> > > > > > >> > > >> > > Drillers,
> > > > > > >> > > >> > >
> > > > > > > >> > > >> > > With the new parquet library update, can someone
> > > > > > > >> > > >> > > throw some light on the order in which the records
> > > > > > > >> > > >> > > are read from a single parquet file?
> > > > > > >> > > >> > >
> > > > > > > >> > > >> > > With the older library, when I run the below query
> > > > > > > >> > > >> > > on a single parquet file, I used to get a set of
> > > > > > > >> > > >> > > records. Now after the parquet library update, I am
> > > > > > > >> > > >> > > seeing a different set of records. Just wanted to
> > > > > > > >> > > >> > > understand what specifically has changed.
> > > > > > >> > > >> > >
> > > > > > >> > > >> > > select * from `file.parquet` limit 5;
> > > > > > >> > > >> > >
> > > > > > >> > > >> > > - Rahul
> > > > > > >> > > >> > >
> > > > > > >> > > >> >
> > > > > > >> > > >>
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
