Re: Another columnar format Parquet

Ted Dunning Thu, 04 Apr 2013 23:02:09 -0700

Yes it does.

I have seen conflicting docs on the format it uses.  One seemed to say that
complex values were stored within a single cell.  The other seemed to say
that nested structures were shredded in the style of Parquet or Dremel.


One thing that I worry about with ORC is that it exactly replicates the
schema model of Hive, which isn't as congenial (to me) as the protobuf style
of Parquet.  As Julien mentioned in the Drill meetup, there is also the
question of the correctness of the encoding.  The Dremel column shredding
is pretty subtle.  Hopefully the ORC authors started from first principles in
designing the encoding.
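
To make the subtlety concrete, here is a minimal sketch of Dremel-style
shredding for one column. It is not Parquet's or ORC's actual code. The
schema (a repeated group "links" with an optional string "url") and the
function names are assumptions chosen only for illustration. Each value is
stored with a repetition level (0 starts a new record, 1 continues the
"links" list) and a definition level (how many of the optional/repeated
fields on the path are actually present, here 0 to 2).

```python
# Hypothetical schema, protobuf style:
#   message Doc { repeated group links { optional string url; } }
# Column path: links.url  (max repetition level 1, max definition level 2)


def shred_links_url(records):
    """Shred links.url into (value, repetition_level, definition_level) triples."""
    column = []
    for record in records:
        links = record.get("links", [])
        if not links:
            # No links at all: one placeholder so the record boundary survives.
            column.append((None, 0, 0))
            continue
        for i, link in enumerate(links):
            rep = 0 if i == 0 else 1  # 0 = new record, 1 = next item in links
            if "url" in link:
                column.append((link["url"], rep, 2))  # links and url both defined
            else:
                column.append((None, rep, 1))  # links defined, url missing
    return column


def assemble_links_url(column):
    """Rebuild the records from the triples produced by shred_links_url."""
    records = []
    for value, rep, definition in column:
        if rep == 0:
            records.append({"links": []})
        if definition == 0:
            continue  # record has no links
        link = {"url": value} if definition == 2 else {}
        records[-1]["links"].append(link)
    return records


records = [
    {"links": [{"url": "a"}, {}, {"url": "b"}]},
    {"links": []},
    {},
]
column = shred_links_url(records)
restored = assemble_links_url(column)
```

Note that the second and third records shred to the same triple: in this
model a missing repeated field and an empty one are indistinguishable, which
is one of the details an encoding has to get right (or decide deliberately).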


On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <[email protected]> wrote:

> Does ORC support nested data?  How does it compare to the Dremel encoding
> approach that Parquet utilizes?
>
> Thanks,
> Jacques
>
> On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <[email protected]>
> wrote:
>
> > On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > So is it fair to say that Parquet will be open to contributions and will
> > > hopefully develop an open community to drive it?
> > >
> > > If so, that is an excellent development.
> > >
> > > Is ORC file well enough developed for a comparison?
> > >
> >
> > ORC is committed to Hive's trunk and seems more feature complete than
> > Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> > encoder yet. Obviously, if you have questions about ORC, please ask over
> > on Hive's dev list.
> >
> > -- Owen
> >
> >
> > >
> > > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> > >
> > > > Hey Jacques,
> > > >
> > > > Feel free to ping us with any questions. Despite some of the _users_ of
> > > > Parquet competing with each other (eg query engines), we hope the file
> > > > format itself can be easily implemented by everyone and become
> > > > ubiquitous.
> > > >
> > > > There are a few changes still in flight that we're working on, so you may
> > > > want to join the parquet dev mailing list as well to follow along.
> > > >
> > > > Thanks
> > > > -Todd
> > > >
> > > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]>
> > > > wrote:
> > > >
> > > > > When you said soon, you meant very soon.  This looks like great work.
> > > > > Thanks for sharing it with the world.  Will come back after spending some
> > > > > time with it.
> > > > >
> > > > > thanks again,
> > > > > Jacques
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > The repo is now available: http://parquet.github.com/
> > > > > > Let me know if you have questions
> > > > > >
> > > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]>
> > > > > > wrote:
> > > > > > > There definitely seem to be some new kids on the block.  I really hope
> > > > > > > that Drill can adopt either ORC or Parquet as a closely related "native"
> > > > > > > format.  At the moment, I'm actually more focused on the in-memory
> > > > > > > execution format and the right abstraction to support compressed columnar
> > > > > > > execution and vectorization.  Historically, the biggest gaps I'd worry
> > > > > > > about are Java-centricity and expectation of early materialization &
> > > > > > > decompression.  Once we get some execution stuff working, let's see how
> > > > > > > each fits in.  Rather than start a third competing format (or fourth if
> > > > > > > you count Trevni), let's either use or extend/contribute back on one of
> > > > > > > the existing new kids.
> > > > > > >
> > > > > > > Julien, do you think more will be shared about Parquet before the
> > > > > > > Hadoop Summit so we can start toying with using it inside of Drill?
> > > > > > >
> > > > > > > J
> > > > > > >
> > > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
> > > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> I've been trying to track down status/comparisons of various columnar
> > > > > > >> formats, and just heard about Parquet.
> > > > > > >>
> > > > > > >> I don't have any direct experience with Parquet, but Really Smart Guy
> > > > > > >> said:
> > > > > > >>
> > > > > > >> > From what I hear there are two key features that
> > > > > > >> > differentiate it from ORC and Trevni: 1) columns can be optionally
> > > > > > >> > split into separate files, and 2) the mechanism for shredding nested
> > > > > > >> > fields into columns is taken almost verbatim from Dremel. Feature (1)
> > > > > > >> > won't be practical to use until Hadoop introduces support for a file
> > > > > > >> > group locality feature, but once it does this feature should enable
> > > > > > >> > more efficient use of the buffer cache for predicate pushdown
> > > > > > >> > operations.
> > > > > > >>
> > > > > > >> -- Ken
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> > > > > > >>
> > > > > > >> > Parquet is actually implementing the algorithm described in the
> > > > > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
> > > > > > >> >
> > > > > > >> > [1] http://research.google.com/pubs/pub36632.html
> > > > > > >> >
> > > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
> > > > > > >> >> Just saw this:
> > > > > > >> >>
> > > > > > >> >> http://t.co/ES1dGDZlKA
> > > > > > >> >>
> > > > > > >> >> I know Trevni is another Dremel-inspired columnar format as well.
> > > > > > >> >> Has anyone seen much info on Parquet and how it's different?
> > > > > > >> >>
> > > > > > >> >> Tim
> > > > > > >>
> > > > > > >> --------------------------
> > > > > > >> Ken Krugler
> > > > > > >> +1 530-210-6378
> > > > > > >> http://www.scaleunlimited.com
> > > > > > >> custom big data solutions & training
> > > > > > >> Hadoop, Cascading, Cassandra & Solr
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> >
>
