Re: Another columnar format Parquet

Ted Dunning Tue, 12 Mar 2013 11:46:31 -0700

So is it fair to say that Parquet will be open to contributions and will
hopefully develop an open community to drive it?


If so, that is an excellent development.

Is ORC file well enough developed for a comparison?

On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:

> Hey Jacques,
>
> Feel free to ping us with any questions. Despite some of the _users_ of
> Parquet competing with each other (e.g., query engines), we hope the file
> format itself can be easily implemented by everyone and become ubiquitous.
>
> There are a few changes still in flight that we're working on, so you may
> want to join the parquet dev mailing list as well to follow along.
>
> Thanks
> -Todd
>
> On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]>
> wrote:
>
> > When you said soon, you meant very soon.  This looks like great work.
> > Thanks for sharing it with the world.  Will come back after spending some
> > time with it.
> >
> > thanks again,
> > Jacques
> >
> >
> >
> > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
> >
> > > The repo is now available: http://parquet.github.com/
> > > Let me know if you have questions
> > >
> > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]>
> > > wrote:
> > > > There definitely seem to be some new kids on the block.  I really hope
> > > > that Drill can adopt either ORC or Parquet as a closely related "native"
> > > > format.  At the moment, I'm actually more focused on the in-memory
> > > > execution format and the right abstraction to support compressed columnar
> > > > execution and vectorization.  Historically, the biggest gaps I'd worry
> > > > about are Java-centricity and expectation of early materialization and
> > > > decompression.  Once we get some execution stuff working, let's see how
> > > > each fits in.  Rather than start a third competing format (or fourth if
> > > > you count Trevni), let's either use or extend/contribute back to one of
> > > > the existing new kids.
> > > >
> > > > Julien, do you think more will be shared about Parquet before the Hadoop
> > > > Summit so we can start toying with using it inside of Drill?
> > > >
> > > > J
> > > >
> > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I've been trying to track down status/comparisons of various columnar
> > > >> formats, and just heard about Parquet.
> > > >>
> > > >> I don't have any direct experience with Parquet, but Really Smart Guy
> > > >> said:
> > > >>
> > > >> > From what I hear there are two key features that differentiate it
> > > >> > from ORC and Trevni: 1) columns can be optionally split into separate
> > > >> > files, and 2) the mechanism for shredding nested fields into columns
> > > >> > is taken almost verbatim from Dremel. Feature (1) won't be practical
> > > >> > to use until Hadoop introduces support for a file group locality
> > > >> > feature, but once it does this feature should enable more efficient
> > > >> > use of the buffer cache for predicate pushdown operations.
> > > >>
> > > >> -- Ken
> > > >>
> > > >>
> > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> > > >>
> > > >> > Parquet is actually implementing the algorithm described in the
> > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
> > > >> >
> > > >> > [1] http://research.google.com/pubs/pub36632.html
> > > >> >
> > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
> > > >> >> Just saw this:
> > > >> >>
> > > >> >> http://t.co/ES1dGDZlKA
> > > >> >>
> > > >> >> I know Trevni is another Dremel-inspired columnar format as well. Has
> > > >> >> anyone seen much info on Parquet and how it's different?
> > > >> >>
> > > >> >> Tim
> > > >>
> > > >> --------------------------
> > > >> Ken Krugler
> > > >> +1 530-210-6378
> > > >> http://www.scaleunlimited.com
> > > >> custom big data solutions & training
> > > >> Hadoop, Cascading, Cassandra & Solr
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
