Re: Another columnar format Parquet

Jacques Nadeau Thu, 04 Apr 2013 16:13:18 -0700

Does ORC support nested data?  How does it compare to the Dremel encoding
approach that Parquet utilizes?


Thanks,
Jacques

On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <[email protected]> wrote:

> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <[email protected]>
> wrote:
>
> > So is it fair to say that Parquet will be open to contributions and will
> > hopefully develop an open community to drive it?
> >
> > If so, that is an excellent development.
> >
> > Is ORC file well enough developed for a comparison?
> >
>
> ORC is committed to Hive's trunk and seems more feature complete than
> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> encoder yet. Obviously, if you have questions about ORC, please ask over on
> Hive's dev list.
>
> -- Owen
>
>
> >
> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > Hey Jacques,
> > >
> > > Feel free to ping us with any questions. Despite some of the _users_ of
> > > Parquet competing with each other (e.g. query engines), we hope the file
> > > format itself can be easily implemented by everyone and become
> > > ubiquitous.
> > >
> > > There are a few changes still in flight that we're working on, so you
> > > may want to join the parquet dev mailing list as well to follow along.
> > >
> > > Thanks
> > > -Todd
> > >
> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > > > When you said soon, you meant very soon.  This looks like great work.
> > > > Thanks for sharing it with the world.  Will come back after spending
> > > > some time with it.
> > > >
> > > > thanks again,
> > > > Jacques
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
> > > >
> > > > > The repo is now available: http://parquet.github.com/
> > > > > Let me know if you have questions
> > > > >
> > > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]>
> > > > > > wrote:
> > > > > > > There definitely seem to be some new kids on the block.  I really
> > > > > > > hope that Drill can adopt either ORC or Parquet as a closely
> > > > > > > related "native" format.  At the moment, I'm actually more focused
> > > > > > > on the in-memory execution format and the right abstraction to
> > > > > > > support compressed columnar execution and vectorization.
> > > > > > > Historically, the biggest gaps I'd worry about are Java-centricity
> > > > > > > and expectation of early materialization & decompression.  Once we
> > > > > > > get some execution stuff working, let's see how each fits in.
> > > > > > > Rather than start a third competing format (or fourth if you count
> > > > > > > Trevni), let's either use or extend/contribute back on one of the
> > > > > > > existing new kids.
> > > > > >
> > > > > > > Julien, do you think more will be shared about Parquet before the
> > > > > > > Hadoop Summit so we can start toying with using it inside of Drill?
> > > > > >
> > > > > > J
> > > > > >
> > > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
> > > > > > > <[email protected]> wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > > >> I've been trying to track down status/comparisons of various
> > > > > > >> columnar formats, and just heard about Parquet.
> > > > > > >>
> > > > > > >> I don't have any direct experience with Parquet, but Really
> > > > > > >> Smart Guy said:
> > > > > >>
> > > > > > >> > From what I hear there are two key features that
> > > > > > >> > differentiate it from ORC and Trevni: 1) columns can be
> > > > > > >> > optionally split into separate files, and 2) the mechanism
> > > > > > >> > for shredding nested fields into columns is taken almost
> > > > > > >> > verbatim from Dremel. Feature (1) won't be practical to use
> > > > > > >> > until Hadoop introduces support for a file group locality
> > > > > > >> > feature, but once it does this feature should enable more
> > > > > > >> > efficient use of the buffer cache for predicate pushdown
> > > > > > >> > operations.
> > > > > >>
> > > > > >> -- Ken
> > > > > >>
> > > > > >>
> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> > > > > >>
> > > > > > >> > Parquet is actually implementing the algorithm described in
> > > > > > >> > the "Nested Columnar Storage" section of the Dremel paper[1].
> > > > > >> >
> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
> > > > > >> >
> > > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]>
> > > > > > >> > wrote:
> > > > > >> >> Just saw this:
> > > > > >> >>
> > > > > >> >> http://t.co/ES1dGDZlKA
> > > > > >> >>
> > > > > > >> >> I know Trevni is another Dremel-inspired columnar format as
> > > > > > >> >> well; has anyone seen much info on Parquet and how it's
> > > > > > >> >> different?
> > > > > >> >>
> > > > > >> >> Tim
> > > > > >>
> > > > > >> --------------------------
> > > > > >> Ken Krugler
> > > > > >> +1 530-210-6378
> > > > > >> http://www.scaleunlimited.com
> > > > > >> custom big data solutions & training
> > > > > >> Hadoop, Cascading, Cassandra & Solr
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>
