Re: Another columnar format Parquet

Julien Le Dem Tue, 12 Mar 2013 11:54:21 -0700

Pull requests are more than welcome.
You can open an issue on GitHub or email the list to start a discussion.
Julien


On Tue, Mar 12, 2013 at 11:29 AM, Jacques Nadeau <[email protected]> wrote:
> Bummer, that's what I figured.   That just means there is an opportunity
> for extension, right? :)
>
> J
>
>
> On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <[email protected]> wrote:
>
>> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Joined, thanks.  I'm glad that the approach was open for this.  I think
>> > that helps its chances to be ubiquitous.  As much as this might be
>> > blasphemous to some, I really hope that the final solution to the query
>> > wars is a collaborative solution as opposed to a competitive one.
>> >
>> > Having not looked at the code yet, do the existing read interfaces support
>> > working with "late materialization" execution strategies similar to some of
>> > the ideas at [1]?  Definitely seems harder to implement in a
>> > nested/repeated environment but wanted to get a sense of the thinking
>> > behind the initial efforts.
>> >
>>
>> The existing read interface in Java is tuple-at-a-time, but there's no
>> reason one couldn't build a column-at-a-time late materialization approach.
>> It would just be a lot more "custom", and not directly user-usable, so
>> there's none in the initial implementation.
>>
>> Like you said, it's a little tougher with arbitrary nesting, but I think
>> still doable.
>>
>> -Todd
>>
>> >
>> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
>> >
>> > > Hey Jacques,
>> > >
>> > > Feel free to ping us with any questions. Despite some of the _users_ of
>> > > Parquet competing with each other (eg query engines), we hope the file
>> > > format itself can be easily implemented by everyone and become ubiquitous.
>> > >
>> > > There are a few changes still in flight that we're working on, so you may
>> > > want to join the parquet dev mailing list as well to follow along.
>> > >
>> > > Thanks
>> > > -Todd
>> > >
>> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > When you said soon, you meant very soon.  This looks like great work.
>> > > > Thanks for sharing it with the world.  Will come back after spending some
>> > > > time with it.
>> > > >
>> > > > thanks again,
>> > > > Jacques
>> > > >
>> > > >
>> > > >
>> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
>> > > >
>> > > > > The repo is now available: http://parquet.github.com/
>> > > > > Let me know if you have questions
>> > > > >
>> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
>> > > > > > There definitely seem to be some new kids on the block.  I really hope that
>> > > > > > Drill can adopt either ORC or Parquet as a closely related "native" format.
>> > > > > > At the moment, I'm actually more focused on the in-memory execution
>> > > > > > format and the right abstraction to support compressed columnar execution
>> > > > > > and vectorization.  Historically, the biggest gaps I'd worry about are
>> > > > > > Java-centricity and expectation of early materialization & decompression.
>> > > > > > Once we get some execution stuff working, let's see how each fits in.
>> > > > > > Rather than start a third competing format (or fourth if you count
>> > > > > > Trevni), let's either use or extend/contribute back on one of the existing
>> > > > > > new kids.
>> > > > > >
>> > > > > > Julien, do you think more will be shared about Parquet before the Hadoop
>> > > > > > Summit so we can start toying with using it inside of Drill?
>> > > > > >
>> > > > > > J
>> > > > > >
>> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
>> > > > > >
>> > > > > >> Hi all,
>> > > > > >>
>> > > > > >> I've been trying to track down status/comparisons of various columnar
>> > > > > >> formats, and just heard about Parquet.
>> > > > > >>
>> > > > > >> I don't have any direct experience with Parquet, but Really Smart Guy said:
>> > > > > >>
>> > > > > >> > From what I hear there are two key features that
>> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be optionally split into
>> > > > > >> > separate files, and 2) the mechanism for shredding nested fields into
>> > > > > >> > columns is taken almost verbatim from Dremel. Feature (1) won't be practical
>> > > > > >> > to use until Hadoop introduces support for a file group locality feature, but once it
>> > > > > >> > does this feature should enable more efficient use of the buffer cache for predicate
>> > > > > >> > pushdown operations.
>> > > > > >>
>> > > > > >> -- Ken
>> > > > > >>
>> > > > > >>
>> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>> > > > > >>
>> > > > > >> > Parquet is actually implementing the algorithm described in the
>> > > > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
>> > > > > >> >
>> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
>> > > > > >> >
>> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
>> > > > > >> >> Just saw this:
>> > > > > >> >>
>> > > > > >> >> http://t.co/ES1dGDZlKA
>> > > > > >> >>
>> > > > > >> >> I know Trevni is another Dremel-inspired columnar format as well; has anyone
>> > > > > >> >> seen much info on Parquet and how it's different?
>> > > > > >> >>
>> > > > > >> >> Tim
>> > > > > >>
>> > > > > >> --------------------------
>> > > > > >> Ken Krugler
>> > > > > >> +1 530-210-6378
>> > > > > >> http://www.scaleunlimited.com
>> > > > > >> custom big data solutions & training
>> > > > > >> Hadoop, Cascading, Cassandra & Solr
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Todd Lipcon
>> > > Software Engineer, Cloudera
>> > >
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
