Re: Another columnar format Parquet

Julien Le Dem Tue, 12 Mar 2013 10:52:31 -0700

The mailing list: [email protected]


On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> Hey Jacques,
>
> Feel free to ping us with any questions. Despite some of the _users_ of
> Parquet competing with each other (eg query engines), we hope the file
> format itself can be easily implemented by everyone and become ubiquitous.
>
> There are a few changes still in flight that we're working on, so you may
> want to join the parquet dev mailing list as well to follow along.
>
> Thanks
> -Todd
>
> On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
>
>> When you said soon, you meant very soon.  This looks like great work.
>>  Thanks for sharing it with the world.  Will come back after spending some
>> time with it.
>>
>> thanks again,
>> Jacques
>>
>>
>>
>> On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
>>
>> > The repo is now available: http://parquet.github.com/
>> > Let me know if you have questions
>> >
>> > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
>> > > There definitely seem to be some new kids on the block.  I really hope that
>> > > Drill can adopt either ORC or Parquet as a closely related "native" format.
>> > > At the moment, I'm actually more focused on the in-memory execution
>> > > format and the right abstraction to support compressed columnar execution
>> > > and vectorization.  Historically, the biggest gaps I'd worry about are
>> > > Java-centricity and expectation of early materialization & decompression.
>> > > Once we get some execution stuff working, let's see how each fits in.
>> > > Rather than start a third competing format (or fourth if you count
>> > > Trevni), let's either use or extend/contribute back on one of the existing
>> > > new kids.
>> > >
>> > > Julien, do you think more will be shared about Parquet before the Hadoop
>> > > Summit so we can start toying with using it inside of Drill?
>> > >
>> > > J
>> > >
>> > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I've been trying to track down status/comparisons of various columnar
>> > >> formats, and just heard about Parquet.
>> > >>
>> > >> I don't have any direct experience with Parquet, but Really Smart Guy said:
>> > >>
>> > >> > From what I hear there are two key features that
>> > >> > differentiate it from ORC and Trevni: 1) columns can be optionally split into
>> > >> > separate files, and 2) the mechanism for shredding nested fields into
>> > >> > columns is taken almost verbatim from Dremel. Feature (1) won't be practical
>> > >> > to use until Hadoop introduces support for a file group locality feature, but once it
>> > >> > does this feature should enable more efficient use of the buffer cache for predicate
>> > >> > pushdown operations.
>> > >>
>> > >> -- Ken
>> > >>
>> > >>
>> > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>> > >>
>> > >> > Parquet is actually implementing the algorithm described in the
>> > >> > "Nested Columnar Storage" section of the Dremel paper[1].
>> > >> >
>> > >> > [1] http://research.google.com/pubs/pub36632.html
>> > >> >
>> > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
>> > >> >> Just saw this:
>> > >> >>
>> > >> >> http://t.co/ES1dGDZlKA
>> > >> >>
>> > >> >> I know Trevni is another Dremel-inspired columnar format as well; has anyone
>> > >> >> seen much info on Parquet and how it's different?
>> > >> >>
>> > >> >> Tim
>> > >>
>> > >> --------------------------
>> > >> Ken Krugler
>> > >> +1 530-210-6378
>> > >> http://www.scaleunlimited.com
>> > >> custom big data solutions & training
>> > >> Hadoop, Cascading, Cassandra & Solr
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
