GPL dependencies are always a problem. This support would be a great candidate for an external project.
On Wed, Mar 13, 2013 at 2:08 PM, Tsuyoshi OZAWA <[email protected]> wrote:
> One alternative columnar storage is WiredTiger, used by amazon.com.
> It provides a columnar storage and record-style storage library API
> like Berkeley DB.
>
> One concern is that WiredTiger is licensed under the GPL and BSD.
> However, supporting it could empower the Drill project.
>
> http://wiredtiger.com/
>
> On Wed, Mar 13, 2013 at 4:22 PM, Ted Dunning <[email protected]> wrote:
> > Can you bring 5 slides on Parquet? (ppt or pptx?)
> >
> > On Tue, Mar 12, 2013 at 8:59 PM, Julien Le Dem <[email protected]> wrote:
> >
> >> I should be able to come to the Drill meetup tomorrow.
> >> We can chat about it then.
> >> Julien
> >>
> >> On Tue, Mar 12, 2013 at 1:43 PM, Dmitriy Ryaboy <[email protected]> wrote:
> >> > ColumnIO implementations return values from a column independently of
> >> > other columns; RecordReaderImplementation does materialize the whole
> >> > record (by using a bunch of column readers at the same time). You could
> >> > construct a column-at-a-time, late-materialization API by dropping
> >> > directly into using column readers; so it just depends on which level
> >> > of abstraction you want to hook up with.
> >> >
> >> > We were initially concerned with "record-oriented" frameworks, so we
> >> > built the record materialization machinery for them first; a more truly
> >> > columnar engine should work with ColumnIO instead of RecordReaders.
> >> >
> >> > Also, since the API is still young, it's certainly open to discussion
> >> > and improvement.
> >> >
> >> > D
> >> >
> >> > On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <[email protected]> wrote:
> >> >
> >> >> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <[email protected]> wrote:
> >> >>
> >> >> > Joined, thanks. I'm glad that the approach was open for this. I think
> >> >> > that helps its chances of becoming ubiquitous. As much as this might
> >> >> > be blasphemous to some, I really hope that the final solution to the
> >> >> > query wars is a collaborative solution as opposed to a competitive one.
> >> >> >
> >> >> > Having not looked at the code yet, do the existing read interfaces
> >> >> > support working with "late materialization" execution strategies
> >> >> > similar to some of the ideas at [1]? It definitely seems harder to
> >> >> > implement in a nested/repeated environment, but I wanted to get a
> >> >> > sense of the thinking behind the initial efforts.
> >> >> >
> >> >>
> >> >> The existing read interface in Java is tuple-at-a-time, but there's no
> >> >> reason one couldn't build a column-at-a-time late materialization
> >> >> approach. It would just be a lot more "custom", and not directly
> >> >> user-usable, so there's none in the initial implementation.
> >> >>
> >> >> Like you said, it's a little tougher with arbitrary nesting, but I
> >> >> think still doable.
> >> >>
> >> >> -Todd
> >> >>
> >> >> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> >> >> >
> >> >> > > Hey Jacques,
> >> >> > >
> >> >> > > Feel free to ping us with any questions. Despite some of the _users_
> >> >> > > of Parquet competing with each other (e.g. query engines), we hope
> >> >> > > the file format itself can be easily implemented by everyone and
> >> >> > > become ubiquitous.
> >> >> > >
> >> >> > > There are a few changes still in flight that we're working on, so
> >> >> > > you may want to join the Parquet dev mailing list as well to follow
> >> >> > > along.
> >> >> > >
> >> >> > > Thanks
> >> >> > > -Todd
> >> >> > >
> >> >> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
> >> >> > >
> >> >> > > > When you said soon, you meant very soon. This looks like great
> >> >> > > > work. Thanks for sharing it with the world. Will come back after
> >> >> > > > spending some time with it.
> >> >> > > >
> >> >> > > > thanks again,
> >> >> > > > Jacques
> >> >> > > >
> >> >> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
> >> >> > > >
> >> >> > > > > The repo is now available: http://parquet.github.com/
> >> >> > > > > Let me know if you have questions
> >> >> > > > >
> >> >> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
> >> >> > > > > > There definitely seem to be some new kids on the block. I
> >> >> > > > > > really hope that Drill can adopt either ORC or Parquet as a
> >> >> > > > > > closely related "native" format. At the moment, I'm actually
> >> >> > > > > > more focused on the in-memory execution format and the right
> >> >> > > > > > abstraction to support compressed columnar execution and
> >> >> > > > > > vectorization. Historically, the biggest gaps I'd worry about
> >> >> > > > > > are Java-centricity and an expectation of early materialization
> >> >> > > > > > & decompression. Once we get some execution stuff working,
> >> >> > > > > > let's see how each fits in. Rather than start a third competing
> >> >> > > > > > format (or fourth if you count Trevni), let's either use or
> >> >> > > > > > extend/contribute back to one of the existing new kids.
> >> >> > > > > >
> >> >> > > > > > Julien, do you think more will be shared about Parquet before
> >> >> > > > > > the Hadoop Summit so we can start toying with using it inside
> >> >> > > > > > of Drill?
> >> >> > > > > >
> >> >> > > > > > J
> >> >> > > > > >
> >> >> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
> >> >> > > > > >
> >> >> > > > > >> Hi all,
> >> >> > > > > >>
> >> >> > > > > >> I've been trying to track down status/comparisons of various
> >> >> > > > > >> columnar formats, and just heard about Parquet.
> >> >> > > > > >>
> >> >> > > > > >> I don't have any direct experience with Parquet, but Really
> >> >> > > > > >> Smart Guy said:
> >> >> > > > > >>
> >> >> > > > > >> > From what I hear there are two key features that
> >> >> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be
> >> >> > > > > >> > optionally split into separate files, and 2) the mechanism
> >> >> > > > > >> > for shredding nested fields into columns is taken almost
> >> >> > > > > >> > verbatim from Dremel. Feature (1) won't be practical to use
> >> >> > > > > >> > until Hadoop introduces support for a file group locality
> >> >> > > > > >> > feature, but once it does, this feature should enable more
> >> >> > > > > >> > efficient use of the buffer cache for predicate pushdown
> >> >> > > > > >> > operations.
> >> >> > > > > >>
> >> >> > > > > >> -- Ken
> >> >> > > > > >>
> >> >> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> >> >> > > > > >>
> >> >> > > > > >> > Parquet is actually implementing the algorithm described
> >> >> > > > > >> > in the "Nested Columnar Storage" section of the Dremel
> >> >> > > > > >> > paper [1].
> >> >> > > > > >> >
> >> >> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
> >> >> > > > > >> >
> >> >> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
> >> >> > > > > >> >> Just saw this:
> >> >> > > > > >> >>
> >> >> > > > > >> >> http://t.co/ES1dGDZlKA
> >> >> > > > > >> >>
> >> >> > > > > >> >> I know Trevni is another Dremel-inspired columnar format
> >> >> > > > > >> >> as well; has anyone seen much info on Parquet and how
> >> >> > > > > >> >> it's different?
> >> >> > > > > >> >>
> >> >> > > > > >> >> Tim
> >> >> > > > > >>
> >> >> > > > > >> --------------------------
> >> >> > > > > >> Ken Krugler
> >> >> > > > > >> +1 530-210-6378
> >> >> > > > > >> http://www.scaleunlimited.com
> >> >> > > > > >> custom big data solutions & training
> >> >> > > > > >> Hadoop, Cascading, Cassandra & Solr
> >> >> > >
> >> >> > > --
> >> >> > > Todd Lipcon
> >> >> > > Software Engineer, Cloudera
> >> >>
> >> >> --
> >> >> Todd Lipcon
> >> >> Software Engineer, Cloudera
>
> --
> - Tsuyoshi
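
A minimal sketch of the distinction Dmitriy and Todd draw above between record materialization and a column-at-a-time, late-materialization read path. The IntColumn interface and both methods are invented for illustration; this is not Parquet's actual ColumnIO/ColumnReader API. The point is only that the predicate scans one column and the other column is fetched just for the surviving row ids.

import java.util.ArrayList;
import java.util.List;

public final class LateMaterializationSketch {

  /** Stand-in for a per-column reader; real readers for nested data
      would also expose repetition/definition levels. */
  interface IntColumn {
    int size();
    int get(int row);
  }

  /** Record-at-a-time ("early") materialization: assemble every tuple, then filter. */
  static List<int[]> earlyMaterialize(IntColumn a, IntColumn b, int threshold) {
    List<int[]> out = new ArrayList<>();
    for (int row = 0; row < a.size(); row++) {
      int[] record = { a.get(row), b.get(row) };  // build the whole record up front
      if (record[0] > threshold) out.add(record);
    }
    return out;
  }

  /** Column-at-a-time ("late") materialization: run the predicate over column a
      alone, then fetch column b only for the surviving row ids. */
  static List<int[]> lateMaterialize(IntColumn a, IntColumn b, int threshold) {
    List<Integer> survivors = new ArrayList<>();
    for (int row = 0; row < a.size(); row++) {
      if (a.get(row) > threshold) survivors.add(row);   // keep row ids, not values
    }
    List<int[]> out = new ArrayList<>();
    for (int row : survivors) {
      out.add(new int[] { a.get(row), b.get(row) });    // touch b only when needed
    }
    return out;
  }

  /** Trivial array-backed column so the sketch runs end to end. */
  static IntColumn column(int... values) {
    return new IntColumn() {
      public int size() { return values.length; }
      public int get(int row) { return values[row]; }
    };
  }

  public static void main(String[] args) {
    IntColumn a = column(1, 7, 3, 9);
    IntColumn b = column(10, 20, 30, 40);
    System.out.println(lateMaterialize(a, b, 5).size()); // 2 surviving rows
  }
}

With nesting the same idea applies, but as Todd notes it gets tougher: the surviving positions would presumably have to be tracked through repetition/definition levels rather than flat row ids.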

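For readers new to the "Nested Columnar Storage" algorithm Julien cites, here is a toy shredder for a single column (Name.Language.Code from the Dremel paper's example schema). It assumes a made-up document model of nested lists rather than Parquet's real writers: each emitted value carries a repetition level (where in the path repetition happened) and a definition level (how much of the path is present).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public final class DremelShreddingSketch {

  /** One entry in the shredded Name.Language.Code column. */
  static final class ColumnEntry {
    final String value;  // null when the path is only partially defined
    final int r;         // repetition level
    final int d;         // definition level
    ColumnEntry(String value, int r, int d) { this.value = value; this.r = r; this.d = d; }
    @Override public String toString() { return value + " r=" + r + " d=" + d; }
  }

  /**
   * Shreds one document into the Name.Language.Code column.
   * Document model: outer list = repeated Name, inner list = repeated Language
   * (each entry is its required Code).  Max repetition level is 2 (Name, Language);
   * max definition level is 2 (Code itself is required, so it adds nothing).
   */
  static List<ColumnEntry> shredCodes(List<List<String>> document) {
    List<ColumnEntry> column = new ArrayList<>();
    if (document.isEmpty()) {
      column.add(new ColumnEntry(null, 0, 0));            // no Name at all
      return column;
    }
    for (int n = 0; n < document.size(); n++) {
      int nameRep = (n == 0) ? 0 : 1;                     // a new Name repeats at level 1
      List<String> languages = document.get(n);
      if (languages.isEmpty()) {
        column.add(new ColumnEntry(null, nameRep, 1));    // Name defined, Language missing
        continue;
      }
      for (int l = 0; l < languages.size(); l++) {
        int rep = (l == 0) ? nameRep : 2;                 // a new Language repeats at level 2
        column.add(new ColumnEntry(languages.get(l), rep, 2));
      }
    }
    return column;
  }

  public static void main(String[] args) {
    // Record r1 from the Dremel paper: three Names; the second has no Language.
    List<List<String>> r1 = new ArrayList<>();
    r1.add(Arrays.asList("en-us", "en"));  // first Name: two Languages
    r1.add(Collections.emptyList());       // second Name: no Language
    r1.add(Arrays.asList("en-gb"));        // third Name: one Language
    shredCodes(r1).forEach(System.out::println);
  }
}

Running main on the paper's record r1 prints en-us (r=0, d=2), en (r=2, d=2), a null at (r=1, d=1) for the Name with no Language, and en-gb (r=1, d=2), which matches the column striping shown in the paper.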