Pull requests are more than welcome. You can open an issue on GitHub or email the list to start a discussion.

Julien

On Tue, Mar 12, 2013 at 11:29 AM, Jacques Nadeau <[email protected]> wrote:
> Bummer, that's what I figured. That just means there is an opportunity
> for extension, right? :)
>
> J
>
> On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <[email protected]> wrote:
>
>> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <[email protected]> wrote:
>>
>> > Joined, thanks. I'm glad that the approach was open for this. I think
>> > that helps it chances to be ubiquitous. As much as this might be
>> > blasphemous to some, I really hope that the final solution to the query
>> > wars is a collaborative solution as opposed to a competitive one.
>> >
>> > Having not looked at the code yet, do the existing read interfaces support
>> > working with "late materialization" execution strategies similar to some of
>> > the ideas at [1]? Definitely seems harder to implement in a
>> > nested/repeated environment but wanted to get a sense of the thinking
>> > behind the initial efforts.
>>
>> The existing read interface in Java is tuple-at-a-time, but there's no
>> reason one couldn't build a column-at-a-time late materialization approach.
>> It would just be a lot more "custom", and not directly user-usable, so
>> there's none in the initial implementation.
>>
>> Like you said, it's a little tougher with arbitrary nesting, but I think
>> still doable.
>>
>> -Todd
>>
>> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
>> >
>> > > Hey Jacques,
>> > >
>> > > Feel free to ping us with any questions. Despite some of the _users_ of
>> > > Parquet competing with each other (eg query engines), we hope the file
>> > > format itself can be easily implemented by everyone and become ubiquitous.
>> > >
>> > > There are a few changes still in flight that we're working on, so you may
>> > > want to join the parquet dev mailing list as well to follow along.
>> > >
>> > > Thanks
>> > > -Todd
>> > >
>> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
>> > >
>> > > > When you said soon, you meant very soon. This looks like great work.
>> > > > Thanks for sharing it with the world. Will come back after spending some
>> > > > time with it.
>> > > >
>> > > > thanks again,
>> > > > Jacques
>> > > >
>> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
>> > > >
>> > > > > The repo is now available: http://parquet.github.com/
>> > > > > Let me know if you have questions
>> > > > >
>> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
>> > > > > > There definitely seem to be some new kids on the block. I really hope that
>> > > > > > Drill can adopt either ORC or Parquet as a closely related "native" format.
>> > > > > > At the moment, I'm actually more focused on the in-memory execution
>> > > > > > format and the right abstraction to support compressed columnar execution
>> > > > > > and vectorization. Historically, the biggest gaps I'd worry about are
>> > > > > > java-centricity and expectation of early materialization & decompression.
>> > > > > > Once we get some execution stuff working, lets see how each fits in.
>> > > > > > Rather than start a third competing format (or fourth if you count
>> > > > > > Trevni), let's either use or extend/contribute back on one of the existing
>> > > > > > new kids.
>> > > > > >
>> > > > > > Julien, do you think more will be shared about Parquet before the Hadoop
>> > > > > > Summit so we can start toying with using it inside of Drill?
>> > > > > >
>> > > > > > J
>> > > > > >
>> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
>> > > > > >
>> > > > > >> Hi all,
>> > > > > >>
>> > > > > >> I've been trying to track down status/comparisons of various columnar
>> > > > > >> formats, and just heard about Parquet.
>> > > > > >>
>> > > > > >> I don't have any direct experience with Parquet, but Really Smart Guy said:
>> > > > > >>
>> > > > > >> > From what I hear there are two key features that
>> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be optionally
>> > > > > >> > split into separate files, and 2) the mechanism for shredding nested
>> > > > > >> > fields into columns is taken almost verbatim from Dremel. Feature (1)
>> > > > > >> > won't be practical to use until Hadoop introduces support for a file
>> > > > > >> > group locality feature, but once it does this feature should enable
>> > > > > >> > more efficient use of the buffer cache for predicate pushdown
>> > > > > >> > operations.
>> > > > > >>
>> > > > > >> -- Ken
>> > > > > >>
>> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>> > > > > >>
>> > > > > >> > Parquet is actually implementing the algorithm described in the
>> > > > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
>> > > > > >> >
>> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
>> > > > > >> >
>> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
>> > > > > >> >> Just saw this:
>> > > > > >> >>
>> > > > > >> >> http://t.co/ES1dGDZlKA
>> > > > > >> >>
>> > > > > >> >> I know Trevni is another Dremel inspired Columnar format as well, anyone
>> > > > > >> >> saw much info Parquet and how it's different?
>> > > > > >> >>
>> > > > > >> >> Tim
>> > > > > >>
>> > > > > >> --------------------------
>> > > > > >> Ken Krugler
>> > > > > >> +1 530-210-6378
>> > > > > >> http://www.scaleunlimited.com
>> > > > > >> custom big data solutions & training
>> > > > > >> Hadoop, Cascading, Cassandra & Solr
>> > >
>> > > --
>> > > Todd Lipcon
>> > > Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
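
For anyone following the tuple-at-a-time versus column-at-a-time point in the thread above, here is a rough sketch of the difference. It is not Parquet's API: the table layout, the function names, and the sample data are all made up for illustration, and it only handles flat columns, so it sidesteps the nested/repeated case that the thread says is harder.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical in-memory columnar table: column name -> list of values.
Columns = Dict[str, List[object]]


def scan_tuple_at_a_time(
    columns: Columns,
    predicate_column: str,
    predicate: Callable[[object], bool],
) -> Iterator[Dict[str, object]]:
    """Early materialization: assemble every full row, then filter it."""
    names = list(columns)
    row_count = len(columns[names[0]]) if names else 0
    for i in range(row_count):
        row = {name: columns[name][i] for name in names}
        if predicate(row[predicate_column]):
            yield row


def scan_late_materialization(
    columns: Columns,
    predicate_column: str,
    predicate: Callable[[object], bool],
    projection: List[str],
) -> Iterator[Dict[str, object]]:
    """Late materialization: filter a single column into row positions,
    then read only the projected columns at those positions."""
    positions = [
        i for i, value in enumerate(columns[predicate_column]) if predicate(value)
    ]
    for i in positions:
        yield {name: columns[name][i] for name in projection}


# Made-up sample data.
table: Columns = {
    "id": [1, 2, 3, 4],
    "country": ["US", "FR", "US", "DE"],
    "payload": ["a", "b", "c", "d"],
}


def is_us(value: object) -> bool:
    return value == "US"


early_ids = [row["id"] for row in scan_tuple_at_a_time(table, "country", is_us)]
late_rows = list(scan_late_materialization(table, "country", is_us, ["id"]))
```

Both scans select the same rows. The second one never touches the `payload` column and builds row dictionaries only for positions that pass the filter, which is the saving late materialization is after.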
