Does ORC support nested data? How does it compare to the Dremel encoding approach that Parquet utilizes?

Thanks,
Jacques

On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <[email protected]> wrote:

> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <[email protected]> wrote:
>
> > So is it fair to say that Parquet will be open to contributions and will
> > hopefully develop an open community to drive it?
> >
> > If so, that is an excellent development.
> >
> > Is ORC file well enough developed for a comparison?
>
> ORC is committed to Hive's trunk and seems more feature complete than
> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
> encoder yet. Obviously, if you have questions about ORC, please ask over on
> Hive's dev list.
>
> -- Owen
>
> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
> >
> > > Hey Jacques,
> > >
> > > Feel free to ping us with any questions. Despite some of the _users_ of
> > > Parquet competing with each other (eg query engines), we hope the file
> > > format itself can be easily implemented by everyone and become
> > > ubiquitous.
> > >
> > > There are a few changes still in flight that we're working on, so you
> > > may want to join the parquet dev mailing list as well to follow along.
> > >
> > > Thanks
> > > -Todd
> > >
> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]> wrote:
> > >
> > > > When you said soon, you meant very soon. This looks like great work.
> > > > Thanks for sharing it with the world. Will come back after spending
> > > > some time with it.
> > > >
> > > > thanks again,
> > > > Jacques
> > > >
> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]> wrote:
> > > >
> > > > > The repo is now available: http://parquet.github.com/
> > > > > Let me know if you have questions
> > > > >
> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]> wrote:
> > > > > > There definitely seem to be some new kids on the block. I really
> > > > > > hope that Drill can adopt either ORC or Parquet as a closely
> > > > > > related "native" format. At the moment, I'm actually more focused
> > > > > > on the in-memory execution format and the right abstraction to
> > > > > > support compressed columnar execution and vectorization.
> > > > > > Historically, the biggest gaps I'd worry about are java-centricity
> > > > > > and expectation of early materialization & decompression. Once we
> > > > > > get some execution stuff working, lets see how each fits in.
> > > > > > Rather than start a third competing format (or fourth if you count
> > > > > > Trevni), let's either use or extend/contribute back on one of the
> > > > > > existing new kids.
> > > > > >
> > > > > > Julien, do you think more will be shared about Parquet before the
> > > > > > Hadoop Summit so we can start toying with using it inside of Drill?
> > > > > >
> > > > > > J
> > > > > >
> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler <[email protected]> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I've been trying to track down status/comparisons of various
> > > > > > > columnar formats, and just heard about Parquet.
> > > > > > >
> > > > > > > I don't have any direct experience with Parquet, but Really
> > > > > > > Smart Guy said:
> > > > > > >
> > > > > > > > From what I hear there are two key features that
> > > > > > > > differentiate it from ORC and Trevni: 1) columns can be
> > > > > > > > optionally split into separate files, and 2) the mechanism
> > > > > > > > for shredding nested fields into columns is taken almost
> > > > > > > > verbatim from Dremel. Feature (1) won't be practical to use
> > > > > > > > until Hadoop introduces support for a file group locality
> > > > > > > > feature, but once it does this feature should enable more
> > > > > > > > efficient use of the buffer cache for predicate pushdown
> > > > > > > > operations.
> > > > > > >
> > > > > > > -- Ken
> > > > > > >
> > > > > > > On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
> > > > > > >
> > > > > > > > Parquet is actually implementing the algorithm described in
> > > > > > > > the "Nested Columnar Storage" section of the Dremel paper[1].
> > > > > > > >
> > > > > > > > [1] http://research.google.com/pubs/pub36632.html
> > > > > > > >
> > > > > > > > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]> wrote:
> > > > > > > > > Just saw this:
> > > > > > > > >
> > > > > > > > > http://t.co/ES1dGDZlKA
> > > > > > > > >
> > > > > > > > > I know Trevni is another Dremel inspired Columnar format as
> > > > > > > > > well, anyone saw much info Parquet and how it's different?
> > > > > > > > >
> > > > > > > > > Tim
> > > > > > >
> > > > > > > --------------------------
> > > > > > > Ken Krugler
> > > > > > > +1 530-210-6378
> > > > > > > http://www.scaleunlimited.com
> > > > > > > custom big data solutions & training
> > > > > > > Hadoop, Cascading, Cassandra & Solr
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
