Hi Ted,

Can you explain more about the question you have about encoding in ORC?

Tim

Sent from my iPhone

On Apr 4, 2013, at 11:01 PM, Ted Dunning <[email protected]> wrote:

> Yes it does.
>
> I have seen conflicting docs on the format it uses. One seemed to say that
> complex cells were stored within a single cell. The other seemed to say
> that nested structures were shredded in the style of Parquet or Dremel.
>
> One thing that I worry about with ORC is that it exactly replicates the
> schema model of Hive, which isn't as congenial (to me) as the protobuf style
> of Parquet. As Julien mentioned in the Drill meetup, there is also the
> question of the correctness of the encoding. The Dremel column shredding
> is pretty subtle. Hopefully ORC authors started from first principles in
> designing the encoding.
>
> On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <[email protected]> wrote:
>
>> Does ORC support nested data? How does it compare to the Dremel encoding
>> approach that Parquet utilizes?
>>
>> Thanks,
>> Jacques
>>
>> On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <[email protected]>
>> wrote:
>>
>>> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <[email protected]>
>>> wrote:
>>>
>>>> So is it fair to say that Parquet will be open to contributions and will
>>>> hopefully develop an open community to drive it?
>>>>
>>>> If so, that is an excellent development.
>>>>
>>>> Is ORC file well enough developed for a comparison?
>>>
>>> ORC is committed to Hive's trunk and seems more feature-complete than
>>> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
>>> encoder yet. Obviously, if you have questions about ORC, please ask over on
>>> Hive's dev list.
>>>
>>> -- Owen
>>>
>>>>
>>>> On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
>>>>
>>>>> Hey Jacques,
>>>>>
>>>>> Feel free to ping us with any questions.
>>>>> Despite some of the _users_ of
>>>>> Parquet competing with each other (e.g. query engines), we hope the file
>>>>> format itself can be easily implemented by everyone and become ubiquitous.
>>>>>
>>>>> There are a few changes still in flight that we're working on, so you may
>>>>> want to join the Parquet dev mailing list as well to follow along.
>>>>>
>>>>> Thanks
>>>>> -Todd
>>>>>
>>>>> On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> When you said soon, you meant very soon. This looks like great work.
>>>>>> Thanks for sharing it with the world. Will come back after spending some
>>>>>> time with it.
>>>>>>
>>>>>> thanks again,
>>>>>> Jacques
>>>>>>
>>>>>> On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> The repo is now available: http://parquet.github.com/
>>>>>>> Let me know if you have questions
>>>>>>>
>>>>>>> On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]>
>>>>>>> wrote:
>>>>>>>> There definitely seem to be some new kids on the block. I really hope that
>>>>>>>> Drill can adopt either ORC or Parquet as a closely related "native" format.
>>>>>>>> At the moment, I'm actually more focused on the in-memory execution
>>>>>>>> format and the right abstraction to support compressed columnar execution
>>>>>>>> and vectorization. Historically, the biggest gaps I'd worry about are
>>>>>>>> Java-centricity and expectation of early materialization & decompression.
>>>>>>>> Once we get some execution stuff working, let's see how each fits in.
>>>>>>>> Rather than start a third competing format (or fourth if you count
>>>>>>>> Trevni), let's either use or extend/contribute back on one of the existing
>>>>>>>> new kids.
>>>>>>>>
>>>>>>>> Julien, do you think more will be shared about Parquet before the Hadoop
>>>>>>>> Summit so we can start toying with using it inside of Drill?
>>>>>>>>
>>>>>>>> J
>>>>>>>>
>>>>>>>> On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been trying to track down status/comparisons of various columnar
>>>>>>>>> formats, and just heard about Parquet.
>>>>>>>>>
>>>>>>>>> I don't have any direct experience with Parquet, but Really Smart Guy said:
>>>>>>>>>
>>>>>>>>>> From what I hear there are two key features that
>>>>>>>>>> differentiate it from ORC and Trevni: 1) columns can be optionally split
>>>>>>>>>> into separate files, and 2) the mechanism for shredding nested fields into
>>>>>>>>>> columns is taken almost verbatim from Dremel. Feature (1) won't be
>>>>>>>>>> practical to use until Hadoop introduces support for a file group locality
>>>>>>>>>> feature, but once it does this feature should enable more efficient use of
>>>>>>>>>> the buffer cache for predicate pushdown operations.
>>>>>>>>>
>>>>>>>>> -- Ken
>>>>>>>>>
>>>>>>>>> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>>>>>>>>>
>>>>>>>>>> Parquet is actually implementing the algorithm described in the
>>>>>>>>>> "Nested Columnar Storage" section of the Dremel paper[1].
>>>>>>>>>>
>>>>>>>>>> [1] http://research.google.com/pubs/pub36632.html
>>>>>>>>>>
>>>>>>>>>> On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>> Just saw this:
>>>>>>>>>>>
>>>>>>>>>>> http://t.co/ES1dGDZlKA
>>>>>>>>>>>
>>>>>>>>>>> I know Trevni is another Dremel-inspired columnar format as well. Has
>>>>>>>>>>> anyone seen much info on Parquet and how it's different?
>>>>>>>>>>>
>>>>>>>>>>> Tim
>>>>>>>>>
>>>>>>>>> --------------------------
>>>>>>>>> Ken Krugler
>>>>>>>>> +1 530-210-6378
>>>>>>>>> http://www.scaleunlimited.com
>>>>>>>>> custom big data solutions & training
>>>>>>>>> Hadoop, Cascading, Cassandra & Solr
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
