Re: Another columnar format Parquet

Timothy Chen Tue, 09 Apr 2013 08:44:31 -0700

Hi Ted,

Can you explain more about the question you have about encoding in ORC?


Tim

Sent from my iPhone

On Apr 4, 2013, at 11:01 PM, Ted Dunning <[email protected]> wrote:

> Yes it does.
> 
> I have seen conflicting docs on the format it uses.  One seemed to say that
> complex cells were stored within a single cell.  The other seemed to say
> that nested structures were shredded in the style of Parquet or Dremel.
> 
> One thing that I worry about with ORC is that it exactly replicates the
> schema model of Hive which isn't as congenial (to me) as the protobuf style
> of Parquet.  As Julien mentioned in the Drill meetup, there is also the
> question of the correctness of the encoding.  The Dremel column shredding
> is pretty subtle.  Hopefully ORC authors started from first principles in
> designing the encoding.
> 
> 
> On Fri, Apr 5, 2013 at 1:12 AM, Jacques Nadeau <[email protected]> wrote:
> 
>> Does ORC support nested data?  How does it compare to the Dremel encoding
>> approach that Parquet utilizes?
>> 
>> Thanks,
>> Jacques
>> 
>> On Thu, Mar 28, 2013 at 11:22 PM, Owen O'Malley <[email protected]>
>> wrote:
>> 
>>> On Tue, Mar 12, 2013 at 11:45 AM, Ted Dunning <[email protected]>
>>> wrote:
>>> 
>>>> So is it fair to say that Parquet will be open to contributions and will
>>>> hopefully develop an open community to drive it?
>>>> 
>>>> If so, that is an excellent development.
>>>> 
>>>> Is ORC file well enough developed for a comparison?
>>> 
>>> ORC is committed to Hive's trunk and seems more feature complete than
>>> Parquet. Parquet hasn't implemented indexes, dictionaries, or a datetime
>>> encoder yet. Obviously, if you have questions about ORC, please ask over on
>>> Hive's dev list.
>>> 
>>> -- Owen
>>> 
>>> 
>>>> 
>>>> On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <[email protected]> wrote:
>>>> 
>>>>> Hey Jacques,
>>>>> 
>>>>> Feel free to ping us with any questions. Despite some of the _users_ of
>>>>> Parquet competing with each other (e.g., query engines), we hope the file
>>>>> format itself can be easily implemented by everyone and become ubiquitous.
>>>>> 
>>>>> There are a few changes still in flight that we're working on, so you may
>>>>> want to join the parquet dev mailing list as well to follow along.
>>>>> 
>>>>> Thanks
>>>>> -Todd
>>>>> 
>>>>> On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> When you said soon, you meant very soon.  This looks like great work.
>>>>>> Thanks for sharing it with the world.  Will come back after spending some
>>>>>> time with it.
>>>>>> 
>>>>>> thanks again,
>>>>>> Jacques
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> The repo is now available: http://parquet.github.com/
>>>>>>> Let me know if you have questions
>>>>>>> 
>>>>>>> On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <[email protected]>
>>>>>>> wrote:
>>>>>>>> There definitely seem to be some new kids on the block.  I really hope
>>>>>>>> that Drill can adopt either ORC or Parquet as a closely related "native"
>>>>>>>> format.  At the moment, I'm actually more focused on the in-memory
>>>>>>>> execution format and the right abstraction to support compressed columnar
>>>>>>>> execution and vectorization.  Historically, the biggest gaps I'd worry
>>>>>>>> about are Java-centricity and expectation of early materialization &
>>>>>>>> decompression.  Once we get some execution stuff working, let's see how
>>>>>>>> each fits in.  Rather than start a third competing format (or fourth if
>>>>>>>> you count Trevni), let's either use or extend/contribute back on one of
>>>>>>>> the existing new kids.
>>>>>>>> 
>>>>>>>> Julien, do you think more will be shared about Parquet before the Hadoop
>>>>>>>> Summit so we can start toying with using it inside of Drill?
>>>>>>>> 
>>>>>>>> J
>>>>>>>> 
>>>>>>>> On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> I've been trying to track down status/comparisons of various columnar
>>>>>>>>> formats, and just heard about Parquet.
>>>>>>>>> 
>>>>>>>>> I don't have any direct experience with Parquet, but Really Smart Guy
>>>>>>>>> said:
>>>>>>>>> 
>>>>>>>>>> From what I hear there are two key features that differentiate it
>>>>>>>>>> from ORC and Trevni: 1) columns can be optionally split into
>>>>>>>>>> separate files, and 2) the mechanism for shredding nested fields
>>>>>>>>>> into columns is taken almost verbatim from Dremel. Feature (1)
>>>>>>>>>> won't be practical to use until Hadoop introduces support for a
>>>>>>>>>> file group locality feature, but once it does this feature should
>>>>>>>>>> enable more efficient use of the buffer cache for predicate
>>>>>>>>>> pushdown operations.
>>>>>>>>> 
>>>>>>>>> -- Ken
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>>>>>>>>> 
>>>>>>>>>> Parquet is actually implementing the algorithm described in the
>>>>>>>>>> "Nested Columnar Storage" section of the Dremel paper[1].
>>>>>>>>>> 
>>>>>>>>>> [1] http://research.google.com/pubs/pub36632.html
>>>>>>>>>> 
>>>>>>>>>> On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>> Just saw this:
>>>>>>>>>>> 
>>>>>>>>>>> http://t.co/ES1dGDZlKA
>>>>>>>>>>> 
>>>>>>>>>>> I know Trevni is another Dremel-inspired columnar format as well.
>>>>>>>>>>> Has anyone seen much info on Parquet and how it's different?
>>>>>>>>>>> 
>>>>>>>>>>> Tim
>>>>>>>>> 
>>>>>>>>> --------------------------
>>>>>>>>> Ken Krugler
>>>>>>>>> +1 530-210-6378
>>>>>>>>> http://www.scaleunlimited.com
>>>>>>>>> custom big data solutions & training
>>>>>>>>> Hadoop, Cascading, Cassandra & Solr
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>
