I will send a pull request soon with the code.
Repetition levels are redundant as they encode information from the parents in 
the leaf nodes.
I haven’t looked into that yet, but we could write some versions of the code that
ignore the parent nodes for the other leaves.
Julien
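
For illustration only (this is not the code from the pull request mentioned
above): a minimal Python sketch of turning Parquet definition and repetition
levels into Arrow-style list offsets without an if/else per level. It assumes
a single column of type optional list of required int32, so the maximum
definition level is 2 (0 = null list, 1 = empty list, 2 = value present) and
the maximum repetition level is 1. The function name levels_to_offsets and its
parameters are hypothetical, and the input levels are assumed to be well formed
(the first repetition level is 0).

```python
def levels_to_offsets(def_levels, rep_levels, max_def=2):
    """Convert Parquet levels for one optional list column to Arrow-style
    offsets and validity (hypothetical helper, for illustration only)."""
    n = len(def_levels)
    offsets = [0] * (n + 1)   # over-allocated, trimmed before returning
    validity = [0] * n        # one entry per row, 1 = list is non-null
    row = -1                  # index of the row currently being filled
    count = 0                 # number of leaf values seen so far
    for d, r in zip(def_levels, rep_levels):
        row += (r == 0)              # repetition level 0 starts a new row
        validity[row] = int(d >= 1)  # definition level 0 means a null list
        count += (d == max_def)      # only max_def carries an actual value
        offsets[row + 1] = count     # unconditional write, the last one wins
    return offsets[:row + 2], validity[:row + 1]


# Rows: [x, y], null, [], [z]
offsets, validity = levels_to_offsets([2, 2, 0, 1, 2], [0, 1, 0, 0, 0])
print(offsets, validity)
```

Every step in the loop is an addition or an unconditional store, which is the
kind of shape that tends to vectorize well in Java or C++; Python is used here
only to show the arithmetic. Under the same assumptions, a sibling leaf below
the same repeated parent would carry the same repetition levels, so offsets
computed from one leaf could in principle be reused for the others.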

> On Jul 28, 2016, at 2:28 PM, Wes McKinney <[email protected]> wrote:
> 
> Hi Julien,
> 
> This is great to hear. Do you have code or an algorithm sketch for the
> conversion? I would like to work on the C++ Parquet to Arrow
> vectorized conversion in the next few months. One of the things I
> haven't thought through is how to jointly decode leaf nodes that are
> part of the same branch (e.g. foo.bar.baz and foo.bar.qux together)
> without redundant computation (perhaps this is what you're alluding
> to).
> 
> Thanks,
> Wes
> 
> On Sat, Jul 16, 2016 at 9:49 PM, Julien Le Dem <[email protected]> wrote:
>> On my end I did a few versions of vectorized conversion from parquet 
>> definition levels to arrow offsets.
>> Some tricks to avoid branching work well.
>> I'll publish something soon.
>> 
>> Julien
>> 
>>> On Jul 15, 2016, at 19:04, Jacques Nadeau <[email protected]> wrote:
>>> 
>>> Hello,
>>> 
>>> I had a great time at the Hackathon. Thanks to Julien for putting this
>>> together! Thanks to everyone who joined.
>>> 
>>> There were some good discussions and some exploration work. I started
>>> exploring a paradigm for supporting a zero performance impact abstraction
>>> approach to on and off heap access currently named slyheap. I'm exploring
>>> using sentinel objects and bytecode rewriting to avoid extra indirections
>>> for primitive arrays when wanting to swap out to ArrowBuf. I'll be out of
>>> town over the next week but will try to post some progress on this the
>>> following week.
>>> 
>>> thanks,
>>> Jacques
>>> 
>>> 
>>>> On Thu, Jul 14, 2016 at 8:45 AM, Julien Le Dem <[email protected]> wrote:
>>>> 
>>>> I'm currently in
>>>> - the hangout: https://hangouts.google.com/hangouts/_/dremio.com/parquet
>>>> - the irc channel parquet on irc.freenode.net
>>>> 
>>>> On Tue, Jul 12, 2016 at 4:04 PM, Jacques Nadeau <[email protected]>
>>>> wrote:
>>>> 
>>>>> 883 N Shoreline Blvd, Suite C100, Mountain View, CA
>>>>> 
>>>>> On Tue, Jul 12, 2016 at 3:16 PM, Parth Chandra <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Can you post the address? I'll try to join the morning session.
>>>>>> 
>>>>>> On Mon, Jul 11, 2016 at 9:36 PM, Julien Le Dem <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Confirming that we’ll do the Parquet Hackathon this Thursday July 14th
>>>>>>> Pacific time (GMT-7 in summer)
>>>>>>> There will be a Google hangout (I’ll send an invite and a link) and an IRC
>>>>>>> channel (parquet channel on irc.freenode.net)
>>>>>>> The location is the Dremio office on Shoreline Blvd, Mountain View, CA
>>>>>>> 
>>>>>>> Responded:
>>>>>>> - Jason
>>>>>>> - Julien
>>>>>>> - Nezih
>>>>>>> - Deepak
>>>>>>> - Ryan
>>>>>>> - Jacques
>>>>>>> - Urvish
>>>>>>> Will join remotely:
>>>>>>> - Uwe (GMT+1 in the morning)
>>>>>>> - Ferd (GMT+8, in the afternoon 3:30pm -> 9pm)
>>>>>>> - Wes
>>>>>>> 
>>>>>>> I’ll probably be on irc/hangout while on the train 8:33am -> 9:46am and be
>>>>>>> there around 10am
>>>>>>> There will be people to open the door earlier.
>>>>>>> 
>>>>>>> Agenda/things that have been mentioned on the thread:
>>>>>>> - Parquet <-> Arrow
>>>>>>> - Parquet-cpp->Arrow-C++->PyArrow
>>>>>>> - https://issues.apache.org/jira/browse/HIVE-8128
>>>>>>> - vectorized read in Drill
>>>>>>> - https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet
>>>>>>> - https://github.com/apache/parquet-mr/pull/257
>>>>>>> 
>>>>>>> Feel free to add more/show up
>>>>>>> 
>>>>>>>> On Jul 8, 2016, at 10:54 PM, Julien Le Dem <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> There is the parquet channel on irc.freenode.net
>>>>>>>> I'll set up a hangout as well.
>>>>>>>> 
>>>>>>>>> On Fri, Jul 8, 2016 at 9:54 AM, Wes McKinney <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Do we have a Slack / IRC for Parquet yet? I will be joining remotely
>>>>>>>>> throughout the day. For anyone who is interested in algorithms for
>>>>>>>>> Arrow nested data <-> Parquet disassembly/reassembly, we should start
>>>>>>>>> a shared Google document to detail the algorithms and the various test
>>>>>>>>> cases we'll need to address in the implementation.
>>>>>>>>> 
>>>>>>>>> On Wed, Jul 6, 2016 at 2:30 PM, Deepak Majeti <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 14th works for me too. I am mainly interested in vectorizing
>>>>>>>>>> parquet-cpp as well.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jul 6, 2016 at 4:50 PM, Nezih Yigitbasi
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 14th works for me too.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jul 6, 2016 at 12:54 AM Uwe Korn <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Yes, I'm GMT +1
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 05.07.16 18:52, Julien Le Dem wrote:
>>>>>>>>>>>>> If there are people interested in the cpp implementation we’ll talk
>>>>>>>>>>>>> about that too.
>>>>>>>>>>>>> I’m happy to give context or help with the encoding. In particular a
>>>>>>>>>>>>> Parquet -> Arrow vectorized converter would be great.
>>>>>>>>>>>>> Are you GMT +1 ?
>>>>>>>>>>>>> We can schedule a 1 hour slot in the morning for discussing with remote
>>>>>>>>>>>>> folks in Europe. (same in afternoon if there are people joining from Asia)
>>>>>>>>>>>>> Julien
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 5, 2016, at 2:37 AM, Uwe Korn <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> is this effort only for the parquet-mr project, or would there also be
>>>>>>>>>>>>>> some work/benefit for parquet-cpp? If so, I might join briefly in a
>>>>>>>>>>>>>> hangout, but due to the timezone shift I will probably not be able to
>>>>>>>>>>>>>> be awake all the time.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Uwe
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 02.07.16 01:01, Julien Le Dem wrote:
>>>>>>>>>>>>>>> Dear Parquet dev list,
>>>>>>>>>>>>>>> There have been efforts in several projects for vectorized reads of
>>>>>>>>>>>>>>> Parquet.
>>>>>>>>>>>>>>> We had discussed during the Parquet sync up to organize a hackathon to
>>>>>>>>>>>>>>> brainstorm and look into a shared implementation.
>>>>>>>>>>>>>>> Some projects that would benefit:
>>>>>>>>>>>>>>> - Apache Drill
>>>>>>>>>>>>>>> - Apache Arrow
>>>>>>>>>>>>>>> - Apache Spark
>>>>>>>>>>>>>>> - Presto
>>>>>>>>>>>>>>> - Apache Hive
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm planning to organize this at the Dremio office in Mountain View with
>>>>>>>>>>>>>>> optionally a hangout for people who would want to join remotely.
>>>>>>>>>>>>>>> I'm adding to the "to:" people that have expressed interest or could be
>>>>>>>>>>>>>>> interested but that's not an exhaustive list. Please respond to this
>>>>>>>>>>>>>>> email if you wish to be included.
>>>>>>>>>>>>>>> Who's interested and what dates would work between this Tuesday 7/5 and
>>>>>>>>>>>>>>> Wednesday 7/20 ?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> regards,
>>>>>>>>>> Deepak Majeti
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Julien
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Julien
>>>> 
>> 
