I will send a pull request soon with the code.
Repetition levels are redundant, as they encode information from the parents in the leaf nodes. I haven’t looked into that yet, but we could make some versions of the code that ignore the parent nodes for the other leaves.

Julien

> On Jul 28, 2016, at 2:28 PM, Wes McKinney <[email protected]> wrote:
>
> Hi Julien,
>
> This is great to hear. Do you have code or an algorithm sketch for the
> conversion? I would like to work on the C++ Parquet to Arrow
> vectorized conversion in the next few months. One of the things I
> haven't thought through is how to jointly decode leaf nodes that are
> part of the same branch (e.g. foo.bar.baz and foo.bar.qux together)
> without redundant computation (perhaps this is what you're alluding
> to).
>
> Thanks,
> Wes
>
> On Sat, Jul 16, 2016 at 9:49 PM, Julien Le Dem <[email protected]> wrote:
>> On my end I did a few versions of vectorized conversion from parquet
>> definition levels to arrow offsets.
>> Some tricks to avoid branching work well.
>> I'll publish something soon.
>>
>> Julien
>>
>>> On Jul 15, 2016, at 19:04, Jacques Nadeau <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I had a great time at the Hackathon. Thanks to Julien for putting this
>>> together! Thanks to everyone who joined.
>>>
>>> There were some good discussions and some exploration work. I started
>>> exploring a paradigm for supporting a zero performance impact abstraction
>>> approach to on and off heap access currently named slyheap. I'm exploring
>>> using sentinel objects and bytecode rewriting to avoid extra indirections
>>> for primitive arrays when wanting to swap out to ArrowBuf. I'll be out of
>>> town over the next week but will try to post some progress on this the
>>> following week.
>>>
>>> thanks,
>>> Jacques
>>>
>>>
>>>> On Thu, Jul 14, 2016 at 8:45 AM, Julien Le Dem <[email protected]> wrote:
>>>>
>>>> I'm currently in
>>>> - the hangout: https://hangouts.google.com/hangouts/_/dremio.com/parquet
>>>> - the irc channel parquet on irc.freenode.net
>>>>
>>>> On Tue, Jul 12, 2016 at 4:04 PM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> 883 N Shoreline Blvd, Suite C100, Mountain View, CA
>>>>>
>>>>> On Tue, Jul 12, 2016 at 3:16 PM, Parth Chandra <[email protected]> wrote:
>>>>>
>>>>>> Can you post the address? I'll try to join the morning session.
>>>>>>
>>>>>> On Mon, Jul 11, 2016 at 9:36 PM, Julien Le Dem <[email protected]> wrote:
>>>>>>
>>>>>>> Confirming that we’ll do the Parquet Hackathon this Thursday July 14th
>>>>>>> Pacific time (GMT-7 in summer)
>>>>>>> There will be a Google hangout (I’ll send an invite and a link) and an
>>>>>>> IRC channel (parquet channel on irc.freenode.net)
>>>>>>> The location is the Dremio office on Shoreline Blvd, Mountain View, CA
>>>>>>>
>>>>>>> Responded:
>>>>>>> - Jason
>>>>>>> - Julien
>>>>>>> - Nezih
>>>>>>> - Deepak
>>>>>>> - Ryan
>>>>>>> - Jacques
>>>>>>> - Urvish
>>>>>>> Will join remotely:
>>>>>>> - Uwe (GMT+1 in the morning)
>>>>>>> - Ferd (GMT+8, in the afternoon 3:30pm -> 9pm)
>>>>>>> - Wes
>>>>>>>
>>>>>>> I’ll probably be on irc/hangout while on the train 8:33am -> 9:46am and
>>>>>>> be there around 10am
>>>>>>> There will be people to open the door earlier.
>>>>>>>
>>>>>>> Agenda/things that have been mentioned on the thread:
>>>>>>> - Parquet <-> Arrow
>>>>>>> - Parquet-cpp->Arrow-C++->PyArrow
>>>>>>> - https://issues.apache.org/jira/browse/HIVE-8128
>>>>>>> - vectorized read in Drill
>>>>>>> - https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet
>>>>>>> - https://github.com/apache/parquet-mr/pull/257
>>>>>>>
>>>>>>> Feel free to add more/show up
>>>>>>>
>>>>>>>> On Jul 8, 2016, at 10:54 PM, Julien Le Dem <[email protected]> wrote:
>>>>>>>>
>>>>>>>> There is the parquet channel on irc.freenode.net
>>>>>>>> I'll set up a hangout as well.
>>>>>>>>
>>>>>>>>> On Fri, Jul 8, 2016 at 9:54 AM, Wes McKinney <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Do we yet have a Slack / IRC for Parquet? I will be joining remotely
>>>>>>>>> throughout the day. Anyone who is interested in algorithms for Arrow
>>>>>>>>> nested data <-> Parquet disassembly/reassembly, we should start a
>>>>>>>>> shared Google document to detail algorithms and various test cases
>>>>>>>>> we'll need to address in the implementation.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 6, 2016 at 2:30 PM, Deepak Majeti <[email protected]> wrote:
>>>>>>>>>> 14th works for me too. I am mainly interested in vectorizing
>>>>>>>>>> parquet-cpp as well.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 6, 2016 at 4:50 PM, Nezih Yigitbasi <[email protected]> wrote:
>>>>>>>>>>> 14th works for me too.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 6, 2016 at 12:54 AM Uwe Korn <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I'm GMT +1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 05.07.16 18:52, Julien Le Dem wrote:
>>>>>>>>>>>>> If there are people interested in the cpp implementation we’ll
>>>>>>>>>>>>> talk about that too.
>>>>>>>>>>>>> I’m happy to give context or help with the encoding. In
>>>>>>>>>>>>> particular a Parquet -> Arrow vectorized converter would be great.
>>>>>>>>>>>>> Are you GMT +1 ?
>>>>>>>>>>>>> We can schedule a 1 hour slot in the morning for discussing with
>>>>>>>>>>>>> remote folks in Europe. (same in afternoon if there are people
>>>>>>>>>>>>> joining from Asia)
>>>>>>>>>>>>> Julien
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 5, 2016, at 2:37 AM, Uwe Korn <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> this effort is only for the parquet-mr project or would there
>>>>>>>>>>>>>> also be some work/benefit for parquet-cpp? If so, I might join
>>>>>>>>>>>>>> briefly in a hangout but due to the timezone shift, I probably
>>>>>>>>>>>>>> will not be able to be awake all the time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Uwe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 02.07.16 01:01, Julien Le Dem wrote:
>>>>>>>>>>>>>>> Dear Parquet dev list,
>>>>>>>>>>>>>>> There have been efforts in several projects for vectorized
>>>>>>>>>>>>>>> reads of Parquet.
>>>>>>>>>>>>>>> We had discussed during the Parquet sync up to organize a
>>>>>>>>>>>>>>> hackathon to brainstorm and look into a shared implementation.
>>>>>>>>>>>>>>> Some projects that would benefit:
>>>>>>>>>>>>>>> - Apache Drill
>>>>>>>>>>>>>>> - Apache Arrow
>>>>>>>>>>>>>>> - Apache Spark
>>>>>>>>>>>>>>> - Presto
>>>>>>>>>>>>>>> - Apache Hive
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm planning to organize this at the Dremio office in Mountain
>>>>>>>>>>>>>>> View with optionally a hangout for people who would want to
>>>>>>>>>>>>>>> join remotely.
>>>>>>>>>>>>>>> I'm adding to the "to:" people that have expressed interest or
>>>>>>>>>>>>>>> could be interested but that's not an exhaustive list. Please
>>>>>>>>>>>>>>> respond to this email if you wish to be included.
>>>>>>>>>>>>>>> Who's interested and what dates would work between this
>>>>>>>>>>>>>>> Tuesday 7/5 and Wednesday 7/20 ?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> regards,
>>>>>>>>>> Deepak Majeti
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Julien
>>>>
>>>>
>>>>
>>>> --
>>>> Julien
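
For readers following the thread, here is a rough sketch of what a conversion from Parquet definition/repetition levels to Arrow list offsets might look like. It is not the code from the pull request mentioned above, which is not part of this thread. It assumes a single column that is a nullable list of non-null values, so definition level 0 means "list is null", 1 means "list is empty", and 2 means "element present"; repetition level 0 starts a new list and 1 continues the current one. The function name `levels_to_arrow_list` is hypothetical. The loop body adds comparison results as 0/1 instead of using `if` statements, which is one simple way to keep the per-value logic free of branches.

```python
def levels_to_arrow_list(def_levels, rep_levels, max_def=2):
    """Turn Parquet levels for one nullable list column into Arrow-style
    (offsets, validity). Assumed encoding: def 0 = null list,
    def 1 = empty list, def max_def = element present; rep 0 = new list."""
    n_lists = sum(r == 0 for r in rep_levels)
    lengths = [0] * n_lists
    validity = [False] * n_lists

    idx = -1
    for d, r in zip(def_levels, rep_levels):
        # Comparisons are added as 0/1, so there is no explicit branch here.
        idx += (r == 0)                  # advance to the next list slot
        lengths[idx] += (d == max_def)   # count only entries that hold a value
        validity[idx] |= (d >= 1)        # null only when the list itself is null

    # Arrow offsets are the running sum of the list lengths, starting at 0.
    offsets = [0]
    for length in lengths:
        offsets.append(offsets[-1] + length)
    return offsets, validity


# Four lists: [v, v], null, [], [v]
offsets, validity = levels_to_arrow_list([2, 2, 0, 1, 2], [0, 1, 0, 0, 0])
```

In this example `offsets` has one more entry than there are lists, and a null list and an empty list both get zero length but differ in `validity`. A real implementation would work on packed buffers and deeper nesting; this only illustrates the idea for one level.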
