Here is the PR with the code: https://github.com/apache/parquet-mr/pull/356
On Fri, Jul 29, 2016 at 2:59 PM, Julien Le Dem <[email protected]> wrote:
> I will send a pull request soon with the code.
> Repetition levels are redundant as they encode information from the
> parents in the leaf nodes.
> I haven’t looked into that yet but we could make some versions of the code
> that ignore the parent nodes for the other leaves.
> Julien
>
>
> On Jul 28, 2016, at 2:28 PM, Wes McKinney <[email protected]> wrote:
> >
> > Hi Julien,
> >
> > This is great to hear. Do you have code or an algorithm sketch for the
> > conversion? I would like to work on the C++ Parquet to Arrow
> > vectorized conversion in the next few months. One of the things I
> > haven't thought through is how to jointly decode leaf nodes that are
> > part of the same branch (e.g. foo.bar.baz and foo.bar.qux together)
> > without redundant computation (perhaps this is what you're alluding
> > to).
> >
> > Thanks,
> > Wes
> >
> > On Sat, Jul 16, 2016 at 9:49 PM, Julien Le Dem <[email protected]> wrote:
> >> On my end I did a few versions of vectorized conversion from parquet definition levels to arrow offsets.
> >> Some tricks to avoid branching work well.
> >> I'll publish something soon.
> >>
> >> Julien
> >>
> >>> On Jul 15, 2016, at 19:04, Jacques Nadeau <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I had a great time at the Hackathon. Thanks to Julien for putting this
> >>> together! Thanks to everyone who joined.
> >>>
> >>> There were some good discussions and some exploration work. I started
> >>> exploring a paradigm for supporting a zero performance impact abstraction
> >>> approach to on and off heap access currently named slyheap. I'm exploring
> >>> using sentinel objects and bytecode rewriting to avoid extra indirections
> >>> for primitive arrays when wanting to swap out to ArrowBuf. I'll be out of
> >>> town over the next week but will try to post some progress on this the
> >>> following week.
> >>>
> >>> thanks,
> >>> Jacques
> >>>
> >>>
> >>>> On Thu, Jul 14, 2016 at 8:45 AM, Julien Le Dem <[email protected]> wrote:
> >>>>
> >>>> I'm currently in
> >>>> - the hangout: https://hangouts.google.com/hangouts/_/dremio.com/parquet
> >>>> - the irc channel parquet on irc.freenode.net
> >>>>
> >>>> On Tue, Jul 12, 2016 at 4:04 PM, Jacques Nadeau <[email protected]> wrote:
> >>>>
> >>>>> 883 N Shoreline Blvd, Suite C100, Mountain View, CA
> >>>>>
> >>>>> On Tue, Jul 12, 2016 at 3:16 PM, Parth Chandra <[email protected]> wrote:
> >>>>>
> >>>>>> Can you post the address? I'll try to join the morning session.
> >>>>>>
> >>>>>> On Mon, Jul 11, 2016 at 9:36 PM, Julien Le Dem <[email protected]> wrote:
> >>>>>>
> >>>>>>> Confirming that we’ll do the Parquet Hackathon this Thursday July 14th
> >>>>>>> Pacific time (GMT-7 in summer)
> >>>>>>> There will be a Google hangout (I’ll send an invite and a link) and an IRC
> >>>>>>> channel (parquet channel on irc.freenode.net)
> >>>>>>> The location is the Dremio office on Shoreline Blvd, Mountain View, CA
> >>>>>>>
> >>>>>>> Responded:
> >>>>>>> - Jason
> >>>>>>> - Julien
> >>>>>>> - Nezih
> >>>>>>> - Deepak
> >>>>>>> - Ryan
> >>>>>>> - Jacques
> >>>>>>> - Urvish
> >>>>>>> Will join remotely:
> >>>>>>> - Uwe (GMT+1 in the morning)
> >>>>>>> - Ferd (GMT+8, in the afternoon 3:30pm -> 9pm)
> >>>>>>> - Wes
> >>>>>>>
> >>>>>>> I’ll probably be on irc/hangout while on the train 8:33am -> 9:46am and be
> >>>>>>> there around 10am
> >>>>>>> There will be people to open the door earlier.
> >>>>>>>
> >>>>>>> Agenda/things that have been mentioned on the thread:
> >>>>>>> - Parquet <-> Arrow
> >>>>>>> - Parquet-cpp->Arrow-C++->PyArrow
> >>>>>>> - https://issues.apache.org/jira/browse/HIVE-8128
> >>>>>>> - vectorized read in Drill
> >>>>>>> - https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet
> >>>>>>> - https://github.com/apache/parquet-mr/pull/257
> >>>>>>>
> >>>>>>> Feel free to add more/show up
> >>>>>>>
> >>>>>>>> On Jul 8, 2016, at 10:54 PM, Julien Le Dem <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> There is the parquet channel on irc.freenode.net
> >>>>>>>> I'll set up a hangout as well.
> >>>>>>>>
> >>>>>>>>> On Fri, Jul 8, 2016 at 9:54 AM, Wes McKinney <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Do we yet have a Slack / IRC for Parquet? I will be joining remotely
> >>>>>>>>> throughout the day. Anyone who is interested in algorithms for Arrow
> >>>>>>>>> nested data <-> Parquet disassembly/reassembly, we should start a
> >>>>>>>>> shared Google document to detail algorithms and various test cases
> >>>>>>>>> we'll need to address in the implementation.
> >>>>>>>>>
> >>>>>>>>> On Wed, Jul 6, 2016 at 2:30 PM, Deepak Majeti <[email protected]> wrote:
> >>>>>>>>>> 14th works for me too. I am mainly interested in vectorizing
> >>>>>>>>>> parquet-cpp as well.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Jul 6, 2016 at 4:50 PM, Nezih Yigitbasi
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>> 14th works for me too.
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Jul 6, 2016 at 12:54 AM Uwe Korn <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Yes, I'm GMT +1
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 05.07.16 18:52, Julien Le Dem wrote:
> >>>>>>>>>>>>> If there are people interested in the cpp implementation we’ll talk about that too.
> >>>>>>>>>>>>> I’m happy to give context or help with the encoding. In particular a Parquet -> Arrow vectorized converter would be great.
> >>>>>>>>>>>>> Are you GMT +1 ?
> >>>>>>>>>>>>> We can schedule a 1 hour slot in the morning for discussing with remote folks in Europe. (same in afternoon if there are people joining from Asia)
> >>>>>>>>>>>>> Julien
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Jul 5, 2016, at 2:37 AM, Uwe Korn <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> this effort is only for the parquet-mr project or would there also be some work/benefit for parquet-cpp? If so, I might join briefly in a hangout but due to the timezone shift, I probably will not be able to be awake all the time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Uwe
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 02.07.16 01:01, Julien Le Dem wrote:
> >>>>>>>>>>>>>>> Dear Parquet dev list,
> >>>>>>>>>>>>>>> There have been efforts in several projects for vectorized reads of Parquet.
> >>>>>>>>>>>>>>> We had discussed during the Parquet sync up to organize a hackathon to
> >>>>>>>>>>>>>>> brainstorm and look into a shared implementation.
> >>>>>>>>>>>>>>> Some projects that would benefit:
> >>>>>>>>>>>>>>> - Apache Drill
> >>>>>>>>>>>>>>> - Apache Arrow
> >>>>>>>>>>>>>>> - Apache Spark
> >>>>>>>>>>>>>>> - Presto
> >>>>>>>>>>>>>>> - Apache Hive
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm planning to organize this at the Dremio office in Mountain View with
> >>>>>>>>>>>>>>> optionally a hangout for people who would want to join remotely.
> >>>>>>>>>>>>>>> I'm adding to the "to:" people that have expressed interest or could be
> >>>>>>>>>>>>>>> interested but that's not an exhaustive list. Please respond to this email
> >>>>>>>>>>>>>>> if you wish to be included.
> >>>>>>>>>>>>>>> Who's interested and what dates would work between this Tuesday 7/5 and
> >>>>>>>>>>>>>>> Wednesday 7/20 ?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> regards,
> >>>>>>>>>> Deepak Majeti
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Julien
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Julien
> >>>>
> >>
> >

--
Julien
