On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Maarten, I don't expect regressions for flat cases (I'm going to try to
> run benchmark comparisons tonight).
>
> In terms of the flag, I'm more concerned about some corner case I didn't
> think of in testing, or a workload that for some reason is better with
> the prior code. If either of these arises, I would like to give users a
> way to mitigate issues so a patch release isn't an imperative.
I'd be OK with having the flag so long as the new code is the default.
Otherwise we'll never find out about the corner cases.

> I'm happy to clean up the existing code as soon as the next release
> happens, though.
>
> Thoughts?
>
> -Micah
>
> On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Maarten -- AFAIK Micah's work only affects nested / non-flat column
> > paths, so flat data should not be impacted. Since we have a partial
> > implementation of writes for nested data (lists-of-lists and
> > structs-of-structs, but no mix of the two), that was the performance
> > difference I was referencing.
> >
> > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
> > >
> > > Hi Micah,
> > >
> > > How does the performance change for “flat” schemas?
> > > (particularly in the case of a large number of columns)
> > >
> > > Thanks,
> > > Maarten
> > >
> > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > > Another status update. I've integrated the level generation code
> > > > with the Parquet writing code [1].
> > > >
> > > > After that PR is merged I'll add bindings in Python to control
> > > > versions of the level generation algorithm, and plan on moving on
> > > > to the read side.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/arrow/pull/6586
> > > >
> > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > >> Hi Igor,
> > > >> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
> > > >> might be a good task to pick up for this. I think it should be a
> > > >> relatively small amount of code, so it is probably a good
> > > >> contribution to the project. Once that is wrapped up we can see
> > > >> where we both are.
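[Editor's note] For context on the runtime "escape hatch" being discussed: ship the new level-generation engine as the default so corner cases actually surface, but keep the old engine selectable so a regression can be worked around without an emergency patch release. A minimal sketch of that dispatch pattern follows; all names (`WriterProperties`, `engine_version`, the level functions) are illustrative stand-ins, not Arrow's actual C++/Python API.

```python
# Hypothetical sketch of the algorithm-version flag discussed above.
# All names are illustrative; this is not Arrow's actual API.

def legacy_levels(rows):
    # Stand-in for the pre-existing writer path (flat columns only).
    if any(isinstance(r, list) for r in rows):
        raise NotImplementedError("legacy engine: nested data unsupported")
    return [0 if r is None else 1 for r in rows]

def nested_levels(rows):
    # Stand-in for the new engine, which also accepts nested values.
    return [0 if r is None else 1 for r in rows]

class WriterProperties:
    """Carries the engine choice; the NEW engine ("V2") is the default."""
    def __init__(self, engine_version="V2"):
        if engine_version not in ("V1", "V2"):
            raise ValueError(f"unknown engine_version: {engine_version!r}")
        self.engine_version = engine_version

def compute_def_levels(rows, props=None):
    # Dispatch at runtime, so users can opt back into the old path
    # ("V1") without waiting for a patch release.
    props = props or WriterProperties()
    impl = legacy_levels if props.engine_version == "V1" else nested_levels
    return impl(rows)

print(compute_def_levels([1, None, 2]))                       # default: new engine
print(compute_def_levels([1, None], WriterProperties("V1")))  # explicit escape hatch
```

Once the new engine has "baked" for a few releases, the flag can become a no-op and the legacy branch can be deleted, which matches the cleanup plan described above.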
> > > >>
> > > >> Cheers,
> > > >> Micah
> > > >>
> > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com> wrote:
> > > >>
> > > >>> Hi Micah, I actually got involved with another personal project
> > > >>> and had to postpone my contribution to Arrow a bit. The good news
> > > >>> is that I'm almost done with it, so I could help you with the
> > > >>> read side very soon. Any ideas how we could coordinate this?
> > > >>>
> > > >>> On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >>>
> > > >>>> hi Micah -- great news on the level generation PR. I'll try to
> > > >>>> carve out some time for reviewing over the coming week.
> > > >>>>
> > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >>>>>
> > > >>>>> Hi Igor,
> > > >>>>> I was wondering if you have made any progress on this?
> > > >>>>>
> > > >>>>> I posted a new PR [1] which I believe handles the difficult
> > > >>>>> algorithmic part of writing. There will be some follow-ups, but
> > > >>>>> I think this PR might take a while to review, so I was thinking
> > > >>>>> of starting to take a look at the read side if you haven't
> > > >>>>> started yet, and circling back to the final integration for the
> > > >>>>> write side once the PR is checked in.
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Micah
> > > >>>>>
> > > >>>>> [1] https://github.com/apache/arrow/pull/6490
> > > >>>>>
> > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com> wrote:
> > > >>>>>
> > > >>>>>> Hi, I would love to help with this issue. I'm aware that this
> > > >>>>>> is a huge task for a first contribution to Arrow, but I feel
> > > >>>>>> that I could help with the read path.
> > > >>>>>> Reading Parquet seems like an extremely complex task, since
> > > >>>>>> both Hive [0] and Spark [1] tried to implement a "vectorized"
> > > >>>>>> version and both stopped short of supporting complex types.
> > > >>>>>> I wanted to at least give it a try and find out where the
> > > >>>>>> challenge lies.
> > > >>>>>>
> > > >>>>>> Since you guys are much more familiar with the current code
> > > >>>>>> base, I could use some starting tips so I don't fall into
> > > >>>>>> common pitfalls and whatnot.
> > > >>>>>>
> > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
> > > >>>>>> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
> > > >>>>>>
> > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
> > > >>>>>>> Just to give an update. I've been a little bit delayed, but
> > > >>>>>>> my progress is as follows:
> > > >>>>>>> 1. Had one PR merged that will exercise basic end-to-end
> > > >>>>>>> tests.
> > > >>>>>>> 2. Have another PR open that allows a configuration option in
> > > >>>>>>> C++ to determine which algorithm version to use for
> > > >>>>>>> reading/writing: the existing version, and the new version
> > > >>>>>>> supporting complex nested arrays. I think a large amount of
> > > >>>>>>> code will be reused/delegated to, but I will err on the side
> > > >>>>>>> of not touching the existing code/algorithms, so that any
> > > >>>>>>> errors in the implementation or performance regressions can
> > > >>>>>>> hopefully be mitigated at runtime. I expect that in later
> > > >>>>>>> releases (once the code has "baked") the option will become
> > > >>>>>>> a no-op.
> > > >>>>>>> 3. Started coding the write path.
> > > >>>>>>>
> > > >>>>>>> Which leaves:
> > > >>>>>>> 1. Finishing the write path (I estimate 2-3 weeks to be code
> > > >>>>>>> complete).
> > > >>>>>>> 2. Implementing the read path.
> > > >>>>>>>
> > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth and
> > > >>>>>>> want to contribute.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> Micah
> > > >>>>>>>
> > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi Wes,
> > > >>>>>>>> I'm still interested in doing the work, but I don't want to
> > > >>>>>>>> hold anybody up if they have bandwidth.
> > > >>>>>>>>
> > > >>>>>>>> In order to actually make progress on this, my plan will be
> > > >>>>>>>> to:
> > > >>>>>>>> 1. Help with the current Java review backlog through early
> > > >>>>>>>> next week or so (this has been taking the majority of my
> > > >>>>>>>> time allocated for Arrow contributions for the last 6
> > > >>>>>>>> months or so).
> > > >>>>>>>> 2. Shift all my attention to trying to get this done (this
> > > >>>>>>>> means no reviews other than closing out existing ones that
> > > >>>>>>>> I've started, until it is done). Hopefully, other Java
> > > >>>>>>>> committers can help shrink the backlog further (Jacques,
> > > >>>>>>>> thanks for your recent efforts here).
> > > >>>>>>>>
> > > >>>>>>>> Thanks,
> > > >>>>>>>> Micah
> > > >>>>>>>>
> > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> hi folks,
> > > >>>>>>>>>
> > > >>>>>>>>> I think we have reached a point where the incomplete C++
> > > >>>>>>>>> Parquet nested data assembly/disassembly is harming the
> > > >>>>>>>>> value of several other parts of the project, for example
> > > >>>>>>>>> the Datasets API.
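[Editor's note] The "level generation" work referred to throughout this thread means producing Dremel-style definition and repetition levels when shuttling nested Arrow data into Parquet. A toy sketch follows for a single column of type `list<int32>` where both the list and its elements are nullable; it simplifies away everything the real implementation must handle (deeper nesting, structs, mixes of the two), and the function name is made up for illustration.

```python
# Hypothetical sketch of Dremel-style level generation for one column of
# type list<int32> with a nullable list and nullable elements.
# Definition levels here: 0 = list is null, 1 = list present but empty,
# 2 = element slot present but null, 3 = element present and non-null.
# Repetition levels: 0 = first value of a new row, 1 = continuation of
# the same list (one repeated ancestor, so max rep level is 1).

def generate_levels(rows):
    """Return (def_levels, rep_levels, non_null_values) for the rows."""
    def_levels, rep_levels, values = [], [], []
    for row in rows:
        if row is None:              # null list
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:          # empty (but present) list
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, item in enumerate(row):
                rep_levels.append(0 if i == 0 else 1)
                if item is None:     # null element inside the list
                    def_levels.append(2)
                else:                # actual stored value
                    def_levels.append(3)
                    values.append(item)
    return def_levels, rep_levels, values

defs, reps, vals = generate_levels([[1, None, 2], None, []])
print(defs)  # [3, 2, 3, 0, 1]
print(reps)  # [0, 1, 1, 0, 0]
print(vals)  # [1, 2]
```

The hard part the PRs above wrestle with is doing this (and its inverse on the read side) efficiently for arbitrary nesting, which is exactly where the Hive and Spark vectorized readers mentioned earlier stopped short.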
> > > >>>>>>>>> As another example, it's possible to ingest nested data
> > > >>>>>>>>> from JSON but not, in general, to write it to Parquet.
> > > >>>>>>>>>
> > > >>>>>>>>> Implementing the nested data read and write path
> > > >>>>>>>>> completely is a difficult project requiring at least
> > > >>>>>>>>> several weeks of dedicated work, so it's not so surprising
> > > >>>>>>>>> that it hasn't been accomplished yet. I know that several
> > > >>>>>>>>> people have expressed interest in working on it, but I
> > > >>>>>>>>> would like to see if anyone would be able to volunteer a
> > > >>>>>>>>> commitment of time and guess at a rough timeline for when
> > > >>>>>>>>> this work could be done. It seems to me that if this slips
> > > >>>>>>>>> beyond 2020 it will significantly diminish the value being
> > > >>>>>>>>> created by other parts of the project.
> > > >>>>>>>>>
> > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code, I'm
> > > >>>>>>>>> one candidate to take on this project (and I can dedicate
> > > >>>>>>>>> the time, but it would come at the expense of other
> > > >>>>>>>>> projects where I can also be useful). But Micah and others
> > > >>>>>>>>> expressed interest in working on it, so I wanted to have a
> > > >>>>>>>>> discussion about it to see what others think.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks
> > > >>>>>>>>> Wes