Hi Wes,
Thanks, that seems like a good characterization. I opened up some JIRA subtasks on ARROW-1644 which go into a little more detail on tasks that can probably be worked on in parallel (I've only assigned to myself the ones I'm actively working on; happy to discuss/collaborate on the finer points on the JIRAs). There will probably be a few more JIRAs to open for the final integration work (e.g. a flag to switch between the old and new engines).
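For concreteness, a minimal sketch of what such a switch could look like (all names here are hypothetical; the real option and its plumbing would be settled on the integration JIRAs):

    // Hypothetical engine selector; defaults to the existing implementation
    // so the new code path stays opt-in while it bakes.
    enum class NestedEngineVersion {
      kV1,  // existing implementation (flat and partially nested paths)
      kV2   // new implementation with full nested read/write support
    };

    struct ArrowReadWriteOptions {
      NestedEngineVersion engine = NestedEngineVersion::kV1;
    };

    // The read/write entry points would branch once on the option:
    //   if (options.engine == NestedEngineVersion::kV2) { /* new path */ }
    //   else                                            { /* existing path */ }

Branching once at the entry points leaves the old code untouched, so a regression in the new engine can be mitigated at runtime rather than with a rebuild; once the new engine has baked, the option can become a no-op.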
For unit tests (Item B), as noted earlier in the thread, there is already a disabled unit test trying to verify the basic ability to round-trip, but that probably isn't sufficient.

Thanks,
Micah

On Wed, Apr 15, 2020 at 9:32 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi Micah,
>
> Sounds good. It seems like there are a few projects where people might be able to work without stepping on each other's toes:
>
> A. Array reassembly from raw repetition/definition levels (I would guess this would be your focus)
> B. Schema and data generation for round-trip correctness and performance testing (I reckon that the unit tests for A will largely be hand-written examples like you did for the write path)
> C. Benchmarks, particularly to be able to assess performance changes going from the old incomplete implementations to the new ones
>
> Some of us should be able to pitch in to help with this. Might also be a good opportunity to do some cleanup of the test code in cpp/src/parquet/arrow
>
> - Wes
>
> On Tue, Apr 14, 2020 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Hi Wes,
> > Yes, I'm making progress, and at this point I anticipate being able to finish it off by the next release, possibly without support for round-tripping fixed-size lists. I've been spending some time thinking about different approaches and have started coding some of the building blocks, which I think in the common case (relatively low nesting levels) should be fairly performant (I'm also going to write some benchmarks to sanity-check this). One caveat to this is that my schedule is going to change slightly next week and it's possible my bandwidth will be more limited; I'll update the list if this happens.
> >
> > I think there are at least two areas that I'm not working on that could be parallelized if you or your team has bandwidth:
> >
> > 1. It would be good to have some Parquet files representing real-world datasets available to benchmark against.
> > 2. The higher-level bookkeeping of tracking which def-levels/rep-levels need to be compared against for any particular column (i.e. the preceding repeated parent). I'm currently working on the code that takes these and converts them to offsets/null fields.
> >
> > I can go into more detail if you or your team would like to collaborate.
> >
> > Thanks,
> > Micah
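To make the offsets/null-fields conversion concrete, here is a minimal sketch for the simplest case, an optional list of required int32 values with a single repetition level (a sketch only; the real code must handle arbitrary nesting and levels arriving in batches):

    #include <cstdint>
    #include <vector>

    // For an optional list<int32 not null>, max_rep_level = 1 and
    // max_def_level = 2: def 0 = null list, def 1 = empty list,
    // def 2 = a present leaf value.
    struct DecodedList {
      std::vector<int32_t> offsets{0};  // length = number of lists + 1
      std::vector<bool> valid;          // one entry per list
    };

    DecodedList DecodeLevels(const std::vector<int16_t>& rep_levels,
                             const std::vector<int16_t>& def_levels) {
      DecodedList out;
      for (size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
          // rep 0 starts a new top-level list entry.
          out.valid.push_back(def_levels[i] > 0);  // def 0 means a null list
          out.offsets.push_back(out.offsets.back());
        }
        if (def_levels[i] == 2) {
          // A present leaf value extends the current list.
          ++out.offsets.back();
        }
      }
      return out;
    }

For example, the lists [[1, 2], [], null, [3]] arrive as rep levels 0 1 0 0 0 and def levels 2 2 1 0 2, and decode to offsets 0 2 2 2 3 with validity true true false true: null and empty lists both get zero extent and are distinguished only by the validity bitmap.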
> > On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> hi Micah,
> >>
> >> I'm glad that we have the write side of nested completed for 0.17.0.
> >>
> >> As far as completing the read side and then implementing sufficient testing to exercise corner cases in end-to-end reads/writes, do you anticipate being able to work on this in the next 4-6 weeks (obviously the state of the world has affected everyone's availability / bandwidth)? I ask because someone from my team (or me also) may be able to get involved and help this move along. It'd be great to have this 100% completed and checked off our list for the next release (i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration tests get completed also)
> >>
> >> thanks
> >> Wes
> >>
> >> On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >>
> >> >> Glad to hear about the progress. As I mentioned on #2, what do you think about setting up a feature branch for you to merge PRs into? Then the branch can be iterated on and we can merge it back when it's feature complete and does not have perf regressions for the flat read/write path.
> >> >>
> >> > I'd like to avoid a separate branch if possible. I'm willing to close the open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small and focused as possible, with performance testing along the way, will make for a better reviewer and developer experience here.
> >> >
> >> >> The earliest I'd have time to work on this myself would likely be sometime in March. Others are welcome to jump in as well (and it'd be great to increase the overall level of knowledge of the Parquet codebase)
> >> >
> >> > Hopefully, Igor can help out; otherwise I'll take up the read path after I finish the write path.
> >> >
> >> > -Micah
> >> >
> >> > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>
> >> >> hi Micah
> >> >>
> >> >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >> >
> >> >> > Just to give an update. I've been a little bit delayed, but my progress is as follows:
> >> >> > 1. Had one PR merged that will exercise basic end-to-end tests.
> >> >> > 2. Have another PR open that adds a configuration option in C++ to determine which algorithm version to use for reading/writing: the existing version or the new version supporting complex nested arrays. I think a large amount of code will be reused/delegated to, but I will err on the side of not touching the existing code/algorithms, so that any errors in the implementation or performance regressions can hopefully be mitigated at runtime. I expect that in later releases (once the code has "baked") the option will become a no-op.
> >> >>
> >> >> Glad to hear about the progress. As I mentioned on #2, what do you think about setting up a feature branch for you to merge PRs into? Then the branch can be iterated on and we can merge it back when it's feature complete and does not have perf regressions for the flat read/write path.
> >> >>
> >> >> > 3. Started coding the write path.
> >> >> >
> >> >> > Which leaves:
> >> >> > 1. Finishing the write path (I estimate 2-3 weeks to be code complete).
> >> >> > 2. Implementing the read path.
> >> >>
> >> >> The earliest I'd have time to work on this myself would likely be sometime in March. Others are welcome to jump in as well (and it'd be great to increase the overall level of knowledge of the Parquet codebase)
> >> >>
> >> >> > Again, I'm happy to collaborate if people have bandwidth and want to contribute.
> >> >> >
> >> >> > Thanks,
> >> >> > Micah
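To sketch what one of those basic end-to-end tests might look like (hedged: WriteThenRead is a hypothetical helper wrapping the Parquet write/read plumbing; ArrayFromJSON comes from Arrow's gtest utilities):

    #include "arrow/api.h"
    #include "arrow/testing/gtest_util.h"
    #include "gtest/gtest.h"

    // Round-trip in the spirit of item B: build a nested array, write it
    // to Parquet, read it back, and compare for equality.
    TEST(NestedRoundTrip, ListOfInt32) {
      auto type = arrow::list(arrow::int32());
      auto array = arrow::ArrayFromJSON(type, "[[1, 2], [], null, [3]]");
      auto table = arrow::Table::Make(
          arrow::schema({arrow::field("col", type)}), {array});
      // WriteThenRead (hypothetical): write `table` to an in-memory
      // Parquet file, then read it back into a new Table.
      std::shared_ptr<arrow::Table> result = WriteThenRead(table);
      ASSERT_TRUE(table->Equals(*result));
    }

Hand-written cases like this keep the expected levels easy to reason about; generated schemas and data (item B) would then cover the combinatorial space.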
> >> >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >> > >
> >> >> > > Hi Wes,
> >> >> > > I'm still interested in doing the work, but I don't want to hold anybody up if they have bandwidth.
> >> >> > >
> >> >> > > In order to actually make progress on this, my plan will be to:
> >> >> > > 1. Help with the current Java review backlog through early next week or so (this has been taking the majority of my time allocated for Arrow contributions for the last 6 months or so).
> >> >> > > 2. Shift all my attention to trying to get this done (this means no reviews other than closing out existing ones that I've started until it is done). Hopefully, other Java committers can help shrink the backlog further (Jacques, thanks for your recent efforts here).
> >> >> > >
> >> >> > > Thanks,
> >> >> > > Micah
> >> >> > >
> >> >> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> hi folks,
> >> >> > >>
> >> >> > >> I think we have reached a point where the incomplete C++ Parquet nested data assembly/disassembly is harming the value of several other parts of the project, for example the Datasets API. As another example, it's possible to ingest nested data from JSON but not, in general, to write it to Parquet.
> >> >> > >>
> >> >> > >> Implementing the nested data read and write path completely is a difficult project requiring at least several weeks of dedicated work, so it's not so surprising that it hasn't been accomplished yet. I know that several people have expressed interest in working on it, but I would like to see if anyone would be able to volunteer a commitment of time and a rough guess at when this work could be done. It seems to me that if this slips beyond 2020 it will significantly diminish the value being created by other parts of the project.
> >> >> > >>
> >> >> > >> Since I'm pretty familiar with all the Parquet code, I'm one candidate to take on this project (and I can dedicate the time, but it would come at the expense of other projects where I can also be useful). But Micah and others expressed interest in working on it, so I wanted to have a discussion about it to see what others think.
> >> >> > >>
> >> >> > >> Thanks
> >> >> > >> Wes
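Relating back to item C above, a skeletal benchmark in the Google Benchmark style used elsewhere in the Arrow C++ tree (both helpers are hypothetical placeholders for the real write/read plumbing):

    #include <cstdint>
    #include "benchmark/benchmark.h"

    constexpr int64_t kNumRows = 1 << 20;

    // BuildNestedParquetFile (hypothetical) writes kNumRows rows of a
    // list<int32> column to an in-memory buffer; ReadNestedParquetFile
    // (hypothetical) reads that buffer back into an arrow::Table.
    static void BM_ReadNestedColumn(benchmark::State& state) {
      auto buffer = BuildNestedParquetFile(kNumRows);
      for (auto _ : state) {
        auto table = ReadNestedParquetFile(buffer);
        benchmark::DoNotOptimize(table);
      }
      state.SetItemsProcessed(state.iterations() * kNumRows);
    }
    BENCHMARK(BM_ReadNestedColumn);

Running the same benchmark with the engine flag set first to the old and then to the new implementation gives the before/after comparison the thread asks for.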