Sounds good. In general, I would say this is a good opportunity to make
improvements around random data generation. For example, I don't think we
have an API for generating a RecordBatch given a schema and some options
(e.g. probability of nulls, distribution of list sizes), but that would be
a good thing to have to assist with both perf and correctness testing.
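
For concreteness, here is a minimal sketch of what such a generator could
look like for a single list<int32> column. RandomBatchOptions and
GenerateRandomBatch are hypothetical names (no such API exists today); the
builder calls are the existing arrow:: ones. A real version would walk an
arbitrary schema and dispatch per type.

#include <memory>
#include <random>

#include "arrow/api.h"

struct RandomBatchOptions {
  double null_probability = 0.1;  // chance that any nullable slot is null
  int32_t max_list_length = 5;    // list sizes drawn uniformly from [0, max]
  uint32_t seed = 42;
};

arrow::Result<std::shared_ptr<arrow::RecordBatch>> GenerateRandomBatch(
    int64_t num_rows, const RandomBatchOptions& options) {
  std::mt19937 rng(options.seed);
  std::bernoulli_distribution is_null(options.null_probability);
  std::uniform_int_distribution<int32_t> list_len(0, options.max_list_length);
  std::uniform_int_distribution<int32_t> value(-1000, 1000);

  auto value_builder = std::make_shared<arrow::Int32Builder>();
  arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);

  for (int64_t i = 0; i < num_rows; ++i) {
    if (is_null(rng)) {
      ARROW_RETURN_NOT_OK(list_builder.AppendNull());  // null list
      continue;
    }
    ARROW_RETURN_NOT_OK(list_builder.Append());  // start a new list
    for (int32_t j = list_len(rng); j > 0; --j) {
      if (is_null(rng)) {
        ARROW_RETURN_NOT_OK(value_builder->AppendNull());  // null element
      } else {
        ARROW_RETURN_NOT_OK(value_builder->Append(value(rng)));
      }
    }
  }

  std::shared_ptr<arrow::Array> column;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&column));
  auto schema =
      arrow::schema({arrow::field("values", arrow::list(arrow::int32()))});
  return arrow::RecordBatch::Make(schema, num_rows, {column});
}

A fixed seed keeps failures reproducible, which matters more for round-trip
correctness testing than true randomness does.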
On Thu, Apr 16, 2020 at 11:28 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> Thanks, that seems like a good characterization. I opened up some JIRA
> subtasks on ARROW-1644 which go into a little more detail on tasks that
> can probably be worked on in parallel (I've only assigned myself the
> ones I'm actively working on; happy to discuss/collaborate on the finer
> points on the JIRAs). There will probably be a few more JIRAs to open
> for the final integration work (e.g. a flag to switch between the old
> and new engines).
>
> For unit tests (Item B), as noted earlier in the thread, there is
> already a disabled unit test trying to verify the basic ability to
> round-trip, but that probably isn't sufficient.
>
> Thanks,
> Micah
>
> On Wed, Apr 15, 2020 at 9:32 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Micah,
>>
>> Sounds good. It seems like there are a few projects where people might
>> be able to work without stepping on each other's toes:
>>
>> A. Array reassembly from raw repetition/definition levels (I would
>> guess this would be your focus)
>> B. Schema and data generation for round-trip correctness and
>> performance testing (I reckon that the unit tests for A will largely
>> be hand-written examples like you did for the write path)
>> C. Benchmarks, particularly to be able to assess performance changes
>> going from the old incomplete implementations to the new ones
>>
>> Some of us should be able to pitch in to help with this. It might also
>> be a good opportunity to do some cleanup of the test code in
>> cpp/src/parquet/arrow.
>>
>> - Wes
>>
>> On Tue, Apr 14, 2020 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Wes,
>> > Yes, I'm making progress, and at this point I anticipate being able
>> > to finish it off by the next release, possibly without support for
>> > round-tripping fixed-size lists. I've been spending some time
>> > thinking about different approaches and have started coding some of
>> > the building blocks, which I think should be fairly performant in
>> > the common case (relatively low nesting levels); I'm also going to
>> > write some benchmarks to sanity-check this. One caveat: my schedule
>> > is going to change slightly next week, and it's possible my
>> > bandwidth will be more limited; I'll update the list if this happens.
>> >
>> > I think there are at least two areas that I'm not working on that
>> > could be parallelized if you or your team has bandwidth:
>> >
>> > 1. It would be good to have some Parquet files representing
>> > real-world datasets available to benchmark against.
>> > 2. The higher-level bookkeeping of tracking which
>> > def-levels/rep-levels need to be compared against for any particular
>> > column (i.e. the preceding repeated parent). I'm currently working
>> > on the code that takes these and converts them to offsets/null
>> > fields.
>> >
>> > I can go into more detail if you or your team would like to
>> > collaborate.
>> >
>> > Thanks,
>> > Micah
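
To make the def-level/rep-level bookkeeping above concrete: for a single
optional list<optional int32> column (max definition level 3, max
repetition level 1), converting levels back into Arrow offsets and
validity reduces to a loop like the sketch below. This is an illustration
of the core idea, not Micah's actual implementation; real code has to
handle arbitrary nesting depth and the value decoding as well.

#include <cstdint>
#include <vector>

struct DecodedList {
  std::vector<int32_t> offsets{0};  // Arrow-style list offsets
  std::vector<bool> list_valid;     // one entry per row
  std::vector<bool> element_valid;  // one entry per stored element
};

// rep_levels and def_levels must have the same length.
DecodedList LevelsToList(const std::vector<int16_t>& rep_levels,
                         const std::vector<int16_t>& def_levels) {
  DecodedList out;
  int32_t element_count = 0;
  for (size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {  // rep 0: this entry starts a new row
      if (i != 0) out.offsets.push_back(element_count);
      out.list_valid.push_back(def_levels[i] >= 1);  // def 0: null list
    }
    if (def_levels[i] >= 2) {  // def >= 2: a physical element slot exists
      out.element_valid.push_back(def_levels[i] == 3);  // def 2: null element
      ++element_count;
    }
  }
  out.offsets.push_back(element_count);
  return out;
}

// Example: rows [[1, null, 2], null, []]
//   rep = {0, 1, 1, 0, 0}, def = {3, 2, 3, 0, 1}
//   -> offsets {0, 3, 3, 3}, list_valid {1, 0, 1}, element_valid {1, 0, 1}

The subtlety the bookkeeping has to capture is which def level means "null
at this nesting level" versus "empty at this level": each optional or
repeated ancestor of a column shifts those thresholds, which is why they
must be tracked per column.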
>> >
>> > On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> hi Micah,
>> >>
>> >> I'm glad that we have the write side of nested data completed for
>> >> 0.17.0.
>> >>
>> >> As far as completing the read side and then implementing sufficient
>> >> testing to exercise corner cases in end-to-end reads/writes, do you
>> >> anticipate being able to work on this in the next 4-6 weeks
>> >> (obviously the state of the world has affected everyone's
>> >> availability / bandwidth)? I ask because someone from my team (or I)
>> >> may be able to get involved and help this move along. It'd be great
>> >> to have this 100% completed and checked off our list for the next
>> >> release (i.e. 0.18.0 or 1.0.0, depending on whether the Java/C++
>> >> integration tests also get completed).
>> >>
>> >> thanks
>> >> Wes
>> >>
>> >> On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> >> Glad to hear about the progress. As I mentioned on #2, what do
>> >> >> you think about setting up a feature branch for you to merge PRs
>> >> >> into? Then the branch can be iterated on, and we can merge it
>> >> >> back when it's feature complete and does not have perf
>> >> >> regressions for the flat read/write path.
>> >> >
>> >> > I'd like to avoid a separate branch if possible. I'm willing to
>> >> > close the open PR until I'm sure it is needed, but I'm hoping that
>> >> > keeping PRs as small and focused as possible, with performance
>> >> > testing along the way, will make for a better reviewer and
>> >> > developer experience here.
>> >> >
>> >> >> The earliest I'd have time to work on this myself would likely
>> >> >> be sometime in March. Others are welcome to jump in as well (and
>> >> >> it'd be great to increase the overall level of knowledge of the
>> >> >> Parquet codebase).
>> >> >
>> >> > Hopefully, Igor can help out; otherwise I'll take up the read
>> >> > path after I finish the write path.
>> >> >
>> >> > -Micah
>> >> >
>> >> > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >>
>> >> >> hi Micah,
>> >> >>
>> >> >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >> >
>> >> >> > Just to give an update. I've been a little bit delayed, but my
>> >> >> > progress is as follows:
>> >> >> > 1. Had one PR merged that will exercise basic end-to-end tests.
>> >> >> > 2. Have another PR open that adds a configuration option in C++
>> >> >> > to determine which algorithm version to use for reading/writing:
>> >> >> > the existing version, or the new version supporting complex
>> >> >> > nested arrays. I think a large amount of code will be
>> >> >> > reused/delegated to, but I will err on the side of not touching
>> >> >> > the existing code/algorithms so that any errors or performance
>> >> >> > regressions in the new implementation can hopefully be
>> >> >> > mitigated at runtime. I expect that in later releases (once the
>> >> >> > code has "baked") this option will become a no-op.
>> >> >>
>> >> >> Glad to hear about the progress. As I mentioned on #2, what do
>> >> >> you think about setting up a feature branch for you to merge PRs
>> >> >> into? Then the branch can be iterated on, and we can merge it
>> >> >> back when it's feature complete and does not have perf
>> >> >> regressions for the flat read/write path.
>> >> >>
>> >> >> > 3. Started coding the write path.
>> >> >> >
>> >> >> > Which leaves:
>> >> >> > 1. Finishing the write path (I estimate 2-3 weeks to be code
>> >> >> > complete).
>> >> >> > 2. Implementing the read path.
>> >> >>
>> >> >> The earliest I'd have time to work on this myself would likely
>> >> >> be sometime in March. Others are welcome to jump in as well (and
>> >> >> it'd be great to increase the overall level of knowledge of the
>> >> >> Parquet codebase).
>> >> >>
>> >> >> > Again, I'm happy to collaborate if people have bandwidth and
>> >> >> > want to contribute.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Micah
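
For readers following along: the runtime switch described in item 2 of the
Feb 3 update above can be exposed through the writer properties. The
EngineVersion enum and set_engine_version() builder method below are
written from memory of what later landed in parquet/properties.h; treat
the exact names as an assumption rather than a committed API.

#include <memory>

#include "parquet/properties.h"

std::shared_ptr<parquet::ArrowWriterProperties> MakeWriterProperties(
    bool use_new_nested_engine) {
  parquet::ArrowWriterProperties::Builder builder;
  // V1: the existing level-generation code; V2: the new implementation
  // with full nested support. Keeping both selectable at runtime means a
  // bug or perf regression in V2 can be mitigated without rebuilding.
  builder.set_engine_version(use_new_nested_engine
                                 ? parquet::ArrowWriterProperties::V2
                                 : parquet::ArrowWriterProperties::V1);
  return builder.build();
}

Once the new engine has "baked" for a release or two, the flag can default
to the new version and eventually become a no-op, as Micah describes.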
>> >> >> >
>> >> >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >> >
>> >> >> > > Hi Wes,
>> >> >> > > I'm still interested in doing the work, but I don't want to
>> >> >> > > hold anybody up if they have bandwidth.
>> >> >> > >
>> >> >> > > In order to actually make progress on this, my plan will be to:
>> >> >> > > 1. Help with the current Java review backlog through early
>> >> >> > > next week or so (this has been taking the majority of my time
>> >> >> > > allocated for Arrow contributions for the last 6 months or so).
>> >> >> > > 2. Shift all my attention to trying to get this done (this
>> >> >> > > means no reviews other than closing out existing ones that
>> >> >> > > I've started until it is done). Hopefully, other Java
>> >> >> > > committers can help shrink the backlog further (Jacques,
>> >> >> > > thanks for your recent efforts here).
>> >> >> > >
>> >> >> > > Thanks,
>> >> >> > > Micah
>> >> >> > >
>> >> >> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >> > >
>> >> >> > >> hi folks,
>> >> >> > >>
>> >> >> > >> I think we have reached a point where the incomplete C++
>> >> >> > >> Parquet nested data assembly/disassembly is harming the
>> >> >> > >> value of several other parts of the project, for example the
>> >> >> > >> Datasets API. As another example, it's possible to ingest
>> >> >> > >> nested data from JSON but not, in general, to write it to
>> >> >> > >> Parquet.
>> >> >> > >>
>> >> >> > >> Implementing the nested data read and write paths completely
>> >> >> > >> is a difficult project requiring at least several weeks of
>> >> >> > >> dedicated work, so it's not so surprising that it hasn't
>> >> >> > >> been accomplished yet. I know that several people have
>> >> >> > >> expressed interest in working on it, but I would like to see
>> >> >> > >> if anyone would be able to volunteer a commitment of time
>> >> >> > >> and a rough guess at a timeline for when this work could be
>> >> >> > >> done. It seems to me that if this slips beyond 2020 it will
>> >> >> > >> significantly diminish the value being created by other
>> >> >> > >> parts of the project.
>> >> >> > >>
>> >> >> > >> Since I'm pretty familiar with all the Parquet code, I'm one
>> >> >> > >> candidate to take on this project (and I can dedicate the
>> >> >> > >> time, but it would come at the expense of other projects
>> >> >> > >> where I can also be useful). But Micah and others expressed
>> >> >> > >> interest in working on it, so I wanted to have a discussion
>> >> >> > >> to see what others think.
>> >> >> > >>
>> >> >> > >> Thanks
>> >> >> > >> Wes