Re: Coordinating / scheduling C++ Parquet-Arrow nested data work (ARROW-1644 and others)

Micah Kornfield Sun, 02 Feb 2020 22:02:25 -0800

Just to give an update.  I've been a little bit delayed, but my progress is
as follows:
1.  Had 1 PR merged that will exercise basic end-to-end tests.
2.  Have another PR open that adds a configuration option in C++ to
determine which algorithm version to use for reading/writing: the existing
version or the new version that supports complex nested arrays.  I think a
large amount of code will be reused/delegated to, but I will err on the side
of not touching the existing code/algorithms so that any errors in the
implementation or performance regressions can hopefully be mitigated at
runtime.  I expect that in later releases (once the code has "baked") the
option will become a no-op.
3.  Started coding the write path.


Which leaves:
1.  Finishing the write path (I estimate 2-3 weeks to be code complete).
2.  Implementing the read path.

Again, I'm happy to collaborate if people have bandwidth and want to
contribute.

Thanks,
Micah

On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <[email protected]>
wrote:

> Hi Wes,
> I'm still interested in doing the work.  But I don't want to hold anybody
> up if they have bandwidth.
>
> In order to actually make progress on this, my plan will be to:
> 1.  Help with the current Java review backlog through early next week or
> so (this has been taking the majority of my time allocated for Arrow
> contributions for the last 6 months or so).
> 2.  Shift all my attention to trying to get this done (this means no
> reviews other than closing out existing ones that I've started until it is
> done).  Hopefully, other Java committers can help shrink the backlog
> further (Jacques, thanks for your recent efforts here).
>
> Thanks,
> Micah
>
> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <[email protected]> wrote:
>
>> hi folks,
>>
>> I think we have reached a point where the incomplete C++ Parquet
>> nested data assembly/disassembly is harming the value of several
>> other parts of the project, for example the Datasets API. As another
>> example, it's possible to ingest nested data from JSON but not write
>> it to Parquet in general.
>>
>> Implementing the nested data read and write path completely is a
>> difficult project requiring at least several weeks of dedicated work,
>> so it's not so surprising that it hasn't been accomplished yet. I know
>> that several people have expressed interest in working on it, but I
>> would like to see if anyone would be able to volunteer a commitment of
>> time and guess on a rough timeline when this work could be done. It
>> seems to me if this slips beyond 2020 it will significantly diminish the
>> value being created by other parts of the project.
>>
>> Since I'm pretty familiar with all the Parquet code I'm one candidate
>> person to take on this project (and I can dedicate the time, but it
>> would come at the expense of other projects where I can also be
>> useful). But Micah and others expressed interest in working on it, so
>> I wanted to have a discussion about it to see what others think.
>>
>> Thanks
>> Wes
>>
>
