Another status update. I've integrated the level generation code with the parquet writing code [1].
After that PR is merged I'll add Python bindings to control which version of the level generation algorithm is used, and then plan on moving on to the read side.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/6586

On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Igor,
> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
> might be a good task to pick up for this. I think it should be a
> relatively small amount of code, so it is probably a good contribution to
> the project. Once that is wrapped up we can see where we both are.
>
> Cheers,
> Micah
>
> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com> wrote:
>
>> Hi Micah, I actually got involved with another personal project and had
>> to postpone my contribution to arrow a bit. The good news is that I'm
>> almost done with it, so I could help you with the read side very soon.
>> Any ideas on how we could coordinate this?
>>
>> On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Micah -- great news on the level generation PR. I'll try to carve
>>> out some time for reviewing over the coming week.
>>>
>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Igor,
>>>> I was wondering if you have made any progress on this?
>>>>
>>>> I posted a new PR [1] which I believe handles the difficult
>>>> algorithmic part of writing. There will be some follow-ups, but I
>>>> think this PR might take a while to review, so I was thinking of
>>>> starting to take a look at the read side if you haven't started yet,
>>>> and circling back to the final integration for the write side once
>>>> the PR is checked in.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://github.com/apache/arrow/pull/6490
>>>>
>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com> wrote:
>>>>
>>>>> Hi, I would love to help with this issue. I'm aware that this is a
>>>>> huge task for a first contribution to arrow, but I feel that I could
>>>>> help with the read path.
>>>>> Reading parquet seems like an extremely complex task, since both
>>>>> hive [0] and spark [1] tried to implement a "vectorized" version and
>>>>> both stopped short of supporting complex types.
>>>>> I wanted to at least give it a try and find out where the challenge
>>>>> lies.
>>>>>
>>>>> Since you guys are much more familiar with the current code base, I
>>>>> could use some starting tips so I don't fall into common pitfalls
>>>>> and whatnot.
>>>>>
>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
>>>>> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>>>>>
>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
>>>>>
>>>>>> Just to give an update. I've been a little bit delayed, but my
>>>>>> progress is as follows:
>>>>>> 1. Had one PR merged that will exercise basic end-to-end tests.
>>>>>> 2. Have another PR open that adds a configuration option in C++ to
>>>>>> determine which algorithm version to use for reading/writing: the
>>>>>> existing version, and the new version supporting complex nested
>>>>>> arrays. I think a large amount of code will be reused/delegated to,
>>>>>> but I will err on the side of not touching the existing
>>>>>> code/algorithms so that any errors in the implementation or
>>>>>> performance regressions can hopefully be mitigated at runtime. I
>>>>>> expect that in later releases (once the code has "baked") the
>>>>>> option will become a no-op.
>>>>>> 3. Started coding the write path.
>>>>>>
>>>>>> Which leaves:
>>>>>> 1. Finishing the write path (I estimate 2-3 weeks to be code
>>>>>> complete).
>>>>>> 2. Implementing the read path.
>>>>>>
>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
>>>>>> to contribute.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Wes,
>>>>>>> I'm still interested in doing the work, but I don't want to hold
>>>>>>> anybody up if they have bandwidth.
>>>>>>>
>>>>>>> In order to actually make progress on this, my plan is to:
>>>>>>> 1. Help with the current Java review backlog through early next
>>>>>>> week or so (this has been taking the majority of my time allocated
>>>>>>> for Arrow contributions for the last 6 months or so).
>>>>>>> 2. Shift all my attention to trying to get this done (this means
>>>>>>> no reviews other than closing out existing ones that I've started,
>>>>>>> until it is done). Hopefully, other Java committers can help
>>>>>>> shrink the backlog further (Jacques, thanks for your recent
>>>>>>> efforts here).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
>>>>>>>
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> I think we have reached a point where the incomplete C++ Parquet
>>>>>>>> nested data assembly/disassembly is harming the value of several
>>>>>>>> other parts of the project, for example the Datasets API. As
>>>>>>>> another example, it's possible to ingest nested data from JSON
>>>>>>>> but not, in general, write it to Parquet.
>>>>>>>>
>>>>>>>> Implementing the nested data read and write path completely is a
>>>>>>>> difficult project requiring at least several weeks of dedicated
>>>>>>>> work, so it's not so surprising that it hasn't been accomplished
>>>>>>>> yet. I know that several people have expressed interest in
>>>>>>>> working on it, but I would like to see if anyone would be able to
>>>>>>>> volunteer a commitment of time and a rough estimate of when this
>>>>>>>> work could be done. It seems to me that if this slips beyond 2020
>>>>>>>> it will significantly diminish the value being created by other
>>>>>>>> parts of the project.
>>>>>>>>
>>>>>>>> Since I'm pretty familiar with all the Parquet code, I'm one
>>>>>>>> candidate person to take on this project (and I can dedicate the
>>>>>>>> time, but it would come at the expense of other projects where I
>>>>>>>> can also be useful). But Micah and others expressed interest in
>>>>>>>> working on it, so I wanted to have a discussion about it to see
>>>>>>>> what others think.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wes
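[Editor's note] For readers unfamiliar with the "level generation" discussed in this thread: writing nested Arrow data to Parquet requires computing Dremel-style repetition and definition levels for each leaf column. The sketch below is purely illustrative, not the Arrow C++ implementation the PRs refer to (`generate_levels` is a hypothetical name); it shows the level encoding for the simplest nested case, an optional list of required ints, where max repetition level is 1 and max definition level is 2:

```python
# Illustrative sketch of Dremel-style level generation for one column
# whose type is an optional list of required ints. Not Arrow's API.

def generate_levels(records):
    """For each record (a list of ints, an empty list, or None) emit:
      definition level 0 -> list is null
      definition level 1 -> list present but empty
      definition level 2 -> value present
      repetition level 0 -> starts a new record
      repetition level 1 -> continues the current list
    Returns (rep_levels, def_levels, values)."""
    rep_levels, def_levels, values = [], [], []
    for lst in records:
        if lst is None:
            # Null list: one entry, nothing defined.
            rep_levels.append(0)
            def_levels.append(0)
        elif len(lst) == 0:
            # Empty list: the list node is defined, but no value is.
            rep_levels.append(0)
            def_levels.append(1)
        else:
            for i, v in enumerate(lst):
                rep_levels.append(0 if i == 0 else 1)
                def_levels.append(2)
                values.append(v)
    return rep_levels, def_levels, values

print(generate_levels([[1, 2], None, [], [3]]))
# -> ([0, 1, 0, 0, 0], [2, 2, 0, 1, 2], [1, 2, 3])
```

Note how null and empty lists produce no values yet remain distinguishable through the definition level alone; generalizing this to arbitrarily deep nesting of lists and structs on both the write and read paths is the hard algorithmic work the thread describes.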