>
> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.
>
I'd like to avoid a separate branch if possible.  I'm willing to close the
open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small
and focused as possible, with performance testing along the way, will be a
better reviewer and developer experience here.

> The earliest I'd have time to work on this myself would likely be
> sometime in March. Others are welcome to jump in as well (and it'd be
> great to increase the overall level of knowledge of the Parquet
> codebase)

Hopefully Igor can help out; otherwise I'll take up the read path after I
finish the write path.

-Micah

On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah
>
> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Just to give an update.  I've been a little bit delayed, but my progress is
> > as follows:
> > 1.  Had 1 PR merged that will exercise basic end-to-end tests.
> > 2.  Have another PR open that allows a configuration option in C++ to
> > determine which algorithm version to use for reading/writing: the existing
> > version or the new version supporting complex-nested arrays.  I think a
> > large amount of code will be reused/delegated to, but I will err on the side
> > of not touching the existing code/algorithms so that any errors in the
> > implementation or performance regressions can hopefully be mitigated at
> > runtime.  I expect that in later releases (once the code has "baked") the
> > option will become a no-op.
>
> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.
>
> > 3.  Started coding the write path.
> >
> > Which leaves:
> > 1.  Finishing the write path (I estimate 2-3 weeks) to be code complete
> > 2.  Implementing the read path.
>
> The earliest I'd have time to work on this myself would likely be
> sometime in March. Others are welcome to jump in as well (and it'd be
> great to increase the overall level of knowledge of the Parquet
> codebase)
>
> > Again, I'm happy to collaborate if people have bandwidth and want to
> > contribute.
> >
> > Thanks,
> > Micah
> >
> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Wes,
> > > I'm still interested in doing the work, but I don't want to hold anybody up if
> > > they have bandwidth.
> > >
> > > In order to actually make progress on this, my plan will be to:
> > > 1.  Help with the current Java review backlog through early next week or
> > > so (this has been taking the majority of my time allocated for Arrow
> > > contributions for the last 6 months or so).
> > > 2.  Shift all my attention to trying to get this done (this means no
> > > reviews other than closing out existing ones that I've started until it is
> > > done).  Hopefully, other Java committers can help shrink the backlog
> > > further (Jacques, thanks for your recent efforts here).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> hi folks,
> > >>
> > >> I think we have reached a point where the incomplete C++ Parquet
> > >> nested data assembly/disassembly is harming the value of several
> > >> other parts of the project, for example the Datasets API. As another
> > >> example, it's possible to ingest nested data from JSON but not write
> > >> it to Parquet in general.
> > >>
> > >> Implementing the nested data read and write path completely is a
> > >> difficult project requiring at least several weeks of dedicated work,
> > >> so it's not so surprising that it hasn't been accomplished yet. I know
> > >> that several people have expressed interest in working on it, but I
> > >> would like to see if anyone would be able to volunteer a commitment of
> > >> time and guess on a rough timeline when this work could be done. It
> > >> seems to me if this slips beyond 2020 it will significantly diminish the
> > >> value being created by other parts of the project.
> > >>
> > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
> > >> person to take on this project (and I can dedicate the time, but it
> > >> would come at the expense of other projects where I can also be
> > >> useful). But Micah and others expressed interest in working on it, so
> > >> I wanted to have a discussion about it to see what others think.
> > >>
> > >> Thanks
> > >> Wes
> > >>
> > >
>
