> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.

I'd like to avoid a separate branch if possible. I'm willing to close the open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small and focused as possible, with performance testing along the way, will be a better reviewer and developer experience here.

> The earliest I'd have time to work on this myself would likely be
> sometime in March. Others are welcome to jump in as well (and it'd be
> great to increase the overall level of knowledge of the Parquet
> codebase)

Hopefully Igor can help out; otherwise I'll take up the read path after I finish the write path.

-Micah

On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah
>
> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Just to give an update. I've been a little bit delayed, but my progress
> > is as follows:
> > 1. Had 1 PR merged that will exercise basic end-to-end tests.
> > 2. Have another PR open that allows a configuration option in C++ to
> > determine which algorithm version to use for reading/writing: the
> > existing version or the new version that supports complex nested arrays.
> > I think a large amount of code will be reused/delegated to, but I will
> > err on the side of not touching the existing code/algorithms so that any
> > errors in the implementation or performance regressions can hopefully be
> > mitigated at runtime. I expect that in later releases (once the code has
> > "baked") the option will become a no-op.
>
> Glad to hear about the progress. As I mentioned on #2, what do you
> think about setting up a feature branch for you to merge PRs into?
> Then the branch can be iterated on and we can merge it back when it's
> feature complete and does not have perf regressions for the flat
> read/write path.
>
> > 3. Started coding the write path.
> >
> > Which leaves:
> > 1. Finishing the write path (I estimate 2-3 weeks) to be code complete
> > 2. Implementing the read path.
>
> The earliest I'd have time to work on this myself would likely be
> sometime in March. Others are welcome to jump in as well (and it'd be
> great to increase the overall level of knowledge of the Parquet
> codebase)
>
> > Again, I'm happy to collaborate if people have bandwidth and want to
> > contribute.
> >
> > Thanks,
> > Micah
> >
> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Wes,
> > > I'm still interested in doing the work, but I don't want to hold
> > > anybody up if they have bandwidth.
> > >
> > > In order to actually make progress on this, my plan will be to:
> > > 1. Help with the current Java review backlog through early next week
> > > or so (this has been taking the majority of my time allocated for
> > > Arrow contributions for the last 6 months or so).
> > > 2. Shift all my attention to trying to get this done (this means no
> > > reviews other than closing out existing ones that I've started until
> > > it is done). Hopefully, other Java committers can help shrink the
> > > backlog further (Jacques, thanks for your recent efforts here).
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > >> hi folks,
> > >>
> > >> I think we have reached a point where the incomplete C++ Parquet
> > >> nested data assembly/disassembly is harming the value of several
> > >> other parts of the project, for example the Datasets API. As another
> > >> example, it's possible to ingest nested data from JSON but not write
> > >> it to Parquet in general.
> > >>
> > >> Implementing the nested data read and write path completely is a
> > >> difficult project requiring at least several weeks of dedicated work,
> > >> so it's not so surprising that it hasn't been accomplished yet. I know
> > >> that several people have expressed interest in working on it, but I
> > >> would like to see if anyone would be able to volunteer a commitment of
> > >> time and guess on a rough timeline when this work could be done. It
> > >> seems to me if this slips beyond 2020 it will significantly diminish
> > >> the value being created by other parts of the project.
> > >>
> > >> Since I'm pretty familiar with all the Parquet code I'm one candidate
> > >> person to take on this project (and I can dedicate the time, but it
> > >> would come at the expense of other projects where I can also be
> > >> useful). But Micah and others expressed interest in working on it, so
> > >> I wanted to have a discussion about it to see what others think.
> > >>
> > >> Thanks
> > >> Wes
> > >>
> > >
> >