Hi Wes,
Thanks, that seems like a good characterization. I opened up some JIRA subtasks on ARROW-1644 which go into a little more detail on tasks that can probably be worked on in parallel (I've only assigned to myself the ones I'm actively working on; happy to discuss/collaborate on the finer points on the JIRAs). There will probably be a few more JIRAs to open for the final integration work (e.g. a flag to switch between the old and new engines).
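For concreteness, a minimal sketch of what such a switch could look like (all names here are hypothetical; the real option and its plumbing would be settled on the integration JIRAs):

    // Hypothetical engine selector; defaults to the existing implementation
    // so the new code path stays opt-in while it bakes.
    enum class NestedEngineVersion {
      kV1,  // existing implementation (flat and partially nested paths)
      kV2   // new implementation with full nested read/write support
    };

    struct ArrowReadWriteOptions {
      NestedEngineVersion engine = NestedEngineVersion::kV1;
    };

    // The read/write entry points would branch once on the option:
    //   if (options.engine == NestedEngineVersion::kV2) { /* new path */ }
    //   else                                            { /* existing path */ }

Branching once at the entry points leaves the old code untouched, so a regression in the new engine can be mitigated at runtime rather than with a rebuild; once the new engine has baked, the option can become a no-op.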
For unit tests (Item B), as noted earlier in the thread, there is already a disabled unit test trying to verify the basic ability to round-trip, but that probably isn't sufficient.

Thanks,
Micah

On Wed, Apr 15, 2020 at 9:32 AM Wes McKinney <wesmck...@gmail.com> wrote:
> hi Micah,
>
> Sounds good. It seems like there are a few projects where people might be able to work without stepping on each other's toes:
>
> A. Array reassembly from raw repetition/definition levels (I would guess this would be your focus)
> B. Schema and data generation for round-trip correctness and performance testing (I reckon that the unit tests for A will largely be hand-written examples like you did for the write path)
> C. Benchmarks, particularly to be able to assess performance changes going from the old incomplete implementations to the new ones
>
> Some of us should be able to pitch in to help with this. Might also be a good opportunity to do some cleanup of the test code in cpp/src/parquet/arrow
>
> - Wes
>
> On Tue, Apr 14, 2020 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Hi Wes,
> > Yes, I'm making progress, and at this point I anticipate being able to finish it off by the next release, possibly without support for round-tripping fixed-size lists. I've been spending some time thinking about different approaches and have started coding some of the building blocks, which I think in the common case (relatively low nesting levels) should be fairly performant (I'm also going to write some benchmarks to sanity-check this). One caveat to this is that my schedule is going to change slightly next week and it's possible my bandwidth will be more limited; I'll update the list if this happens.
> >
> > I think there are at least two areas that I'm not working on that could be parallelized if you or your team has bandwidth:
> >
> > 1. It would be good to have some Parquet files representing real-world datasets available to benchmark against.
> > 2. The higher-level bookkeeping of tracking which def-levels/rep-levels need to be compared against for any particular column (i.e. the preceding repeated parent). I'm currently working on the code that takes these and converts them to offsets/null fields.
> >
> > I can go into more detail if you or your team would like to collaborate.
> >
> > Thanks,
> > Micah
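To make the offsets/null-fields conversion concrete, here is a minimal sketch for the simplest case, an optional list of required int32 values with a single repetition level (a sketch only; the real code must handle arbitrary nesting and levels arriving in batches):

    #include <cstdint>
    #include <vector>

    // For an optional list<int32 not null>, max_rep_level = 1 and
    // max_def_level = 2: def 0 = null list, def 1 = empty list,
    // def 2 = a present leaf value.
    struct DecodedList {
      std::vector<int32_t> offsets{0};  // length = number of lists + 1
      std::vector<bool> valid;          // one entry per list
    };

    DecodedList DecodeLevels(const std::vector<int16_t>& rep_levels,
                             const std::vector<int16_t>& def_levels) {
      DecodedList out;
      for (size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
          // rep 0 starts a new top-level list entry.
          out.valid.push_back(def_levels[i] > 0);  // def 0 means a null list
          out.offsets.push_back(out.offsets.back());
        }
        if (def_levels[i] == 2) {
          // A present leaf value extends the current list.
          ++out.offsets.back();
        }
      }
      return out;
    }

For example, the lists [[1, 2], [], null, [3]] arrive as rep levels 0 1 0 0 0 and def levels 2 2 1 0 2, and decode to offsets 0 2 2 2 3 with validity true true false true: null and empty lists both get zero extent and are distinguished only by the validity bitmap.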
> > On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> hi Micah,
> >>
> >> I'm glad that we have the write side of nested completed for 0.17.0.
> >>
> >> As far as completing the read side and then implementing sufficient testing to exercise corner cases in end-to-end reads/writes, do you anticipate being able to work on this in the next 4-6 weeks (obviously the state of the world has affected everyone's availability / bandwidth)? I ask because someone from my team (or me also) may be able to get involved and help this move along. It'd be great to have this 100% completed and checked off our list for the next release (i.e. 0.18.0 or 1.0.0 depending on whether the Java/C++ integration tests get completed also)
> >>
> >> thanks
> >> Wes
> >>
> >> On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >>
> >> >> Glad to hear about the progress. As I mentioned on #2, what do you think about setting up a feature branch for you to merge PRs into? Then the branch can be iterated on and we can merge it back when it's feature complete and does not have perf regressions for the flat read/write path.
> >> >>
> >> > I'd like to avoid a separate branch if possible. I'm willing to close the open PR until I'm sure it is needed, but I'm hoping that keeping PRs as small and focused as possible, with performance testing along the way, will make for a better reviewer and developer experience here.
> >> >
> >> >> The earliest I'd have time to work on this myself would likely be sometime in March. Others are welcome to jump in as well (and it'd be great to increase the overall level of knowledge of the Parquet codebase)
> >> >
> >> > Hopefully, Igor can help out; otherwise I'll take up the read path after I finish the write path.
> >> >
> >> > -Micah
> >> >
> >> > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>
> >> >> hi Micah
> >> >>
> >> >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >> >
> >> >> > Just to give an update. I've been a little bit delayed, but my progress is as follows:
> >> >> > 1. Had one PR merged that will exercise basic end-to-end tests.
> >> >> > 2. Have another PR open that adds a configuration option in C++ to determine which algorithm version to use for reading/writing: the existing version or the new version supporting complex nested arrays. I think a large amount of code will be reused/delegated to, but I will err on the side of not touching the existing code/algorithms, so that any errors in the implementation or performance regressions can hopefully be mitigated at runtime. I expect that in later releases (once the code has "baked") the option will become a no-op.
> >> >>
> >> >> Glad to hear about the progress. As I mentioned on #2, what do you think about setting up a feature branch for you to merge PRs into? Then the branch can be iterated on and we can merge it back when it's feature complete and does not have perf regressions for the flat read/write path.
> >> >>
> >> >> > 3. Started coding the write path.
> >> >> >
> >> >> > Which leaves:
> >> >> > 1. Finishing the write path (I estimate 2-3 weeks to be code complete).
> >> >> > 2. Implementing the read path.
> >> >>
> >> >> The earliest I'd have time to work on this myself would likely be sometime in March. Others are welcome to jump in as well (and it'd be great to increase the overall level of knowledge of the Parquet codebase)
> >> >>
> >> >> > Again, I'm happy to collaborate if people have bandwidth and want to contribute.
> >> >> >
> >> >> > Thanks,
> >> >> > Micah
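To sketch what one of those basic end-to-end tests might look like (hedged: WriteThenRead is a hypothetical helper wrapping the Parquet write/read plumbing; ArrayFromJSON comes from Arrow's gtest utilities):

    #include "arrow/api.h"
    #include "arrow/testing/gtest_util.h"
    #include "gtest/gtest.h"

    // Round-trip in the spirit of item B: build a nested array, write it
    // to Parquet, read it back, and compare for equality.
    TEST(NestedRoundTrip, ListOfInt32) {
      auto type = arrow::list(arrow::int32());
      auto array = arrow::ArrayFromJSON(type, "[[1, 2], [], null, [3]]");
      auto table = arrow::Table::Make(
          arrow::schema({arrow::field("col", type)}), {array});
      // WriteThenRead (hypothetical): write `table` to an in-memory
      // Parquet file, then read it back into a new Table.
      std::shared_ptr<arrow::Table> result = WriteThenRead(table);
      ASSERT_TRUE(table->Equals(*result));
    }

Hand-written cases like this keep the expected levels easy to reason about; generated schemas and data (item B) would then cover the combinatorial space.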
> >> >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >> > >
> >> >> > > Hi Wes,
> >> >> > > I'm still interested in doing the work, but I don't want to hold anybody up if they have bandwidth.
> >> >> > >
> >> >> > > In order to actually make progress on this, my plan will be to:
> >> >> > > 1. Help with the current Java review backlog through early next week or so (this has been taking the majority of my time allocated for Arrow contributions for the last 6 months or so).
> >> >> > > 2. Shift all my attention to trying to get this done (this means no reviews other than closing out existing ones that I've started until it is done). Hopefully, other Java committers can help shrink the backlog further (Jacques, thanks for your recent efforts here).
> >> >> > >
> >> >> > > Thanks,
> >> >> > > Micah
> >> >> > >
> >> >> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> > >>
> >> >> > >> hi folks,
> >> >> > >>
> >> >> > >> I think we have reached a point where the incomplete C++ Parquet nested data assembly/disassembly is harming the value of several other parts of the project, for example the Datasets API. As another example, it's possible to ingest nested data from JSON but not, in general, to write it to Parquet.
> >> >> > >>
> >> >> > >> Implementing the nested data read and write path completely is a difficult project requiring at least several weeks of dedicated work, so it's not so surprising that it hasn't been accomplished yet. I know that several people have expressed interest in working on it, but I would like to see if anyone would be able to volunteer a commitment of time and a rough guess at when this work could be done. It seems to me that if this slips beyond 2020 it will significantly diminish the value being created by other parts of the project.
> >> >> > >>
> >> >> > >> Since I'm pretty familiar with all the Parquet code, I'm one candidate to take on this project (and I can dedicate the time, but it would come at the expense of other projects where I can also be useful). But Micah and others expressed interest in working on it, so I wanted to have a discussion about it to see what others think.
> >> >> > >>
> >> >> > >> Thanks
> >> >> > >> Wes
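Relating back to item C above, a skeletal benchmark in the Google Benchmark style used elsewhere in the Arrow C++ tree (both helpers are hypothetical placeholders for the real write/read plumbing):

    #include <cstdint>
    #include "benchmark/benchmark.h"

    constexpr int64_t kNumRows = 1 << 20;

    // BuildNestedParquetFile (hypothetical) writes kNumRows rows of a
    // list<int32> column to an in-memory buffer; ReadNestedParquetFile
    // (hypothetical) reads that buffer back into an arrow::Table.
    static void BM_ReadNestedColumn(benchmark::State& state) {
      auto buffer = BuildNestedParquetFile(kNumRows);
      for (auto _ : state) {
        auto table = ReadNestedParquetFile(buffer);
        benchmark::DoNotOptimize(table);
      }
      state.SetItemsProcessed(state.iterations() * kNumRows);
    }
    BENCHMARK(BM_ReadNestedColumn);

Running the same benchmark with the engine flag set first to the old and then to the new implementation gives the before/after comparison the thread asks for.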