Sounds good. In general, I would say this is a good opportunity to make
improvements around random data generation. For example, I don't think we
have an API for generating a RecordBatch given a schema and some options
(e.g. probability of nulls, distribution of list sizes), but that would be
a good thing to have to assist with both perf and correctness testing.
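
For concreteness, here is a minimal sketch of what such a generator could
look like for a single list<int32> column. RandomBatchOptions and
GenerateRandomBatch are hypothetical names (no such API exists today); the
builder calls are the existing arrow:: ones. A real version would walk an
arbitrary schema and dispatch per type.

#include <memory>
#include <random>

#include "arrow/api.h"

struct RandomBatchOptions {
  double null_probability = 0.1;  // chance that any nullable slot is null
  int32_t max_list_length = 5;    // list sizes drawn uniformly from [0, max]
  uint32_t seed = 42;
};

arrow::Result<std::shared_ptr<arrow::RecordBatch>> GenerateRandomBatch(
    int64_t num_rows, const RandomBatchOptions& options) {
  std::mt19937 rng(options.seed);
  std::bernoulli_distribution is_null(options.null_probability);
  std::uniform_int_distribution<int32_t> list_len(0, options.max_list_length);
  std::uniform_int_distribution<int32_t> value(-1000, 1000);

  auto value_builder = std::make_shared<arrow::Int32Builder>();
  arrow::ListBuilder list_builder(arrow::default_memory_pool(), value_builder);

  for (int64_t i = 0; i < num_rows; ++i) {
    if (is_null(rng)) {
      ARROW_RETURN_NOT_OK(list_builder.AppendNull());  // null list
      continue;
    }
    ARROW_RETURN_NOT_OK(list_builder.Append());  // start a new list
    for (int32_t j = list_len(rng); j > 0; --j) {
      if (is_null(rng)) {
        ARROW_RETURN_NOT_OK(value_builder->AppendNull());  // null element
      } else {
        ARROW_RETURN_NOT_OK(value_builder->Append(value(rng)));
      }
    }
  }

  std::shared_ptr<arrow::Array> column;
  ARROW_RETURN_NOT_OK(list_builder.Finish(&column));
  auto schema =
      arrow::schema({arrow::field("values", arrow::list(arrow::int32()))});
  return arrow::RecordBatch::Make(schema, num_rows, {column});
}

A fixed seed keeps failures reproducible, which matters more for round-trip
correctness testing than true randomness does.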
On Thu, Apr 16, 2020 at 11:28 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Wes,
> Thanks, that seems like a good characterization. I opened up some JIRA
> subtasks on ARROW-1644 which go into a little more detail on tasks that
> can probably be worked on in parallel (I've only assigned myself the
> ones I'm actively working on; happy to discuss/collaborate on the finer
> points on the JIRAs). There will probably be a few more JIRAs to open
> for the final integration work (e.g. a flag to switch between the old
> and new engines).
>
> For unit tests (Item B), as noted earlier in the thread, there is
> already a disabled unit test trying to verify the basic ability to
> round-trip, but that probably isn't sufficient.
>
> Thanks,
> Micah
>
> On Wed, Apr 15, 2020 at 9:32 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Micah,
>>
>> Sounds good. It seems like there are a few projects where people might
>> be able to work without stepping on each other's toes:
>>
>> A. Array reassembly from raw repetition/definition levels (I would
>> guess this would be your focus)
>> B. Schema and data generation for round-trip correctness and
>> performance testing (I reckon that the unit tests for A will largely
>> be hand-written examples like you did for the write path)
>> C. Benchmarks, particularly to be able to assess performance changes
>> going from the old incomplete implementations to the new ones
>>
>> Some of us should be able to pitch in to help with this. It might also
>> be a good opportunity to do some cleanup of the test code in
>> cpp/src/parquet/arrow.
>>
>> - Wes
>>
>> On Tue, Apr 14, 2020 at 11:19 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >
>> > Hi Wes,
>> > Yes, I'm making progress, and at this point I anticipate being able
>> > to finish it off by the next release, possibly without support for
>> > round-tripping fixed-size lists. I've been spending some time
>> > thinking about different approaches and have started coding some of
>> > the building blocks, which I think should be fairly performant in
>> > the common case (relatively low nesting levels); I'm also going to
>> > write some benchmarks to sanity-check this. One caveat: my schedule
>> > is going to change slightly next week, and it's possible my
>> > bandwidth will be more limited; I'll update the list if this happens.
>> >
>> > I think there are at least two areas that I'm not working on that
>> > could be parallelized if you or your team has bandwidth:
>> >
>> > 1. It would be good to have some Parquet files representing
>> > real-world datasets available to benchmark against.
>> > 2. The higher-level bookkeeping of tracking which
>> > def-levels/rep-levels need to be compared against for any particular
>> > column (i.e. the preceding repeated parent). I'm currently working
>> > on the code that takes these and converts them to offsets/null
>> > fields.
>> >
>> > I can go into more detail if you or your team would like to
>> > collaborate.
>> >
>> > Thanks,
>> > Micah
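
To make the def-level/rep-level bookkeeping above concrete: for a single
optional list<optional int32> column (max definition level 3, max
repetition level 1), converting levels back into Arrow offsets and
validity reduces to a loop like the sketch below. This is an illustration
of the core idea, not Micah's actual implementation; real code has to
handle arbitrary nesting depth and the value decoding as well.

#include <cstdint>
#include <vector>

struct DecodedList {
  std::vector<int32_t> offsets{0};  // Arrow-style list offsets
  std::vector<bool> list_valid;     // one entry per row
  std::vector<bool> element_valid;  // one entry per stored element
};

// rep_levels and def_levels must have the same length.
DecodedList LevelsToList(const std::vector<int16_t>& rep_levels,
                         const std::vector<int16_t>& def_levels) {
  DecodedList out;
  int32_t element_count = 0;
  for (size_t i = 0; i < def_levels.size(); ++i) {
    if (rep_levels[i] == 0) {  // rep 0: this entry starts a new row
      if (i != 0) out.offsets.push_back(element_count);
      out.list_valid.push_back(def_levels[i] >= 1);  // def 0: null list
    }
    if (def_levels[i] >= 2) {  // def >= 2: a physical element slot exists
      out.element_valid.push_back(def_levels[i] == 3);  // def 2: null element
      ++element_count;
    }
  }
  out.offsets.push_back(element_count);
  return out;
}

// Example: rows [[1, null, 2], null, []]
//   rep = {0, 1, 1, 0, 0}, def = {3, 2, 3, 0, 1}
//   -> offsets {0, 3, 3, 3}, list_valid {1, 0, 1}, element_valid {1, 0, 1}

The subtlety the bookkeeping has to capture is which def level means "null
at this nesting level" versus "empty at this level": each optional or
repeated ancestor of a column shifts those thresholds, which is why they
must be tracked per column.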
>> >
>> > On Tue, Apr 14, 2020 at 7:48 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >>
>> >> hi Micah,
>> >>
>> >> I'm glad that we have the write side of nested data completed for
>> >> 0.17.0.
>> >>
>> >> As far as completing the read side and then implementing sufficient
>> >> testing to exercise corner cases in end-to-end reads/writes, do you
>> >> anticipate being able to work on this in the next 4-6 weeks
>> >> (obviously the state of the world has affected everyone's
>> >> availability / bandwidth)? I ask because someone from my team (or I)
>> >> may be able to get involved and help this move along. It'd be great
>> >> to have this 100% completed and checked off our list for the next
>> >> release (i.e. 0.18.0 or 1.0.0, depending on whether the Java/C++
>> >> integration tests also get completed).
>> >>
>> >> thanks
>> >> Wes
>> >>
>> >> On Wed, Feb 5, 2020 at 12:12 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >
>> >> >> Glad to hear about the progress. As I mentioned on #2, what do
>> >> >> you think about setting up a feature branch for you to merge PRs
>> >> >> into? Then the branch can be iterated on, and we can merge it
>> >> >> back when it's feature complete and does not have perf
>> >> >> regressions for the flat read/write path.
>> >> >
>> >> > I'd like to avoid a separate branch if possible. I'm willing to
>> >> > close the open PR until I'm sure it is needed, but I'm hoping that
>> >> > keeping PRs as small and focused as possible, with performance
>> >> > testing along the way, will make for a better reviewer and
>> >> > developer experience here.
>> >> >
>> >> >> The earliest I'd have time to work on this myself would likely
>> >> >> be sometime in March. Others are welcome to jump in as well (and
>> >> >> it'd be great to increase the overall level of knowledge of the
>> >> >> Parquet codebase).
>> >> >
>> >> > Hopefully, Igor can help out; otherwise I'll take up the read
>> >> > path after I finish the write path.
>> >> >
>> >> > -Micah
>> >> >
>> >> > On Tue, Feb 4, 2020 at 3:31 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >>
>> >> >> hi Micah,
>> >> >>
>> >> >> On Mon, Feb 3, 2020 at 12:01 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >> >
>> >> >> > Just to give an update. I've been a little bit delayed, but my
>> >> >> > progress is as follows:
>> >> >> > 1. Had one PR merged that will exercise basic end-to-end tests.
>> >> >> > 2. Have another PR open that adds a configuration option in C++
>> >> >> > to determine which algorithm version to use for reading/writing:
>> >> >> > the existing version, or the new version supporting complex
>> >> >> > nested arrays. I think a large amount of code will be
>> >> >> > reused/delegated to, but I will err on the side of not touching
>> >> >> > the existing code/algorithms so that any errors or performance
>> >> >> > regressions in the new implementation can hopefully be
>> >> >> > mitigated at runtime. I expect that in later releases (once the
>> >> >> > code has "baked") this option will become a no-op.
>> >> >>
>> >> >> Glad to hear about the progress. As I mentioned on #2, what do
>> >> >> you think about setting up a feature branch for you to merge PRs
>> >> >> into? Then the branch can be iterated on, and we can merge it
>> >> >> back when it's feature complete and does not have perf
>> >> >> regressions for the flat read/write path.
>> >> >>
>> >> >> > 3. Started coding the write path.
>> >> >> >
>> >> >> > Which leaves:
>> >> >> > 1. Finishing the write path (I estimate 2-3 weeks to be code
>> >> >> > complete).
>> >> >> > 2. Implementing the read path.
>> >> >>
>> >> >> The earliest I'd have time to work on this myself would likely
>> >> >> be sometime in March. Others are welcome to jump in as well (and
>> >> >> it'd be great to increase the overall level of knowledge of the
>> >> >> Parquet codebase).
>> >> >>
>> >> >> > Again, I'm happy to collaborate if people have bandwidth and
>> >> >> > want to contribute.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Micah
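
For readers following along: the runtime switch described in item 2 of the
Feb 3 update above can be exposed through the writer properties. The
EngineVersion enum and set_engine_version() builder method below are
written from memory of what later landed in parquet/properties.h; treat
the exact names as an assumption rather than a committed API.

#include <memory>

#include "parquet/properties.h"

std::shared_ptr<parquet::ArrowWriterProperties> MakeWriterProperties(
    bool use_new_nested_engine) {
  parquet::ArrowWriterProperties::Builder builder;
  // V1: the existing level-generation code; V2: the new implementation
  // with full nested support. Keeping both selectable at runtime means a
  // bug or perf regression in V2 can be mitigated without rebuilding.
  builder.set_engine_version(use_new_nested_engine
                                 ? parquet::ArrowWriterProperties::V2
                                 : parquet::ArrowWriterProperties::V1);
  return builder.build();
}

Once the new engine has "baked" for a release or two, the flag can default
to the new version and eventually become a no-op, as Micah describes.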
>> >> >> >
>> >> >> > On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> >> >
>> >> >> > > Hi Wes,
>> >> >> > > I'm still interested in doing the work, but I don't want to
>> >> >> > > hold anybody up if they have bandwidth.
>> >> >> > >
>> >> >> > > In order to actually make progress on this, my plan will be to:
>> >> >> > > 1. Help with the current Java review backlog through early
>> >> >> > > next week or so (this has been taking the majority of my time
>> >> >> > > allocated for Arrow contributions for the last 6 months or so).
>> >> >> > > 2. Shift all my attention to trying to get this done (this
>> >> >> > > means no reviews other than closing out existing ones that
>> >> >> > > I've started until it is done). Hopefully, other Java
>> >> >> > > committers can help shrink the backlog further (Jacques,
>> >> >> > > thanks for your recent efforts here).
>> >> >> > >
>> >> >> > > Thanks,
>> >> >> > > Micah
>> >> >> > >
>> >> >> > > On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >> > >
>> >> >> > >> hi folks,
>> >> >> > >>
>> >> >> > >> I think we have reached a point where the incomplete C++
>> >> >> > >> Parquet nested data assembly/disassembly is harming the
>> >> >> > >> value of several other parts of the project, for example the
>> >> >> > >> Datasets API. As another example, it's possible to ingest
>> >> >> > >> nested data from JSON but not, in general, to write it to
>> >> >> > >> Parquet.
>> >> >> > >>
>> >> >> > >> Implementing the nested data read and write paths completely
>> >> >> > >> is a difficult project requiring at least several weeks of
>> >> >> > >> dedicated work, so it's not so surprising that it hasn't
>> >> >> > >> been accomplished yet. I know that several people have
>> >> >> > >> expressed interest in working on it, but I would like to see
>> >> >> > >> if anyone would be able to volunteer a commitment of time
>> >> >> > >> and a rough guess at a timeline for when this work could be
>> >> >> > >> done. It seems to me that if this slips beyond 2020 it will
>> >> >> > >> significantly diminish the value being created by other
>> >> >> > >> parts of the project.
>> >> >> > >>
>> >> >> > >> Since I'm pretty familiar with all the Parquet code, I'm one
>> >> >> > >> candidate to take on this project (and I can dedicate the
>> >> >> > >> time, but it would come at the expense of other projects
>> >> >> > >> where I can also be useful). But Micah and others expressed
>> >> >> > >> interest in working on it, so I wanted to have a discussion
>> >> >> > >> to see what others think.
>> >> >> > >>
>> >> >> > >> Thanks
>> >> >> > >> Wes