Another status update. I've integrated the level generation code with the parquet writing code [1].
After that PR is merged I'll add Python bindings to control which version of the level generation algorithm is used, and then plan on moving on to the read side.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/6586

On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Igor,
> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
> might be a good task to pick up for this. I think it should be a
> relatively small amount of code, so it is probably a good contribution to
> the project. Once that is wrapped up we can see where we both are.
>
> Cheers,
> Micah
>
> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <igor.calab...@gmail.com> wrote:
>
>> Hi Micah, I actually got involved with another personal project and had
>> to postpone my contribution to arrow a bit. The good news is that I'm
>> almost done with it, so I could help you with the read side very soon.
>> Any ideas on how we could coordinate this?
>>
>> On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Micah -- great news on the level generation PR. I'll try to carve
>>> out some time for reviewing over the coming week.
>>>
>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>>> Hi Igor,
>>>> I was wondering if you have made any progress on this?
>>>>
>>>> I posted a new PR [1] which I believe handles the difficult
>>>> algorithmic part of writing. There will be some follow-ups, but I
>>>> think this PR might take a while to review, so I was thinking of
>>>> starting to take a look at the read side if you haven't started yet,
>>>> and circling back to the final integration for the write side once
>>>> the PR is checked in.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1] https://github.com/apache/arrow/pull/6490
>>>>
>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <igor.calab...@gmail.com> wrote:
>>>>
>>>>> Hi, I would love to help with this issue. I'm aware that this is a
>>>>> huge task for a first contribution to arrow, but I feel that I could
>>>>> help with the read path.
>>>>> Reading parquet seems like an extremely complex task, since both
>>>>> hive [0] and spark [1] tried to implement a "vectorized" version and
>>>>> both stopped short of supporting complex types.
>>>>> I wanted to at least give it a try and find out where the challenge
>>>>> lies.
>>>>>
>>>>> Since you guys are much more familiar with the current code base, I
>>>>> could use some starting tips so I don't fall into common pitfalls
>>>>> and whatnot.
>>>>>
>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
>>>>> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>>>>>
>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> wrote:
>>>>>
>>>>>> Just to give an update. I've been a little bit delayed, but my
>>>>>> progress is as follows:
>>>>>> 1. Had one PR merged that will exercise basic end-to-end tests.
>>>>>> 2. Have another PR open that adds a configuration option in C++ to
>>>>>> determine which algorithm version to use for reading/writing: the
>>>>>> existing version, and the new version supporting complex nested
>>>>>> arrays. I think a large amount of code will be reused/delegated to,
>>>>>> but I will err on the side of not touching the existing
>>>>>> code/algorithms so that any errors in the implementation or
>>>>>> performance regressions can hopefully be mitigated at runtime. I
>>>>>> expect that in later releases (once the code has "baked") the
>>>>>> option will become a no-op.
>>>>>> 3. Started coding the write path.
>>>>>>
>>>>>> Which leaves:
>>>>>> 1. Finishing the write path (I estimate 2-3 weeks to be code
>>>>>> complete).
>>>>>> 2. Implementing the read path.
>>>>>>
>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
>>>>>> to contribute.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Wes,
>>>>>>> I'm still interested in doing the work, but I don't want to hold
>>>>>>> anybody up if they have bandwidth.
>>>>>>>
>>>>>>> In order to actually make progress on this, my plan is to:
>>>>>>> 1. Help with the current Java review backlog through early next
>>>>>>> week or so (this has been taking the majority of my time allocated
>>>>>>> for Arrow contributions for the last 6 months or so).
>>>>>>> 2. Shift all my attention to trying to get this done (this means
>>>>>>> no reviews other than closing out existing ones that I've started,
>>>>>>> until it is done). Hopefully, other Java committers can help
>>>>>>> shrink the backlog further (Jacques, thanks for your recent
>>>>>>> efforts here).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <we...@gmail.com> wrote:
>>>>>>>
>>>>>>>> hi folks,
>>>>>>>>
>>>>>>>> I think we have reached a point where the incomplete C++ Parquet
>>>>>>>> nested data assembly/disassembly is harming the value of several
>>>>>>>> other parts of the project, for example the Datasets API. As
>>>>>>>> another example, it's possible to ingest nested data from JSON
>>>>>>>> but not, in general, write it to Parquet.
>>>>>>>>
>>>>>>>> Implementing the nested data read and write path completely is a
>>>>>>>> difficult project requiring at least several weeks of dedicated
>>>>>>>> work, so it's not so surprising that it hasn't been accomplished
>>>>>>>> yet. I know that several people have expressed interest in
>>>>>>>> working on it, but I would like to see if anyone would be able to
>>>>>>>> volunteer a commitment of time and a rough estimate of when this
>>>>>>>> work could be done. It seems to me that if this slips beyond 2020
>>>>>>>> it will significantly diminish the value being created by other
>>>>>>>> parts of the project.
>>>>>>>>
>>>>>>>> Since I'm pretty familiar with all the Parquet code, I'm one
>>>>>>>> candidate person to take on this project (and I can dedicate the
>>>>>>>> time, but it would come at the expense of other projects where I
>>>>>>>> can also be useful). But Micah and others expressed interest in
>>>>>>>> working on it, so I wanted to have a discussion about it to see
>>>>>>>> what others think.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Wes
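[Editor's note] For readers unfamiliar with the "level generation" discussed in this thread: writing nested Arrow data to Parquet requires computing Dremel-style repetition and definition levels for each leaf column. The sketch below is purely illustrative, not the Arrow C++ implementation the PRs refer to (`generate_levels` is a hypothetical name); it shows the level encoding for the simplest nested case, an optional list of required ints, where max repetition level is 1 and max definition level is 2:

```python
# Illustrative sketch of Dremel-style level generation for one column
# whose type is an optional list of required ints. Not Arrow's API.

def generate_levels(records):
    """For each record (a list of ints, an empty list, or None) emit:
      definition level 0 -> list is null
      definition level 1 -> list present but empty
      definition level 2 -> value present
      repetition level 0 -> starts a new record
      repetition level 1 -> continues the current list
    Returns (rep_levels, def_levels, values)."""
    rep_levels, def_levels, values = [], [], []
    for lst in records:
        if lst is None:
            # Null list: one entry, nothing defined.
            rep_levels.append(0)
            def_levels.append(0)
        elif len(lst) == 0:
            # Empty list: the list node is defined, but no value is.
            rep_levels.append(0)
            def_levels.append(1)
        else:
            for i, v in enumerate(lst):
                rep_levels.append(0 if i == 0 else 1)
                def_levels.append(2)
                values.append(v)
    return rep_levels, def_levels, values

print(generate_levels([[1, 2], None, [], [3]]))
# -> ([0, 1, 0, 0, 0], [2, 2, 0, 1, 2], [1, 2, 3])
```

Note how null and empty lists produce no values yet remain distinguishable through the definition level alone; generalizing this to arbitrarily deep nesting of lists and structs on both the write and read paths is the hard algorithmic work the thread describes.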