> > I'd be OK with having the flag so long as the new code is the default. > Otherwise we'll never find out about the corner cases.
Completely agree, my PR makes the new code the default. On Fri, Mar 13, 2020 at 6:44 AM Wes McKinney <wesmck...@gmail.com> wrote: > On Thu, Mar 12, 2020 at 10:39 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > Maarten, I don't expect regressions for flat cases (I'm going to try to > run > > benchmarks comparison tonight). > > > > In terms of the flag, I'm more concerned about some corner case I didn't > > think of in testing or a workload that for some reason is better with the > > prior code. If either of these arise I would like to give users a way to > > mitigate issues so a patch release isn't an imperative. > > I'd be OK with having the flag so long as the new code is the default. > Otherwise we'll never find out about the corner cases. > > > I'm happy to cleanup the existing code as soon as the next release > happens > > though. > > > > Thoughts? > > > > -Micah > > > > On Thu, Mar 12, 2020 at 9:11 AM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > Maarten -- AFAIK Micah's work only affects nested / non-flat column > > > paths, so flat data should not be impacted. Since we have a partial > > > implementation of writes for nested data (lists-of-lists and > > > structs-of-structs, but no mix of the two) that was the performance > > > difference I was referencing. > > > > > > On Thu, Mar 12, 2020 at 10:43 AM Maarten Ballintijn < > maart...@xs4all.nl> > > > wrote: > > > > > > > > Hi Micah, > > > > > > > > How does the performance change for “flat” schemas? > > > > (particularly in the case of a large number of columns) > > > > > > > > Thanks, > > > > Maarten > > > > > > > > > > > > > > > > > On Mar 11, 2020, at 11:53 PM, Micah Kornfield < > emkornfi...@gmail.com> > > > wrote: > > > > > > > > > > Another status update. I've integrated the level generation code > with > > > the > > > > > parquet writing code [1]. > > > > > > > > > > After that PR is merged I'll add bindings in Python to control > > > versions of > > > > > the level generation algorithm and plan on moving on to the read > side. > > > > > > > > > > Thanks, > > > > > Micah > > > > > > > > > > [1] https://github.com/apache/arrow/pull/6586 > > > > > > > > > > On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield < > emkornfi...@gmail.com> > > > > > wrote: > > > > > > > > > >> Hi Igor, > > > > >> If you have the time > https://issues.apache.org/jira/browse/ARROW-7960 > > > might > > > > >> be a good task to pick up for this I think it should be a > relatively > > > small > > > > >> amount of code, so it is probably a good contribution to the > > > project. Once > > > > >> that is wrapped up we can see were we both are. > > > > >> > > > > >> Cheers, > > > > >> Micah > > > > >> > > > > >> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria < > igor.calab...@gmail.com > > > > > > > > >> wrote: > > > > >> > > > > >>> Hi Micah, I actually got involved with another personal project > and > > > had > > > > >>> to postpone my contribution to arrow a bit. The good news is > that I'm > > > > >>> almost done with it, so I could help you with the read side very > > > soon. Any > > > > >>> ideas how we could coordinate this? > > > > >>> > > > > >>> Em qua., 26 de fev. de 2020 às 21:06, Wes McKinney < > > > wesmck...@gmail.com> > > > > >>> escreveu: > > > > >>> > > > > >>>> hi Micah -- great news on the level generation PR. I'll try to > carve > > > > >>>> out some time for reviewing over the coming week. > > > > >>>> > > > > >>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield < > > > emkornfi...@gmail.com> > > > > >>>> wrote: > > > > >>>>> > > > > >>>>> Hi Igor, > > > > >>>>> I was wondering if you have made any progress on this? > > > > >>>>> > > > > >>>>> I posted a new PR [1] which I believe handles the difficult > > > algorithmic > > > > >>>>> part of writing. There will be some follow-ups but I think > this PR > > > > >>>> might > > > > >>>>> take a while to review, so I was thinking of starting to take a > > > look > > > > >>>> at the > > > > >>>>> read side if you haven't started yet, and circle back to the > final > > > > >>>>> integration for the write side once the PR is checked in. > > > > >>>>> > > > > >>>>> Thanks, > > > > >>>>> Micah > > > > >>>>> > > > > >>>>> [1] https://github.com/apache/arrow/pull/6490 > > > > >>>>> > > > > >>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria < > > > igor.calab...@gmail.com> > > > > >>>>> wrote: > > > > >>>>> > > > > >>>>>> Hi, I would love to help with this issue. I'm aware that this > is a > > > > >>>> huge > > > > >>>>>> task for a first contribution to arrow, but I feel that I > could > > > help > > > > >>>> with > > > > >>>>>> the read path. > > > > >>>>>> Reading parquet seems like a extremely complex task since both > > > > >>>> hive[0] and > > > > >>>>>> spark[1] tried to implement a "vectorized" version and they > all > > > > >>>> stopped > > > > >>>>>> short of supporting complex types. > > > > >>>>>> I wanted to at least give it a try and find out where the > > > challenge > > > > >>>> lies. > > > > >>>>>> > > > > >>>>>> Since you guys are much more familiar with the current code > base, > > > I > > > > >>>> could > > > > >>>>>> use some starting tips so I don't fall in common pitfalls and > > > > >>>> whatnot. > > > > >>>>>> > > > > >>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576 > > > > >>>>>> [1] > > > > >>>>>> > > > > >>>>>> > > > > >>>> > > > > https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45 > > > > >>>>>> > > > > >>>>>> On 2020/02/03 06:01:25, Micah Kornfield <e...@gmail.com> > wrote: > > > > >>>>>>> Just to give an update. I've been a little bit delayed, but > my > > > > >>>> progress > > > > >>>>>> is> > > > > >>>>>>> as follows:> > > > > >>>>>>> 1. Had 1 PR merged that will exercise basic end-to-end > tests.> > > > > >>>>>>> 2. Have another PR open that allows a configuration option > in > > > C++ > > > > >>>> to> > > > > >>>>>>> determine which algorithm version to use for > reading/writing, the > > > > >>>>>> existing> > > > > >>>>>>> version and the new version supported complex-nested > arrays. I > > > > >>>> think a> > > > > >>>>>>> large amount of code will be reused/delegated to but I will > err > > > on > > > > >>>> the > > > > >>>>>> side> > > > > >>>>>>> of not touching the existing code/algorithms so that any > errors > > > in > > > > >>>> the> > > > > >>>>>>> implementation or performance regressions can hopefully be > > > > >>>> mitigated at> > > > > >>>>>>> runtime. I expect in later releases (once the code has > "baked") > > > > >>>> will> > > > > >>>>>>> become a no-op.> > > > > >>>>>>> 3. Started coding the write path.> > > > > >>>>>>> > > > > >>>>>>> Which leaves:> > > > > >>>>>>> 1. Finishing the write path (I estimate 2-3 weeks) to be > code > > > > >>>> complete> > > > > >>>>>>> 2. Implementing the read path.> > > > > >>>>>>> > > > > >>>>>>> Again, I'm happy to collaborate if people have bandwidth and > want > > > > >>>> to> > > > > >>>>>>> contribute.> > > > > >>>>>>> > > > > >>>>>>> Thanks,> > > > > >>>>>>> Micah> > > > > >>>>>>> > > > > >>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield < > em...@gmail.com > > > >> > > > > >>>>>>> wrote:> > > > > >>>>>>> > > > > >>>>>>>> Hi Wes,> > > > > >>>>>>>> I'm still interested in doing the work. But don't to hold > > > > >>>> anybody up > > > > >>>>>> if> > > > > >>>>>>>> they have bandwidth.> > > > > >>>>>>>>> > > > > >>>>>>>> In order to actually make progress on this, my plan will be > to:> > > > > >>>>>>>> 1. Help with the current Java review backlog through early > next > > > > >>>> week > > > > >>>>>> or> > > > > >>>>>>>> so (this has been taking the majority of my time allocated > for > > > > >>>> Arrow> > > > > >>>>>>>> contributions for the last 6 months or so).> > > > > >>>>>>>> 2. Shift all my attention to trying to get this done (this > > > > >>>> means no> > > > > >>>>>>>> reviews other then closing out existing ones that I've > started > > > > >>>> until it > > > > >>>>>> is> > > > > >>>>>>>> done). Hopefully, other Java committers can help shrink the > > > > >>>> backlog> > > > > >>>>>>>> further (Jacques thanks for you recent efforts here).> > > > > >>>>>>>>> > > > > >>>>>>>> Thanks,> > > > > >>>>>>>> Micah> > > > > >>>>>>>>> > > > > >>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney < > we...@gmail.com> > > > > >>>> wrote:> > > > > >>>>>>>>> > > > > >>>>>>>>> hi folks,> > > > > >>>>>>>>>> > > > > >>>>>>>>> I think we have reached a point where the incomplete C++ > > > > >>>> Parquet> > > > > >>>>>>>>> nested data assembly/disassembly is harming the value of > > > > >>>> several> > > > > >>>>>>>>> others parts of the project, for example the Datasets API. > As > > > > >>>> another> > > > > >>>>>>>>> example, it's possible to ingest nested data from JSON but > not > > > > >>>> write> > > > > >>>>>>>>> it to Parquet in general.> > > > > >>>>>>>>>> > > > > >>>>>>>>> Implementing the nested data read and write path > completely is > > > > >>>> a> > > > > >>>>>>>>> difficult project requiring at least several weeks of > dedicated > > > > >>>> work,> > > > > >>>>>>>>> so it's not so surprising that it hasn't been accomplished > yet. > > > > >>>> I > > > > >>>>>> know> > > > > >>>>>>>>> that several people have expressed interest in working on > it, > > > > >>>> but I> > > > > >>>>>>>>> would like to see if anyone would be able to volunteer a > > > > >>>> commitment > > > > >>>>>> of> > > > > >>>>>>>>> time and guess on a rough timeline when this work could be > > > > >>>> done. It> > > > > >>>>>>>>> seems to me if this slips beyond 2020 it will significant > > > > >>>> diminish > > > > >>>>>> the> > > > > >>>>>>>>> value being created by other parts of the project.> > > > > >>>>>>>>>> > > > > >>>>>>>>> Since I'm pretty familiar with all the Parquet code I'm one > > > > >>>> candidate> > > > > >>>>>>>>> person to take on this project (and I can dedicate the > time, > > > > >>>> but it> > > > > >>>>>>>>> would come at the expense of other projects where I can > also > > > be> > > > > >>>>>>>>> useful). But Micah and others expressed interest in > working on > > > > >>>> it, so> > > > > >>>>>>>>> I wanted to have a discussion about it to see what others > > > > >>>> think.> > > > > >>>>>>>>>> > > > > >>>>>>>>> Thanks> > > > > >>>>>>>>> Wes> > > > > >>>>>>>>>> > > > > >>>>>>>>> > > > > >>>>>>> > > > > >>>>>> > > > > >>>> > > > > >>> > > > > > > > >