Hi Micah,

How does the performance change for “flat” schemas? (Particularly in the
case of a large number of columns.)
Thanks,
Maarten

> On Mar 11, 2020, at 11:53 PM, Micah Kornfield <[email protected]> wrote:
>
> Another status update. I've integrated the level generation code with the
> parquet writing code [1].
>
> After that PR is merged I'll add bindings in Python to control versions of
> the level generation algorithm, and plan on moving on to the read side.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/6586
>
> On Tue, Mar 3, 2020 at 9:07 PM Micah Kornfield <[email protected]> wrote:
>
>> Hi Igor,
>> If you have the time, https://issues.apache.org/jira/browse/ARROW-7960
>> might be a good task to pick up for this. I think it should be a
>> relatively small amount of code, so it is probably a good contribution to
>> the project. Once that is wrapped up we can see where we both are.
>>
>> Cheers,
>> Micah
>>
>> On Tue, Mar 3, 2020 at 8:25 AM Igor Calabria <[email protected]> wrote:
>>
>>> Hi Micah, I actually got involved with another personal project and had
>>> to postpone my contribution to arrow a bit. The good news is that I'm
>>> almost done with it, so I could help you with the read side very soon.
>>> Any ideas how we could coordinate this?
>>>
>>> On Wed, Feb 26, 2020 at 9:06 PM Wes McKinney <[email protected]> wrote:
>>>
>>>> hi Micah -- great news on the level generation PR. I'll try to carve
>>>> out some time for reviewing over the coming week.
>>>>
>>>> On Wed, Feb 26, 2020 at 3:10 AM Micah Kornfield <[email protected]> wrote:
>>>>>
>>>>> Hi Igor,
>>>>> I was wondering if you have made any progress on this?
>>>>>
>>>>> I posted a new PR [1] which I believe handles the difficult
>>>>> algorithmic part of writing. There will be some follow-ups, but I
>>>>> think this PR might take a while to review, so I was thinking of
>>>>> starting to take a look at the read side if you haven't started yet,
>>>>> and circling back to the final integration for the write side once
>>>>> the PR is checked in.
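[Editor's note: the "level generation" discussed above refers to Parquet's Dremel-style definition/repetition levels, which flatten nested values into columnar form. As a simplified illustration only (not the Arrow C++ algorithm in the PR), here is a sketch of the levels produced for a nullable `list<int>` column with non-nullable elements, where max definition level is 2 and max repetition level is 1:]

```python
def levels_for_list_column(rows):
    """Compute (definition_level, repetition_level) pairs for a
    nullable list<int> column with non-nullable elements.

    Definition levels: 0 = list is null, 1 = list present but empty,
    2 = element present. Repetition level 0 starts a new list;
    1 continues the current list.
    """
    pairs = []
    for row in rows:
        if row is None:
            pairs.append((0, 0))            # null list
        elif len(row) == 0:
            pairs.append((1, 0))            # empty list
        else:
            for i, _ in enumerate(row):
                pairs.append((2, 0 if i == 0 else 1))
    return pairs

# Three rows: a null list, an empty list, and [1, 2, 3].
print(levels_for_list_column([None, [], [1, 2, 3]]))
# → [(0, 0), (1, 0), (2, 0), (2, 1), (2, 1)]
```

[The real implementation must also handle nullable elements, structs, and arbitrary nesting depth, which is where the algorithmic difficulty lies.]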
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> [1] https://github.com/apache/arrow/pull/6490
>>>>>
>>>>> On Mon, Feb 3, 2020 at 4:08 PM Igor Calabria <[email protected]> wrote:
>>>>>
>>>>>> Hi, I would love to help with this issue. I'm aware that this is a
>>>>>> huge task for a first contribution to arrow, but I feel that I could
>>>>>> help with the read path.
>>>>>> Reading parquet seems like an extremely complex task, since both
>>>>>> hive [0] and spark [1] tried to implement a "vectorized" version and
>>>>>> both stopped short of supporting complex types.
>>>>>> I wanted to at least give it a try and find out where the challenge
>>>>>> lies.
>>>>>>
>>>>>> Since you guys are much more familiar with the current code base, I
>>>>>> could use some starting tips so I don't fall into common pitfalls
>>>>>> and whatnot.
>>>>>>
>>>>>> [0] https://issues.apache.org/jira/browse/HIVE-18576
>>>>>> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L45
>>>>>>
>>>>>> On 2020/02/03 06:01:25, Micah Kornfield <[email protected]> wrote:
>>>>>>> Just to give an update. I've been a little bit delayed, but my
>>>>>>> progress is as follows:
>>>>>>> 1. Had one PR merged that will exercise basic end-to-end tests.
>>>>>>> 2. Have another PR open that allows a configuration option in C++
>>>>>>> to determine which algorithm version to use for reading/writing:
>>>>>>> the existing version or the new version supporting complex nested
>>>>>>> arrays. I think a large amount of code will be reused/delegated to,
>>>>>>> but I will err on the side of not touching the existing
>>>>>>> code/algorithms, so that any errors in the implementation or
>>>>>>> performance regressions can hopefully be mitigated at runtime.
>>>>>>> I expect that in later releases (once the code has "baked") this
>>>>>>> option will become a no-op.
>>>>>>> 3. Started coding the write path.
>>>>>>>
>>>>>>> Which leaves:
>>>>>>> 1. Finishing the write path (I estimate 2-3 weeks to be code
>>>>>>> complete).
>>>>>>> 2. Implementing the read path.
>>>>>>>
>>>>>>> Again, I'm happy to collaborate if people have bandwidth and want
>>>>>>> to contribute.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> On Thu, Jan 9, 2020 at 10:31 PM Micah Kornfield <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Wes,
>>>>>>>> I'm still interested in doing the work, but don't want to hold
>>>>>>>> anybody up if they have bandwidth.
>>>>>>>>
>>>>>>>> In order to actually make progress on this, my plan will be to:
>>>>>>>> 1. Help with the current Java review backlog through early next
>>>>>>>> week or so (this has been taking the majority of my time allocated
>>>>>>>> for Arrow contributions for the last 6 months or so).
>>>>>>>> 2. Shift all my attention to trying to get this done (this means
>>>>>>>> no reviews other than closing out existing ones that I've started
>>>>>>>> until it is done). Hopefully, other Java committers can help
>>>>>>>> shrink the backlog further (Jacques, thanks for your recent
>>>>>>>> efforts here).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Micah
>>>>>>>>
>>>>>>>> On Thu, Jan 9, 2020 at 8:16 AM Wes McKinney <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> hi folks,
>>>>>>>>>
>>>>>>>>> I think we have reached a point where the incomplete C++ Parquet
>>>>>>>>> nested data assembly/disassembly is harming the value of several
>>>>>>>>> other parts of the project, for example the Datasets API.
>>>>>>>>> As another example, it's possible to ingest nested data from
>>>>>>>>> JSON but not, in general, to write it to Parquet.
>>>>>>>>>
>>>>>>>>> Implementing the nested data read and write path completely is a
>>>>>>>>> difficult project requiring at least several weeks of dedicated
>>>>>>>>> work, so it's not so surprising that it hasn't been accomplished
>>>>>>>>> yet. I know that several people have expressed interest in
>>>>>>>>> working on it, but I would like to see if anyone would be able to
>>>>>>>>> volunteer a commitment of time and guess at a rough timeline for
>>>>>>>>> when this work could be done. It seems to me that if this slips
>>>>>>>>> beyond 2020 it will significantly diminish the value being
>>>>>>>>> created by other parts of the project.
>>>>>>>>>
>>>>>>>>> Since I'm pretty familiar with all the Parquet code, I'm one
>>>>>>>>> candidate person to take on this project (and I can dedicate the
>>>>>>>>> time, but it would come at the expense of other projects where I
>>>>>>>>> can also be useful). But Micah and others expressed interest in
>>>>>>>>> working on it, so I wanted to have a discussion about it to see
>>>>>>>>> what others think.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wes
