Based on feedback on the PR and my own review, I merged #38.
Have a good weekend.
> On Feb 5, 2016, at 3:23 PM, Wes McKinney <[email protected]> wrote:
>
> hi folks
>
> We are making good progress on the read path in parquet-cpp, but we
> still have limited test coverage (and thus probably a bunch of
> non-working code) in a few key areas
>
> - the file reader public API, generally
> - column reader and scanner business logic
> - decompression codecs (I'm going to pick up this patch -- this
>   weekend maybe -- https://github.com/apache/parquet-cpp/pull/11)
> - parquet < 2.0 value decoders (level decoding is in good shape when
>   Deepak's level decoder patch is merged). For example PLAIN_DICTIONARY
>   decoding is not implemented
> - parquet 2.0 value encodings (unclear how urgent this is)
> - DataPageV2
>
> The sooner we can get the schema patch
> (https://github.com/apache/parquet-cpp/pull/38) merged the better to
> proceed with filling the rest of these gaps.
>
> AFAICT we have JIRAs tracking almost all of these items (and some
> other bugs) -- if you find gaps or something missing please create a
> JIRA and update the roadmap doc
> https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit
>
> Outside of functional requirements for read support, we have plenty of
> C++ engineering tidying to do, like:
>
> - eliminating build dependencies from being transitively included in
>   public headers (i.e. Boost and Thrift)
> - defining a public API in parquet/api/*.h for 3rd-party linkers
> - cleaning up includes
>   (https://github.com/include-what-you-use/include-what-you-use)
> - shared library symbol visibility (we may not need this for a while)
>
> Since file reading is the most pressing matter overall, I'm going to
> tilt my efforts toward completing the read path by the end of the
> month at the expense of the write path (outside of test fixtures to
> generate faux serialized data).
> For my needs the remaining tricky bit
> is columnar nested data structure reassembly but I'll defer on that
> until the other aspects are in good shape. I estimate about 30-40% of
> the effort is writing new code and 60-70% testing existing code and/or
> refactoring to enable component-level unit testing.
>
> Thank you all in advance for your help, patches, and code reviews.
>
> best,
> Wes
>
> On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <[email protected]> wrote:
>> Yeah, if the Apache build queue is clogged up with other projects' builds,
>> and you have a green build on your personal repo, I suggest posting that on
>> the PR and the reviewer can accept the patch after checking the git hash on
>> the green build. Hopefully now Travis CI has sorted out the infrastructure
>> problems so this won't happen again soon.
>>
>> On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <[email protected]> wrote:
>>>
>>> you can also enable travis for your personal repo which would have its
>>> own queue.
>>> Then you can have the build running on your branches.
>>>
>>> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <[email protected]> wrote:
>>>>
>>>> I have no problem substituting local testing as long as we test all the
>>>> environments that Travis does. I've done that to get around this problem in
>>>> the past. It takes a while to run each maven test profile, but it works.
>>>>
>>>> rb
>>>>
>>>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>>>>
>>>>> Also, things have been made much worse by Travis CI continuing to have
>>>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>>>> stalled by this morning so that no builds were completing; fortunately
>>>>> their support is quite responsive and they've resolved the queue
>>>>> blockage, so builds are executing again.
>>>>>
>>>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> There are 3 more patches outstanding that are causing blockage (418,
>>>>> 433, and 451/453), so I think if we get them merged today or
>>>>> tomorrow we should be able to proceed with some parallel
>>>>> efforts without quite as much conflict.
>>>>>
>>>>> On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> I'm going to try to be more active this week but I admittedly don't
>>>>> have a lot of time to work on this. I understand we need to get
>>>>> critical mass in committers, code, etc to keep this going but I
>>>>> think we're making good progress.
>>>>>
>>>>> On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>> Also as Nong mentioned, PRs should be prefixed by the jira
>>>>> id followed by a ":" as follows: "PARQUET-X: description".
>>>>> That's just to have the reference in the git changelog. The
>>>>> merge script enforces it.
>>>>>
>>>>>
>>>>> On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>> I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>>>>> each other.
>>>>> I see Nong (who's a committer) has been doing some
>>>>> reviews already.
>>>>>
>>>>> When you guys reach a consensus on a PR and want it
>>>>> merged please mention it in the PR (+1, LGTM) and
>>>>> mention us directly (@julienledem, ...) to have it
>>>>> merged.
>>>>>
>>>>> Right now I see that #19 and #21 have been committed
>>>>> (thanks Nong) but it is not clear to me in what order
>>>>> the others should be committed.
>>>>>
>>>>> For example Deepak should comment directly on #22 to
>>>>> approve it. Right now he mentioned it on another PR:
>>>>>
>>>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>>> Similarly Wes could confirm on that PR whether it looks
>>>>> good.
>>>>>
>>>>> Tomorrow is the Parquet sync up if you want to discuss
>>>>> further:
>>>>>
>>>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>>
>>>>>
>>>>> On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>> Aliaksei, thanks for being understanding here.
>>>>>
>>>>> I agree with you that it is too difficult. We really
>>>>> want to get the cpp side bootstrapped as soon as
>>>>> possible. Let's go with what you suggested, to have
>>>>> contributors review one another's patches and then
>>>>> ask a committer for a final review once both
>>>>> contributors reach a consensus.
>>>>>
>>>>> If there are issues that are easy to review, maybe
>>>>> some of us other than Nong can take a look.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> This sounds very reasonable. I do not argue to disregard the
>>>>> standard Apache approach to promoting contributors to
>>>>> committers. I am just pointing out that without the input from
>>>>> current committers it is hard for us to productively contribute
>>>>> to the project. As a consequence, it is hard for us to
>>>>> demonstrate our fit to become committers in the future.
>>>>> This leaves us in a deadlock, which can be resolved either by
>>>>> increased feedback from existing committers or by making us
>>>>> committers sooner.
>>>>>
>>>>> I understand that most committers on the Parquet project are
>>>>> working on the Java implementation, so it can be harder for
>>>>> them to review patches for parquet-cpp.
>>>>> In this regard, how about the following protocol for
>>>>> parquet-cpp pull requests: After contributors review and revise
>>>>> a pull request and agree that it is in a good shape, we will ask
>>>>> a designated committer to review and commit the pull request.
>>>>> So far we have been asking Nong; if there is a better designated
>>>>> committer for parquet-cpp, please let us know.
>>>>>
>>>>> Thank you,
>>>>> Aliaksei.
>>>>>
>>>>>
>>>>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Sorry about the current backlog on the parquet-cpp side. Most of
>>>>> the current committer base works on the Java implementation so
>>>>> it's either slow or not reliable for us to do those reviews.
>>>>>
>>>>> I think the best way to move forward is to review patches for
>>>>> each other. That will keep those issues progressing, make it
>>>>> easy for committers to validate the commit, and -- most
>>>>> importantly -- to build a trail of contributions that we can
>>>>> look at to vote in new committers.
>>>>>
>>>>> I completely sympathize with the need for committers on the CPP
>>>>> project, but I don't think this will take a long time given the
>>>>> current level of activity. We're really just trying to build
>>>>> confidence that:
>>>>>
>>>>> 1. You produce quality contributions and understand the codebase
>>>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>>>> 3. You defer judgment and ask others when you don't know
>>>>> 4. You respect others and interact professionally
>>>>>
>>>>> I don't think any of those are that hard to demonstrate, but I'd
>>>>> be uncomfortable not validating committers like we normally do.
>>>>> Especially in this situation, where I could easily see the
>>>>> amount of work you guys are doing adding up pretty quickly!
>>>>>
>>>>> Does that sound like a reasonable path forward?
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>>>
>>>>> Hi Nong and Julien,
>>>>>
>>>>> As Wes has pointed out, we have a number of patches for
>>>>> parquet-cpp outstanding. Wes, Deepak, and I have been reviewing
>>>>> each other's pull requests. At this point, the patches need to
>>>>> be reviewed and approved by Parquet committers in order to be
>>>>> committed to master.
>>>>>
>>>>> Unfortunately, there is not much activity on this side of the
>>>>> project. The lack of response from current committers is holding
>>>>> us back, and we have to repeatedly rebase our patches, merge
>>>>> multiple pull requests together, and overall step on each
>>>>> other's toes.
>>>>>
>>>>> Is it possible to make Wes, Deepak, and me committers on the
>>>>> project, so we can contribute to parquet-cpp more efficiently?
>>>>>
>>>>> Thanks,
>>>>> Aliaksei.
>>>>>
>>>>>
>>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>>
>>>>> Folks,
>>>>>
>>>>> We're working on a pretty solid patch queue.
>>>>>
>>>>> independent patches
>>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>>
>>>>> interdependent patches (order to apply patches)
>>>>> PARQUET-437 (MOSTLY REVIEWED): https://github.com/apache/parquet-cpp/pull/19
>>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>> PARQUET-451 & PARQUET-453: https://github.com/apache/parquet-cpp/pull/23
>>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433): https://github.com/apache/parquet-cpp/pull/24
>>>>>
>>>>> I'm going to take a breather and work on some other things this
>>>>> weekend, but I'll be available for code reviews and fixes to try
>>>>> to move along this patch queue.
>>>>>
>>>>> Thanks,
>>>>> Wes
>>>>>
>>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>>
>>>>> Great to meet you all!
>>>>>
>>>>> I've recently been collaborating with the Apache Drill team to
>>>>> spin out the ValueVector columnar in-memory data structure into
>>>>> a new standalone project that will be called Arrow [1] [2]. A
>>>>> brief summary of Arrow/ValueVectors is that it permits O(1)
>>>>> random access on nested columnar structures and is efficient for
>>>>> projections and scans in a columnar SQL setting.
>>>>>
>>>>> I'm very interested in making Parquet read/write support
>>>>> available to Python programmers via C/C++ extensions, so I'm
>>>>> going to be working over the next few months on a
>>>>> Parquet->Arrow->Python toolchain, along with some tools to
>>>>> manipulate tables of in-memory columnar data in the style of
>>>>> Python's pandas library.
>>>>>
>>>>> I will propose patches as needed to parquet-cpp to improve its
>>>>> performance and add functionality for writing Parquet files as
>>>>> well. The details of converting to/from Parquet's
>>>>> repetition/definition level representation of nested data will
>>>>> stay separate in the arrow-parquet adapter code.
>>>>>
>>>>> cheers,
>>>>> Wes
>>>>>
>>>>> [1]: http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>
>>>>> [2]: http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>>
>>>>>
>>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm very interested in this subject because I would like to
>>>>> export parquet data from HDFS to Vertica (using VSQL).
>>>>> I'm planning to work on it next quarter, but I will be very
>>>>> happy to help you on this subject (review, testing).
>>>>>
>>>>> Have a nice day,
>>>>> --
>>>>> Mickaël Lacour
>>>>> Senior Software Engineer
>>>>> Analytics Infrastructure team @Scalability
>>>>>
>>>>>
>>>>> ________________________________________
>>>>> From: Walkauskas, Stephen Gregory (Vertica) <[email protected]
>>>>> <mailto:[email protected]>>
>>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>>> To: Sandryhaila, Aliaksei; [email protected]
>>>>> <mailto:[email protected]>; Majeti, Deepak; [email protected]
>>>>> <mailto:[email protected]>; Wes McKinney
>>>>> Subject: Re: Parquet-cpp
>>>>>
>>>>> Yes, thanks for the introduction Julien.
>>>>>
>>>>> Nong and Wes,
>>>>>
>>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>>
>>>>> The Vertica database already supports optimized reads of ORC
>>>>> files (fast c++ parser, predicate pushdown, columns selection
>>>>> etc). We'd like to do the same for parquet.
>>>>>
>>>>> Cheers,
>>>>> Stephen
>>>>>
>>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>>
>>>>> Thank you for the introduction, Julien!
>>>>>
>>>>> Hello Nong and Wes,
>>>>>
>>>>> Stephen, Deepak and I are developing a C++ library to support
>>>>> Parquet in Vertica RDBMS. We are using Parquet-cpp as a starting
>>>>> point and are expanding its functionality as well as improving
>>>>> it and fixing bugs. We would like to contribute these
>>>>> improvements back to the open-source community. We plan to do
>>>>> this through the usual process of creating jiras that justify
>>>>> and explain a code change, and then submitting pull requests. We
>>>>> look forward to working with you on Parquet-cpp and to your
>>>>> feedback and suggestions.
>>>>>
>>>>> Best regards,
>>>>> Aliaksei.
>>>>>
>>>>>
>>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>>
>>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>>> I wanted to introduce you to each other as you are all looking
>>>>> at Parquet-cpp.
>>>>>
>>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>>> collaborate (I see you already doing this):
>>>>>
>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>>
>>>>> Nong is a committer and can merge pull requests (he also
>>>>> understands that code base very well).
>>>>> Other committers can too, feel free to ping us if you need help.
>>>>> Obviously, you don't need to be a committer to give others
>>>>> reviews (you just need one to approve and merge).
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>>
>>>>> --
>>>>> Julien
>>>>>
>>>>>
>>>>> --
>>>>> Julien
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>
>>>
>>> --
>>> Julien
>>
