hi folks,

We are making good progress on the read path in parquet-cpp, but we still
have limited test coverage (and thus probably a bunch of non-working code)
in a few key areas:

- the file reader public API, generally
- column reader and scanner business logic
- decompression codecs (I'm going to pick up this patch -- this weekend
  maybe -- https://github.com/apache/parquet-cpp/pull/11)
- parquet < 2.0 value decoders (level decoding will be in good shape once
  Deepak's level decoder patch is merged). For example, PLAIN_DICTIONARY
  decoding is not implemented
- parquet 2.0 value encodings (unclear how urgent this is)
- DataPageV2

The sooner we can get the schema patch
(https://github.com/apache/parquet-cpp/pull/38) merged, the sooner we can
proceed with filling the rest of these gaps. AFAICT we have JIRAs tracking
almost all of these items (and some other bugs) -- if you find something
missing, please create a JIRA and update the roadmap doc:

https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit

Outside of functional requirements for read support, we have plenty of C++
engineering tidying to do, like:

- keeping build dependencies (i.e. Boost and Thrift) from being
  transitively included in public headers
- defining a public API in parquet/api/*.h for 3rd-party linkers
- cleaning up includes
  (https://github.com/include-what-you-use/include-what-you-use)
- shared library symbol visibility (we may not need this for a while)

Since file reading is the most pressing matter overall, I'm going to tilt
my efforts toward completing the read path by the end of the month at the
expense of the write path (outside of test fixtures to generate faux
serialized data). For my needs the remaining tricky bit is columnar nested
data structure reassembly, but I'll defer that until the other aspects are
in good shape. I estimate about 30-40% of the effort is writing new code
and 60-70% is testing existing code and/or refactoring to enable
component-level unit testing.

Thank you all in advance for your help, patches, and code reviews.

best,
Wes

On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <[email protected]> wrote:
> Yeah, if the Apache build queue is clogged up with other projects' builds,
> and you have a green build on your personal repo, I suggest posting that on
> the PR and the reviewer can accept the patch after checking the git hash on
> the green build. Hopefully now Travis CI has sorted out the infrastructure
> problems so this won't happen again soon.
>
> On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <[email protected]> wrote:
>>
>> you can also enable travis for your personal repo which would have its
>> own queue.
>> Then you can have the build running on your branches.
>>
>> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <[email protected]> wrote:
>>>
>>> I have no problem substituting local testing as long as we test all the
>>> environments that Travis does. I've done that to get around this problem in
>>> the past. It takes a while to run each maven test profile, but it works.
>>>
>>> rb
>>>
>>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>>>
>>>> Also, things have been made much worse by Travis CI continuing to have
>>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>>> stalled by this morning so that no builds were completing; fortunately
>>>> their support is quite responsible and they've resolved the queue
>>>> blockage, so builds are executing again.
>>>>
>>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <[email protected]> wrote:
>>>>
>>>> There's 3 more patches outstanding that are causing blockage (418,
>>>> 433, and 451/453), so I think if we get them merged today or
>>>> tomorrow we should be able to proceed with some parallel
>>>> efforts without quite as much conflict.
>>>>
>>>> On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <[email protected]> wrote:
>>>>
>>>> I'm going to try to be more active this week but I admittedly don't
>>>> have a lot of time to work on this.
>>>> I understand we need to get critical mass in committers, code, etc to
>>>> keep this going but I think we're making good progress.
>>>>
>>>> On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem <[email protected]> wrote:
>>>>
>>>> Also as Nong mentioned, PRs should be prefixed by the jira id followed
>>>> by a ":" as follows "PARQUET-X: description"
>>>> that's just to have the reference in the git changelog. The merge
>>>> script enforces it.
>>>>
>>>> On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem <[email protected]> wrote:
>>>>
>>>> I'm happy too with Aliaksei, Deepak, Wes, etc reviewing each other.
>>>> I see Nong (who's a committer) has been doing some reviews already.
>>>>
>>>> When you guys reach a consensus on a PR and want it merged please
>>>> mention it in the PR (+1, LGTM) and mention us directly
>>>> (@julienledem, ...) to have it merged.
>>>>
>>>> right now I see that #19 and #21 have been committed (thanks Nong)
>>>> but it is not clear to me in what order the others should be
>>>> committed.
>>>>
>>>> For example Deepak should comment directly on #22 to approve it.
>>>> Right now he mentioned it on another PR.
>>>>
>>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>> Similarly Wes could confirm on that PR whether it looks good.
>>>>
>>>> Tomorrow is the Parquet sync up if you want to discuss further:
>>>>
>>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>
>>>> On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue <[email protected]> wrote:
>>>>
>>>> Aliaksei, thanks for being understanding here.
>>>>
>>>> I agree with you that it is too difficult. We really want to get the
>>>> cpp side bootstrapped as soon as possible.
>>>> Let's go with what you suggested, to have contributors review one
>>>> another's patches and then ask a committer for a final review once
>>>> both contributors reach a consensus.
>>>>
>>>> If there are issues that are easy to review, maybe some of us other
>>>> than Nong can take a look.
>>>>
>>>> rb
>>>>
>>>> On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>
>>>> Hi Ryan,
>>>>
>>>> This sounds very reasonable. I do not argue to disregard the standard
>>>> Apache approach to promoting contributors to committers. I am just
>>>> pointing out that without the input from current committers it is hard
>>>> for us to productively contribute to the project. As a consequence, it
>>>> is hard for us to demonstrate our fit to become committers in the
>>>> future. This leaves us in a deadlock, which can be resolved either by
>>>> increased feedback from existing committers or by making us committers
>>>> sooner.
>>>>
>>>> I understand that most committers on the Parquet project are working on
>>>> the Java implementation, so it can be harder for them to review patches
>>>> for parquet-cpp. In this regard, how about the following protocol for
>>>> parquet-cpp pull requests: After contributors review and revise a pull
>>>> request and agree that it is in good shape, we will ask a designated
>>>> committer to review and commit the pull request. So far we have been
>>>> asking Nong; if there is a better designated committer for parquet-cpp,
>>>> please let us know.
>>>>
>>>> Thank you,
>>>> Aliaksei.
>>>>
>>>> On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> Sorry about the current backlog on the parquet-cpp side. Most of the
>>>> current committer base works on the Java implementation so it's either
>>>> slow or not reliable for us to do those reviews.
>>>>
>>>> I think the best way to move forward is to review patches for each
>>>> other.
>>>> That will keep those issues progressing, make it easy for
>>>> committers to validate the commit, and -- most importantly -- build
>>>> a trail of contributions that we can look at to vote in new committers.
>>>>
>>>> I completely sympathize with the need for committers on the CPP
>>>> project, but I don't think this will take a long time given the
>>>> current level of activity. We're really just trying to build
>>>> confidence that:
>>>>
>>>> 1. You produce quality contributions and understand the codebase
>>>> 2. You give friendly, thoughtful reviews and don't rubber-stamp
>>>> 3. You defer judgment and ask others when you don't know
>>>> 4. You respect others and interact professionally
>>>>
>>>> I don't think any of those are that hard to demonstrate, but I'd be
>>>> uncomfortable not validating committers like we normally do.
>>>> Especially in this situation, where I could easily see the amount of
>>>> work you guys are doing adding up pretty quickly!
>>>>
>>>> Does that sound like a reasonable path forward?
>>>>
>>>> rb
>>>>
>>>> On 01/25/2016 12:46 PM, Aliaksei Sandryhaila wrote:
>>>>
>>>> Hi Nong and Julien,
>>>>
>>>> As Wes has pointed out, we have a number of patches for parquet-cpp
>>>> outstanding. Wes, Deepak, and I have been reviewing each other's pull
>>>> requests. At this point, the patches need to be reviewed and approved
>>>> by Parquet committers in order to be committed to master.
>>>>
>>>> Unfortunately, there is not much activity on this side of the project.
>>>> The lack of response from current committers is holding us back, and we
>>>> have to repeatedly rebase our patches, merge multiple pull requests
>>>> together, and overall step on each other's toes.
>>>>
>>>> Is it possible to make Wes, Deepak, and me committers on the project, so
>>>> we can contribute to parquet-cpp more efficiently?
>>>>
>>>> Thanks,
>>>> Aliaksei.
>>>>
>>>> On 01/23/2016 06:07 PM, Wes McKinney wrote:
>>>>
>>>> Folks,
>>>>
>>>> We're working on a pretty solid patch queue.
>>>>
>>>> independent patches
>>>> PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>
>>>> interdependent patches (order to apply patches)
>>>> PARQUET-437 (MOSTLY REVIEWED): https://github.com/apache/parquet-cpp/pull/19
>>>> PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>> PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>> PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>> PARQUET-451 & PARQUET-453: https://github.com/apache/parquet-cpp/pull/23
>>>> PARQUET-428 (needs to be rebased on top of PARQUET-433): https://github.com/apache/parquet-cpp/pull/24
>>>>
>>>> I'm going to take a breather and work on some other things this
>>>> weekend, but I'll be available for code reviews and fixes to try to
>>>> move along this patch queue.
>>>>
>>>> Thanks,
>>>> Wes
>>>>
>>>> On Fri, Jan 15, 2016 at 8:18 AM, Wes McKinney <[email protected]> wrote:
>>>>
>>>> Great to meet you all!
>>>>
>>>> I've recently been collaborating with the Apache Drill team to spin
>>>> out the ValueVector columnar in-memory data structure into a new
>>>> standalone project that will be called Arrow [1] [2]. A brief summary
>>>> of Arrow/ValueVectors is that it permits O(1) random access on nested
>>>> columnar structures and is efficient for projections and scans in a
>>>> columnar SQL setting.
>>>>
>>>> I'm very interested in making Parquet read/write support available to
>>>> Python programmers via C/C++ extensions, so I'm going to be working
>>>> the next few months on a Parquet->Arrow->Python toolchain, along with
>>>> some tools to manipulate tables in-memory columnar data in the style
>>>> of Python's pandas library.
>>>>
>>>> I will propose patches as needed to parquet-cpp to improve its
>>>> performance and add functionality for writing Parquet files as well.
>>>> The details of converting to/from Parquet's repetition/definition
>>>> level representation of nested data will stay separate in the
>>>> arrow-parquet adapter code.
>>>>
>>>> cheers,
>>>> Wes
>>>>
>>>> [1]:
>>>> http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>
>>>> [2]:
>>>> http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>
>>>> On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm very interested in this subject because I would like to export
>>>> parquet data from HDFS to Vertica (using VSQL).
>>>> I'm planning to work on it next quarter, but I will be very happy to
>>>> help you on this subject (review, testing).
>>>>
>>>> Have a nice day,
>>>> --
>>>> Mickaël Lacour
>>>> Senior Software Engineer
>>>> Analytics Infrastructure team @Scalability
>>>>
>>>> ________________________________________
>>>> From: Walkauskas, Stephen Gregory (Vertica) <[email protected]>
>>>> Sent: Thursday, January 14, 2016 3:23 PM
>>>> To: Sandryhaila, Aliaksei; [email protected]; Majeti, Deepak;
>>>> [email protected]; Wes McKinney
>>>> Subject: Re: Parquet-cpp
>>>>
>>>> Yes, thanks for the introduction Julien.
>>>>
>>>> Nong and Wes,
>>>>
>>>> It'd be interesting to know your goals for parquet-cpp.
>>>>
>>>> The Vertica database already supports optimized reads of ORC files
>>>> (fast c++ parser, predicate pushdown, columns selection etc). We'd
>>>> like to do the same for parquet.
>>>>
>>>> Cheers,
>>>> Stephen
>>>>
>>>> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
>>>>
>>>> Thank you for the introduction, Julien!
>>>>
>>>> Hello Nong and Wes,
>>>>
>>>> Stephen, Deepak and I are developing a C++ library to support Parquet
>>>> in Vertica RDBMS. We are using Parquet-cpp as a starting point and are
>>>> expanding its functionality as well as improving it and fixing bugs.
>>>> We would like to contribute these improvements back to the open-source
>>>> community. We plan to do this through the usual process of creating
>>>> jiras that justify and explain a code change, and then submitting pull
>>>> requests. We look forward to working with you on Parquet-cpp and to
>>>> your feedback and suggestions.
>>>>
>>>> Best regards,
>>>> Aliaksei.
>>>>
>>>> On 01/13/2016 02:54 PM, Julien Le Dem wrote:
>>>>
>>>> Hello Nong, Wes, Stephen, Deepak and Aliaksei
>>>> I wanted to introduce you to each other as you are all looking at
>>>> Parquet-cpp.
>>>>
>>>> I'd recommend opening JIRAs in the parquet-cpp component to
>>>> collaborate (I see you already doing this):
>>>>
>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>
>>>> Nong is a committer and can merge pull requests (he also understands
>>>> that code base very well).
>>>> Other committers can too, feel free to ping us if you need help
>>>> Obviously, you don't need to be a committer to give others reviews
>>>> (you just need one to approve and merge).
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>>>> --
>>>> Julien
>>>>
>>>> --
>>>> Julien
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>
>> --
>> Julien
>
