Re: Parquet-cpp

Julien Le Dem Fri, 05 Feb 2016 18:05:20 -0800

Based on feedback on the PR and my own review, I merged #38.
Have a good weekend.


> On Feb 5, 2016, at 3:23 PM, Wes McKinney <[email protected]> wrote:
> 
> hi folks
> 
> We are making good progress on the read path in parquet-cpp, but we
> still have limited test coverage (and thus probably a bunch of
> non-working code) in a few key areas
> 
> - the file reader public API, generally
> - column reader and scanner business logic
> - decompression codecs (I'm going to pick up this patch -- this
> weekend maybe -- https://github.com/apache/parquet-cpp/pull/11)
> - parquet < 2.0 value decoders (level decoding is in good shape when
> Deepak's level decoder patch is merged). For example PLAIN_DICTIONARY
> decoding is not implemented
> - parquet 2.0 value encodings (unclear how urgent this is)
> - DataPageV2
> 
> The sooner we can get the schema patch
> (https://github.com/apache/parquet-cpp/pull/38) merged the better to
> proceed with filling the rest of these gaps.
> 
> AFAICT we have JIRAs tracking almost all of these items (and some
> other bugs) -- if you find gaps or something missing, please create a
> JIRA and update the roadmap doc
> https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit
> 
> Outside of functional requirements for read support, we have plenty of
> C++ engineering tidying to do, like:
> 
> - eliminating build dependencies from being transitively included in
> public headers (i.e. Boost and Thrift)
> - defining a public API in parquet/api/*.h for 3rd-party linkers
> - cleaning up includes
> (https://github.com/include-what-you-use/include-what-you-use)
> - shared library symbol visibility (we may not need this for a while)
> 
> Since file reading is the most pressing matter overall, I'm going to
> tilt my efforts toward completing the read path by the end of the
> month at the expense of the write path (outside of test fixtures to
> generate faux serialized data). For my needs the remaining tricky bit
> is columnar nested data structure reassembly but I'll defer on that
> until the other aspects are in good shape. I estimate about 30-40% of
> the effort is writing new code and 60-70% testing existing code and/or
> refactoring to enable component-level unit testing.
> 
> Thank you all in advance for your help, patches, and code reviews.
> 
> best,
> Wes
> 
> On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <[email protected]> wrote:
>> Yeah, if the Apache build queue is clogged up with other projects' builds,
>> and you have a green build on your personal repo, I suggest posting that on
>> the PR and the reviewer can accept the patch after checking the git hash on
>> the green build. Hopefully now Travis CI has sorted out the infrastructure
>> problems so this won't happen again soon.
>> 
>> On Wed, Jan 27, 2016 at 9:59 AM, Julien Le Dem <[email protected]> wrote:
>>> 
>>> you can also enable travis for your personal repo, which would have its
>>> own queue.
>>> Then you can have the build running on your branches.
>>> 
>>> On Wed, Jan 27, 2016 at 9:44 AM, Ryan Blue <[email protected]> wrote:
>>>> 
>>>> I have no problem substituting local testing as long as we test all the
>>>> environments that Travis does. I've done that to get around this problem in
>>>> the past. It takes a while to run each maven test profile, but it works.
>>>> 
>>>> rb
>>>> 
>>>> On 01/26/2016 09:44 PM, Wes McKinney wrote:
>>>>> 
>>>>> Also, things have been made much worse by Travis CI continuing to have
>>>>> infrastructure problems. The ASF build queue on Travis CI had completely
>>>>> stalled by this morning so that no builds were completing; fortunately
>>>>> their support is quite responsive and they've resolved the queue
>>>>> blockage, so builds are executing again.
>>>>> 
>>>>> On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <[email protected]
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>>    There are 3 more patches outstanding that are causing blockage (418,
>>>>>    433, and 451/453), so I think if we get them merged today or
>>>>>    tomorrow then we should be able to proceed with some parallel
>>>>>    efforts without quite as much conflict.
>>>>> 
>>>>>    On Tue, Jan 26, 2016 at 3:56 PM, Nong Li <[email protected]
>>>>>    <mailto:[email protected]>> wrote:
>>>>> 
>>>>>        I'm going to try to be more active this week but I admittedly don't
>>>>>        have a lot of time to work on this. I understand we need to get
>>>>>        critical mass in committers, code, etc. to keep this going but I
>>>>>        think we're making good progress.
>>>>> 
>>>>>        On Tue, Jan 26, 2016 at 3:27 PM, Julien Le Dem
>>>>>        <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>>            Also, as Nong mentioned, PRs should be prefixed by the JIRA
>>>>>            id followed by a ":", as follows: "PARQUET-X: description".
>>>>>            That's just to have the reference in the git changelog. The
>>>>>            merge script enforces it.
>>>>> 
>>>>> 
>>>>>            On Tue, Jan 26, 2016 at 3:24 PM, Julien Le Dem
>>>>>            <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>>                I'm happy too with Aliaksei, Deepak, Wes, etc reviewing
>>>>>                each other.
>>>>>                I see Nong (who's a committer) has been doing some
>>>>>                reviews already.
>>>>> 
>>>>>                When you guys reach a consensus on a PR and want it
>>>>>                merged please mention it in the PR (+1, LGTM) and
>>>>>                mention us directly (@julienledem, ...) to have it
>>>>>                merged.
>>>>> 
>>>>>                right now I see that #19 and #21 have been committed
>>>>>                (thanks Nong) but it is not clear to me in what order
>>>>>                the others should be committed.
>>>>> 
>>>>>                For example Deepak should comment directly on #22 to
>>>>>                approve it. Right now he mentioned it on another PR.
>>>>> 
>>>>> https://github.com/apache/parquet-cpp/pull/24#issuecomment-174354139
>>>>>                Similarly Wes could confirm on that PR whether it looks
>>>>>                good.
>>>>> 
>>>>>                Tomorrow is the Parquet sync up if you want to discuss
>>>>>                further:
>>>>> 
>>>>> https://plus.google.com/u/0/events/cvgi67jmoptmgb1i488re8scbuo
>>>>> 
>>>>> 
>>>>>                On Mon, Jan 25, 2016 at 4:20 PM, Ryan Blue
>>>>>                <[email protected] <mailto:[email protected]>> wrote:
>>>>> 
>>>>>                    Aliaksei, thanks for being understanding here.
>>>>> 
>>>>>                    I agree with you that it is too difficult. We really
>>>>>                    want to get the cpp side bootstrapped as soon as
>>>>>                    possible. Let's go with what you suggested, to have
>>>>>                    contributors review one another's patches and then
>>>>>                    ask a committer for a final review once both
>>>>>                    contributors reach a consensus.
>>>>> 
>>>>>                    If there are issues that are easy to review, maybe
>>>>>                    some of us other than Nong can take a look.
>>>>> 
>>>>>                    rb
>>>>> 
>>>>> 
>>>>>                    On 01/25/2016 02:33 PM, Aliaksei Sandryhaila wrote:
>>>>> 
>>>>>                        Hi Ryan,
>>>>> 
>>>>>                        This sounds very reasonable. I do not argue to
>>>>>                        disregard the standard
>>>>>                        Apache approach to promoting contributors to
>>>>>                        committers. I am just
>>>>>                        pointing out that without the input from current
>>>>>                        committers it is hard
>>>>>                        for us to productively contribute to the
>>>>>                        project. As a consequence, it
>>>>>                        is hard for us to demonstrate our fit to become
>>>>>                        committers in the future.
>>>>>                        This leaves us in a deadlock, which can be
>>>>>                        resolved either by an
>>>>>                        increased feedback from existing committers or
>>>>>                        by making us committers
>>>>>                        sooner.
>>>>> 
>>>>>                        I understand that most committers on the Parquet
>>>>>                        project are working on
>>>>>                        the Java implementation, so it can be harder for
>>>>>                        them to review patches
>>>>>                        for parquet-cpp. In this regard, how about the
>>>>>                        following protocol for
>>>>>                        parquet-cpp pull requests: After contributors
>>>>>                        review and revise a pull
>>>>>                        request and agree that it is in a good shape, we
>>>>>                        will ask a designated
>>>>>                        committer to review and commit the pull request.
>>>>>                        So far we have been
>>>>>                        asking Nong; if there is a better designated
>>>>>                        committer for parquet-cpp,
>>>>>                        please let us know.
>>>>> 
>>>>>                        Thank you,
>>>>>                        Aliaksei.
>>>>> 
>>>>> 
>>>>>                        On 01/25/2016 04:54 PM, Ryan Blue wrote:
>>>>> 
>>>>>                            Hi everyone,
>>>>> 
>>>>>                            Sorry about the current backlog on the
>>>>>                            parquet-cpp side. Most of the
>>>>>                            current committer base works on the Java
>>>>>                            implementation so it's either
>>>>>                            slow or not reliable for us to do those
>>>>>                            reviews.
>>>>> 
>>>>>                            I think the best way to move forward is to
>>>>>                            review patches for each
>>>>>                            other. That will keep those issues
>>>>>                            progressing, make it easy for
>>>>>                            committers to validate the commit, and --
>>>>>                            most importantly -- to build
>>>>>                            a trail of contributions that we can look at
>>>>>                            to vote in new committers.
>>>>> 
>>>>>                            I completely sympathize with the need for
>>>>>                            committers on the CPP
>>>>>                            project, but I don't think this will take a
>>>>>                            long time given the
>>>>>                            current level of activity. We're really just
>>>>>                            trying to build
>>>>>                            confidence that:
>>>>> 
>>>>>                            1. You produce quality contributions and
>>>>>                            understand the codebase
>>>>>                            2. You give friendly, thoughtful reviews and
>>>>>                            don't rubber-stamp
>>>>>                            3. You defer judgment and ask others when
>>>>>                            you don't know
>>>>>                            4. You respect others and interact
>>>>>                            professionally
>>>>> 
>>>>>                            I don't think any of those are that hard to
>>>>>                            demonstrate, but I'd be
>>>>>                            uncomfortable not validating committers like
>>>>>                            we normally do.
>>>>>                            Especially in this situation, where I could
>>>>>                            easily see the amount of
>>>>>                            work you guys are doing adding up pretty
>>>>>                            quickly!
>>>>> 
>>>>>                            Does that sound like a reasonable path
>>>>>                            forward?
>>>>> 
>>>>>                            rb
>>>>> 
>>>>> 
>>>>>                            On 01/25/2016 12:46 PM, Aliaksei Sandryhaila
>>>>>                            wrote:
>>>>> 
>>>>>                                Hi Nong and Julien,
>>>>> 
>>>>>                                As Wes has pointed out, we have a number
>>>>>                                of patches for parquet-cpp
>>>>>                                outstanding. Wes, Deepak, and I have
>>>>>                                been reviewing each other's pull
>>>>>                                requests. At this point, the patches
>>>>>                                need to be reviewed and approved by
>>>>>                                Parquet committers in order to be
>>>>>                                committed to master.
>>>>> 
>>>>>                                Unfortunately, there is not much
>>>>>                                activity on this side of the project.
>>>>>                                The lack of response from current
>>>>>                                committers is holding us back, and we
>>>>>                                have to repeatedly rebase our patches,
>>>>>                                merge multiple pull requests
>>>>>                                together, and overall step on each
>>>>>                                other's toes.
>>>>> 
>>>>>                                Is it possible to make Wes, Deepak, and
>>>>>                                me committers on the project, so
>>>>>                                we can contribute to parquet-cpp more
>>>>>                                efficiently?
>>>>> 
>>>>>                                Thanks,
>>>>>                                Aliaksei.
>>>>> 
>>>>> 
>>>>>                                On 01/23/2016 06:07 PM, Wes McKinney
>>>>>                                wrote:
>>>>> 
>>>>>                                    Folks,
>>>>> 
>>>>>                                    We're working on a pretty solid
>>>>>                                    patch queue.
>>>>> 
>>>>>                                    independent patches
>>>>>                                    PARQUET-449: https://github.com/apache/parquet-cpp/pull/21
>>>>> 
>>>>>                                    interdependent patches (order to
>>>>>                                    apply patches)
>>>>>                                    PARQUET-437 (MOSTLY REVIEWED): https://github.com/apache/parquet-cpp/pull/19
>>>>>                                    PARQUET-418: https://github.com/apache/parquet-cpp/pull/18
>>>>>                                    PARQUET-434: https://github.com/apache/parquet-cpp/pull/20
>>>>>                                    PARQUET-433: https://github.com/apache/parquet-cpp/pull/22
>>>>>                                    PARQUET-451 & PARQUET-453: https://github.com/apache/parquet-cpp/pull/23
>>>>>                                    PARQUET-428 (needs to be rebased on
>>>>>                                    top of PARQUET-433): https://github.com/apache/parquet-cpp/pull/24
>>>>> 
>>>>>                                    I'm going to take a breather and
>>>>>                                    work on some other things this
>>>>>                                    weekend, but I'll be available for
>>>>>                                    code reviews and fixes to try to
>>>>>                                    move along this patch queue.
>>>>> 
>>>>>                                    Thanks,
>>>>>                                    Wes
>>>>> 
>>>>>                                    On Fri, Jan 15, 2016 at 8:18 AM, Wes
>>>>>                                    McKinney <[email protected]
>>>>>                                    <mailto:[email protected]>> wrote:
>>>>> 
>>>>>                                        Great to meet you all!
>>>>> 
>>>>>                                        I've recently been collaborating
>>>>>                                        with the Apache Drill team to spin
>>>>>                                        out the ValueVector columnar
>>>>>                                        in-memory data structure into a new
>>>>>                                        standalone project that will be
>>>>>                                        called Arrow [1] [2]. A brief
>>>>>                                        summary of Arrow/ValueVectors is
>>>>>                                        that it permits O(1) random access
>>>>>                                        on nested columnar structures and
>>>>>                                        is efficient for projections and
>>>>>                                        scans in a columnar SQL setting.
>>>>> 
>>>>>                                        I'm very interested in making
>>>>>                                        Parquet read/write support
>>>>>                                        available to Python programmers via
>>>>>                                        C/C++ extensions, so I'm going to
>>>>>                                        be working over the next few months
>>>>>                                        on a Parquet->Arrow->Python
>>>>>                                        toolchain, along with some tools to
>>>>>                                        manipulate tables of in-memory
>>>>>                                        columnar data in the style of
>>>>>                                        Python's pandas library.
>>>>> 
>>>>>                                        I will propose patches as needed to
>>>>>                                        parquet-cpp to improve its
>>>>>                                        performance and add functionality
>>>>>                                        for writing Parquet files as well.
>>>>>                                        The details of converting to/from
>>>>>                                        Parquet's repetition/definition
>>>>>                                        level representation of nested data
>>>>>                                        will stay separate in the
>>>>>                                        arrow-parquet adapter code.
>>>>> 
>>>>>                                        cheers,
>>>>>                                        Wes
>>>>> 
>>>>>                                        [1]: http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
>>>>>                                        [2]: http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490
>>>>> 
>>>>> 
>>>>>                                        On Fri, Jan 15, 2016 at 1:22 AM,
>>>>>                                        Mickaël Lacour
>>>>>                                        <[email protected]
>>>>>                                        <mailto:[email protected]>>
>>>>>                                        wrote:
>>>>> 
>>>>>                                            Hi,
>>>>> 
>>>>>                                            I'm very interested in this subject
>>>>>                                            because I would like to export
>>>>>                                            parquet data from HDFS to Vertica
>>>>>                                            (using VSQL). I'm planning to work
>>>>>                                            on it next quarter, but I will be
>>>>>                                            very happy to help you on this
>>>>>                                            subject (review, testing).
>>>>> 
>>>>>                                            Have a nice day,
>>>>>                                            --
>>>>>                                            Mickaël Lacour
>>>>>                                            Senior Software Engineer
>>>>>                                            Analytics Infrastructure
>>>>>                                            team @Scalability
>>>>> 
>>>>> 
>>>>> ________________________________________
>>>>>                                            From: Walkauskas, Stephen
>>>>>                                            Gregory (Vertica)
>>>>>                                            <[email protected]
>>>>> 
>>>>> <mailto:[email protected]>>
>>>>>                                            Sent: Thursday, January 14,
>>>>>                                            2016 3:23 PM
>>>>>                                            To: Sandryhaila, Aliaksei;
>>>>>                                            [email protected]
>>>>> 
>>>>> <mailto:[email protected]>;
>>>>>                                            Majeti, Deepak;
>>>>>                                            [email protected]
>>>>>                                            <mailto:[email protected]>;
>>>>> 
>>>>>                                            Wes McKinney
>>>>>                                            Subject: Re: Parquet-cpp
>>>>> 
>>>>>                                            Yes, thanks for the
>>>>>                                            introduction Julien.
>>>>> 
>>>>>                                            Nong and Wes,
>>>>> 
>>>>>                                            It'd be interesting to know
>>>>>                                            your goals for parquet-cpp.
>>>>> 
>>>>>                                            The Vertica database already
>>>>>                                            supports optimized reads of ORC
>>>>>                                            files (fast C++ parser, predicate
>>>>>                                            pushdown, column selection, etc.).
>>>>>                                            We'd like to do the same for
>>>>>                                            parquet.
>>>>> 
>>>>>                                            Cheers,
>>>>>                                            Stephen
>>>>> 
>>>>>                                            On 01/13/2016 05:53 PM,
>>>>>                                            Sandryhaila, Aliaksei wrote:
>>>>> 
>>>>>                                                Thank you for the
>>>>>                                                introduction, Julien!
>>>>> 
>>>>>                                                Hello Nong and Wes,
>>>>> 
>>>>>                                                Stephen, Deepak and I are
>>>>>                                                developing a C++ library to support
>>>>>                                                Parquet in Vertica RDBMS. We are
>>>>>                                                using Parquet-cpp as a starting
>>>>>                                                point and are expanding its
>>>>>                                                functionality as well as improving
>>>>>                                                it and fixing bugs. We would like
>>>>>                                                to contribute these improvements
>>>>>                                                back to the open-source community.
>>>>>                                                We plan to do this through the
>>>>>                                                usual process of creating jiras
>>>>>                                                that justify and explain a code
>>>>>                                                change, and then submitting pull
>>>>>                                                requests. We look forward to
>>>>>                                                working with you on Parquet-cpp and
>>>>>                                                to your feedback and suggestions.
>>>>> 
>>>>>                                                Best regards,
>>>>>                                                Aliaksei.
>>>>> 
>>>>> 
>>>>>                                                On 01/13/2016 02:54 PM,
>>>>>                                                Julien Le Dem wrote:
>>>>> 
>>>>>                                                    Hello Nong, Wes, Stephen, Deepak and
>>>>>                                                    Aliaksei
>>>>>                                                    I wanted to introduce you to each
>>>>>                                                    other as you are all looking at
>>>>>                                                    Parquet-cpp.
>>>>> 
>>>>>                                                    I'd recommend opening JIRAs in the
>>>>>                                                    parquet-cpp component to collaborate
>>>>>                                                    (I see you already doing this):
>>>>> 
>>>>> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
>>>>> 
>>>>>                                                    Nong is a committer and can merge
>>>>>                                                    pull requests (he also understands
>>>>>                                                    that code base very well).
>>>>>                                                    Other committers can too, feel free
>>>>>                                                    to ping us if you need help
>>>>>                                                    Obviously, you don't need to be a
>>>>>                                                    committer to give others reviews
>>>>>                                                    (you just need one to approve and
>>>>>                                                    merge).
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>                    --
>>>>>                    Ryan Blue
>>>>>                    Software Engineer
>>>>>                    Cloudera, Inc.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>                --
>>>>>                Julien
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>            --
>>>>>            Julien
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Julien
>> 
>>
