I'm not going to comment on the design of the parquet-cpp module and whether it 
is “closer” to parquet or arrow.

But I do think Wes’s proposal is consistent with Apache policy. PMCs make 
releases and govern communities; they don’t exist to manage code bases, except 
as a means to the end of creating releases of known provenance. The Parquet PMC 
can continue to make parquet-cpp releases, and to end-users those releases will 
look the same as they do today, even if the code for those releases were to 
move to a different git repo in the ASF.

Julian



> On Jul 30, 2018, at 3:05 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> 
> hi Deepak
> 
> On Mon, Jul 30, 2018 at 5:18 PM, Deepak Majeti <majeti.dee...@gmail.com> 
> wrote:
>> @Wes
>> My observation is that most of the parquet-cpp contributors you listed who
>> overlap with the Arrow community mainly contribute to the Arrow bindings
>> (the parquet::arrow layer) and platform API changes in the parquet-cpp
>> repo. Very few of them review or contribute patches to the parquet-cpp core.
>> 
> 
> So, what are you saying exactly, that some contributions or
> contributors to Apache Parquet matter more than others? I don't
> follow.
> 
> As a result of these individuals' efforts, the parquet-cpp libraries
> are being installed well over 100,000 times per month on a single
> install path (Python) alone.
> 
>> I believe improvements to the parquet-cpp core will be negatively impacted,
>> since merging the parquet-cpp and arrow-cpp repos will increase the barrier
>> to entry for new contributors interested in the parquet-cpp core. The
>> current extensions to the parquet-cpp core, such as bloom filters and
>> column encryption, are all being done by first-time contributors.
> 
> I don't understand why this would "increase the barrier to entry".
> Could you explain?
> 
> It is true that there would be more code in the codebase, but the
> build and test procedure would be no more complex. If anything,
> community productivity will be improved by having a more cohesive /
> centralized development platform (large amounts of code that Parquet
> depends on are in Apache Arrow already).
> 
>> 
>> If you believe there will be new interest in the parquet-cpp core with the
>> mono-repo approach, I am all for it.
> 
> Yes, I believe that this change will result in more and higher-quality
> code review of Parquet core changes and general improvements to
> developer productivity across the board. Developer productivity is
> what this is all about.
> 
> - Wes
> 
>> 
>> 
>> On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pcmor...@gmail.com> wrote:
>> 
>>> I do not claim to have insight into parquet-cpp development. However, from
>>> our experience developing Ray, I can say that the monorepo approach (for
>>> Ray) has improved things a lot. Before, we tried various schemes to split
>>> the project into multiple repos, but the build system and test
>>> infrastructure duplication and the overhead of synchronizing changes slowed
>>> development down significantly (and fixing bugs that touch both the subrepos
>>> and the main repo was inconvenient).
>>> 
>>> Also, the decision to put arrow and parquet-cpp into a common repo is
>>> independent of how tightly coupled the two projects are (there could be
>>> a matrix entry in Travis which tests that PRs keep them decoupled, or
>>> rather that they both just depend on a small common "base"). Google and
>>> Facebook demonstrate such independence by having many projects in the
>>> same repo, of course. It would be great if the open source community
>>> moved more in this direction too, I think.
>>> 
>>> Best,
>>> Philipp.
>>> 
>>> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> 
>>>> hi Donald,
>>>> 
>>>> This would make things worse, not better. Code changes routinely
>>>> involve changes to the build system, and so you could be talking about
>>>> having to make changes to 2 or 3 git repositories as the result of a
>>>> single new feature or bug fix. There isn't really a cross-repo CI
>>>> solution available.
>>>> 
>>>> I've seen some approaches to the monorepo problem using multiple git
>>>> repositories, such as
>>>> 
>>>> https://github.com/twosigma/git-meta
>>>> 
>>>> Until something like this has first-class support from the GitHub
>>>> platform and its CI services (Travis CI, AppVeyor), I don't think it
>>>> will work for us.
>>>> 
>>>> - Wes
>>>> 
>>>> On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com>
>>>> wrote:
>>>>> Could this work if each module gets configured as a sub-git repo? The
>>>>> top-level build tool goes into each sub-repo and picks the correct
>>>>> release version to test. Tests in Python would depend on the cpp
>>>>> sub-repo to ensure the API still passes.
>>>>> 
>>>>> This should be the best of both worlds, if sub-repos are a supported
>>>>> option.
>>>>> 
>>>>> --Donald E. Foss
>>>>> 
>>>>> On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> I dislike the current build system complications as well.
>>>>>> 
>>>>>> However, in my opinion, combining the code bases will severely impact the
>>>>>> progress of the parquet-cpp project and, implicitly, the progress of the
>>>>>> entire parquet project.
>>>>>> Combining would have made much more sense if parquet-cpp were a mature
>>>>>> project and codebase. But parquet-cpp (and the entire parquet project) is
>>>>>> evolving continuously, with new features being added including bloom
>>>>>> filters, column encryption, and indexes.
>>>>>> 
>>>>>> If the two code bases were merged, it would be much more difficult to
>>>>>> contribute to the parquet-cpp project, since the Arrow bindings would
>>>>>> have to be supported as well. Please correct me if I am wrong here.
>>>>>> 
>>>>>> Of the two evils, I think handling the build system and packaging
>>>>>> duplication is much more manageable, since they are quite stable at this
>>>>>> point.
>>>>>> 
>>>>>> Regarding "API changes cause awkward release coordination issues between
>>>>>> Arrow and Parquet": can we make minor releases of parquet-cpp (with the
>>>>>> needed API changes) as and when Arrow is released?
>>>>>> 
>>>>>> Regarding "we maintain Arrow conversion code in parquet-cpp for
>>>>>> converting between the Arrow columnar memory format and Parquet": can
>>>>>> this be moved to the Arrow project, so that parquet-cpp exposes the more
>>>>>> stable low-level APIs?
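>>>>>> 
>>>>>> For concreteness, here is a rough sketch of what I mean by the low-level
>>>>>> (non-Arrow) read path, loosely following the parquet-cpp reader example;
>>>>>> the file name and the assumption of a required INT64 column are purely
>>>>>> illustrative:
>>>>>> 
>>>>>>   #include <parquet/api/reader.h>
>>>>>> 
>>>>>>   #include <iostream>
>>>>>>   #include <memory>
>>>>>> 
>>>>>>   int main() {
>>>>>>     // Low-level reader API; no Arrow types appear at this layer.
>>>>>>     std::unique_ptr<parquet::ParquetFileReader> reader =
>>>>>>         parquet::ParquetFileReader::OpenFile("example.parquet");
>>>>>>     std::shared_ptr<parquet::RowGroupReader> row_group = reader->RowGroup(0);
>>>>>> 
>>>>>>     // Assume column 0 is a required INT64 column for this sketch.
>>>>>>     std::shared_ptr<parquet::ColumnReader> col = row_group->Column(0);
>>>>>>     auto* int64_reader = static_cast<parquet::Int64Reader*>(col.get());
>>>>>> 
>>>>>>     int64_t value = 0;
>>>>>>     int64_t values_read = 0;
>>>>>>     while (int64_reader->HasNext()) {
>>>>>>       // For a required column, the def/rep level pointers can be null.
>>>>>>       int64_reader->ReadBatch(1, nullptr, nullptr, &value, &values_read);
>>>>>>       std::cout << value << std::endl;
>>>>>>     }
>>>>>>     return 0;
>>>>>>   }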
>>>>>> 
>>>>>> I am also curious if the Arrow and Parquet Java implementations have
>>>>>> similar API compatibility issues.
>>>>>> 
>>>>>> 
>>>>>> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>> 
>>>>>>> hi folks,
>>>>>>> 
>>>>>>> We've been struggling for quite some time with the development
>>>>>>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>>>>>> 
>>>>>>> To explain the root issues:
>>>>>>> 
>>>>>>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>>>>>>> includes file interfaces, memory management, miscellaneous algorithms
>>>>>>> (e.g. dictionary encoding), etc. Note that before this "platform"
>>>>>>> dependency was introduced, there was significant duplicated code
>>>>>>> between these codebases and incompatible abstract interfaces for
>>>>>>> things like files
>>>>>>> 
>>>>>>> * we maintain Arrow conversion code in parquet-cpp for converting
>>>>>>> between the Arrow columnar memory format and Parquet (sketched below)
>>>>>>> 
>>>>>>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>>>>>>> Apache Arrow. This introduces a circular dependency into our CI.
>>>>>>> 
>>>>>>> * Substantial portions of our CMake build system and related tooling
>>>>>>> are duplicated between the Arrow and Parquet repos
>>>>>>> 
>>>>>>> * API changes cause awkward release coordination issues between Arrow
>>>>>>> and Parquet
>>>>>>> 
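>>>>>>> To make the coupling concrete, reading a Parquet file into an Arrow
>>>>>>> table today looks roughly like the sketch below; the Arrow file and
>>>>>>> memory-pool abstractions are the "platform code", and parquet::arrow is
>>>>>>> the conversion layer maintained in parquet-cpp (file name illustrative,
>>>>>>> error handling simplified):
>>>>>>> 
>>>>>>>   #include <arrow/io/file.h>
>>>>>>>   #include <arrow/memory_pool.h>
>>>>>>>   #include <arrow/table.h>
>>>>>>>   #include <parquet/arrow/reader.h>
>>>>>>>   #include <parquet/exception.h>
>>>>>>> 
>>>>>>>   #include <iostream>
>>>>>>>   #include <memory>
>>>>>>> 
>>>>>>>   int main() {
>>>>>>>     // Arrow "platform" pieces: file abstraction and memory pool.
>>>>>>>     std::shared_ptr<arrow::io::ReadableFile> infile;
>>>>>>>     PARQUET_THROW_NOT_OK(
>>>>>>>         arrow::io::ReadableFile::Open("example.parquet", &infile));
>>>>>>> 
>>>>>>>     // parquet::arrow conversion layer, currently living in parquet-cpp.
>>>>>>>     std::unique_ptr<parquet::arrow::FileReader> reader;
>>>>>>>     PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(
>>>>>>>         infile, arrow::default_memory_pool(), &reader));
>>>>>>> 
>>>>>>>     std::shared_ptr<arrow::Table> table;
>>>>>>>     PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
>>>>>>>     std::cout << "rows: " << table->num_rows() << std::endl;
>>>>>>>     return 0;
>>>>>>>   }
>>>>>>> 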
>>>>>>> I believe the best way to remedy the situation is to adopt a
>>>>>>> "Community over Code" approach and find a way for the Parquet and
>>>>>>> Arrow C++ development communities to operate out of the same code
>>>>>>> repository, i.e. the apache/arrow git repository.
>>>>>>> 
>>>>>>> This would bring major benefits:
>>>>>>> 
>>>>>>> * Shared CMake build infrastructure, developer tools, and CI
>>>>>>> infrastructure (Parquet is already being built as a dependency in
>>>>>>> Arrow's CI systems)
>>>>>>> 
>>>>>>> * Shared packaging and release management infrastructure
>>>>>>> 
>>>>>>> * Reduce / eliminate problems due to API changes (where we currently
>>>>>>> introduce breakage into our CI workflow when there is a breaking /
>>>>>>> incompatible change)
>>>>>>> 
>>>>>>> * Arrow releases would include a coordinated snapshot of the Parquet
>>>>>>> implementation as it stands
>>>>>>> 
>>>>>>> Continuing with the status quo has become unsatisfactory to me, and as
>>>>>>> a result I've become less motivated to work on the parquet-cpp
>>>>>>> codebase.
>>>>>>> 
>>>>>>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>>>>>>> Majeti. I think the issue of commit privileges could be resolved
>>>>>>> without too much difficulty or time.
>>>>>>> 
>>>>>>> I also think that, if it is deemed truly necessary, the Apache Parquet
>>>>>>> community could create release scripts to cut a minimal versioned
>>>>>>> Apache Parquet C++ release.
>>>>>>> 
>>>>>>> I know that some people are wary of monorepos and megaprojects, but as
>>>>>>> an example, TensorFlow is at least 10 times as large a project in
>>>>>>> terms of LOCs and number of different platform components, and it
>>>>>>> seems to be getting along just fine. I think we should be able to work
>>>>>>> together as a community to function just as well.
>>>>>>> 
>>>>>>> Interested in the opinions of others, and any other ideas for
>>>>>>> practical solutions to the above problems.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Wes
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> regards,
>>>>>> Deepak Majeti
>>>>>> 
>>>> 
>>> 
>> 
>> 
>> --
>> regards,
>> Deepak Majeti
