hi Deepak,

responses inline

On Sun, Jul 29, 2018 at 10:44 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp is a mature
> project and codebase.  But parquet-cpp (and the entire parquet project) is
> evolving continuously with new features being added including bloom
> filters,  column encryption, and indexes.
>

I don't see why parquet-cpp development would be impacted in a
negative way. In fact, I've argued exactly the opposite. Can you
explain in more detail why you think this would be the case? If
anything, parquet-cpp would benefit from more mature and better
maintained developer infrastructure.

Here's the project shortlog:

$ git shortlog -sn 08acdf6bfe3cd160ffe19b79bbded2bdc3f7bd62..master
   145  Wes McKinney
   109  Uwe L. Korn
    53  Deepak Majeti
    38  Korn, Uwe
    36  Nong Li
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     8  rip.nsk
     6  Phillip Cloud
     6  Xianjin YE
     5  Aliaksei Sandryhaila
     4  Thomas Sanchez
     3  Artem Tarasov
     3  Joshua Storck
     3  Lars Volker
     3  fscheibner
     3  revaliu
     2  Itai Incze
     2  Kalon Mills
     2  Marc Vertes
     2  Mike Trinkala
     2  Philipp Hoch
     1  Alec Posney
     1  Christopher C. Aycock
     1  Colin Nichols
     1  Dmitry Bushev
     1  Eric Daniel
     1  Fabrizio Fabbri
     1  Florian Scheibner
     1  Jaguar Xiong
     1  Julien Lafaye
     1  Julius Neuffer
     1  Kashif Rasul
     1  Rene Sugar
     1  Robert Gruener
     1  Toby Shaw
     1  William Forson
     1  Yue Chen
     1  thamht4190

Out of these, I know for a fact that at least the following
contributed to parquet-cpp as a result of their involvement with
Apache Arrow:

   145  Wes McKinney
   109  Uwe L. Korn
    38  Korn, Uwe
    12  Kouhei Sutou
    10  Max Risuhin
     9  Antoine Pitrou
     6  Phillip Cloud
     3  Joshua Storck
     1  Christopher C. Aycock
     1  Rene Sugar
     1  Robert Gruener

This accounts for ~70% of the commits.

> If the two code bases merged, it will be much more difficult to contribute
> to the parquet-cpp project since now Arrow bindings have to be supported as
> well. Please correct me if I am wrong here.

I don't see why this would be true. The people above are already
supporting these bindings (which are pretty isolated to the symbols in
the parquet::arrow namespace), and patches not having to do with the
Arrow columnar data structures would not be affected.
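
To make that concrete, here is a rough sketch of what the binding
layer looks like from the outside (header names and signatures are
approximate and may differ between parquet-cpp / Arrow versions):

    // Sketch only -- exact signatures may vary across versions.
    #include <memory>
    #include <string>

    #include "arrow/io/file.h"         // Arrow "platform" IO interfaces
    #include "arrow/memory_pool.h"
    #include "arrow/status.h"
    #include "arrow/table.h"
    #include "parquet/arrow/reader.h"  // the Arrow binding layer (parquet::arrow)

    arrow::Status ReadToArrowTable(const std::string& path,
                                   std::shared_ptr<arrow::Table>* out) {
      std::shared_ptr<arrow::io::ReadableFile> infile;
      arrow::Status st = arrow::io::ReadableFile::Open(path, &infile);
      if (!st.ok()) return st;

      // Only this layer knows about the Arrow columnar format; the
      // low-level parquet:: APIs underneath it do not.
      std::unique_ptr<parquet::arrow::FileReader> reader;
      st = parquet::arrow::OpenFile(infile, arrow::default_memory_pool(),
                                    &reader);
      if (!st.ok()) return st;

      return reader->ReadTable(out);
    }

Patches to encodings, statistics, metadata handling, etc. generally
never need to touch this layer.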

For the reasons I laid out in my first e-mail, it will actually be
less work for the developers working on both projects to maintain
these interfaces. Currently, improving APIs and fixing bugs often
requires coordinated patches to multiple repositories.

>
> Out of the two evils, I think handling the build system, packaging
> duplication is much more manageable since they are quite stable at this
> point.

We've been talking about this for a long time, and no concrete,
actionable solution has come forward.

>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?

The central issue is that a single change frequently requires patches
to multiple codebases, and cross-repo CI to verify those patches
jointly is not really possible.

>
> Regarding "we maintain a Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project and expose the more stable low-level APIs in
> parquet-cpp?

The parts of Parquet that do not interact with the Arrow columnar
format still use Arrow platform APIs (IO, memory management,
compression, algorithms, etc.). We would therefore still have a
circular dependency, though some parts (e.g. changes in the
parquet::arrow layer) might become easier.
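
For illustration, even the "pure Parquet" low-level read path (again a
rough sketch; exact headers and signatures may vary by version) sits
on top of Arrow platform code -- the file IO abstractions, memory
pool, and compression codecs all come from Arrow:

    // Sketch only -- no Arrow columnar types involved, but the Arrow
    // platform layer (IO, memory, compression) is still underneath.
    #include <iostream>
    #include <memory>

    #include "parquet/api/reader.h"  // low-level Parquet reader APIs

    int main() {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile("example.parquet");

      std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
      std::cout << "rows: " << md->num_rows()
                << ", row groups: " << md->num_row_groups() << std::endl;
      return 0;
    }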

- Wes

>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.

Parquet-Java is pretty different:

* On the plus side, the built-in Java platform solves some of the
problems we have addressed in the Arrow platform APIs. Note that
Arrow hasn't reinvented any wheels here or failed to use tools
already available in the C++ standard library or Boost -- if you look
at major Google codebases like TensorFlow, you'll find that they have
developed nearly identical platform APIs to solve the same problems.

* Parquet-Java depends on Hadoop platform APIs, which has caused
problems for other Java projects that want to read and write Parquet
files but do not use Hadoop (e.g. because they store their data in S3).

>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi folks,
>>
>> We've been struggling for quite some time with the development
>> workflow between the Arrow and Parquet C++ (and Python) codebases.
>>
>> To explain the root issues:
>>
>> * parquet-cpp depends on "platform code" in Apache Arrow; this
>> includes file interfaces, memory management, miscellaneous algorithms
>> (e.g. dictionary encoding), etc. Note that before this "platform"
>> dependency was introduced, there was significant duplicated code
>> between these codebases and incompatible abstract interfaces for
>> things like files
>>
>> * we maintain a Arrow conversion code in parquet-cpp for converting
>> between Arrow columnar memory format and Parquet
>>
>> * we maintain Python bindings for parquet-cpp + Arrow interop in
>> Apache Arrow. This introduces a circular dependency into our CI.
>>
>> * Substantial portions of our CMake build system and related tooling
>> are duplicated between the Arrow and Parquet repos
>>
>> * API changes cause awkward release coordination issues between Arrow
>> and Parquet
>>
>> I believe the best way to remedy the situation is to adopt a
>> "Community over Code" approach and find a way for the Parquet and
>> Arrow C++ development communities to operate out of the same code
>> repository, i.e. the apache/arrow git repository.
>>
>> This would bring major benefits:
>>
>> * Shared CMake build infrastructure, developer tools, and CI
>> infrastructure (Parquet is already being built as a dependency in
>> Arrow's CI systems)
>>
>> * Share packaging and release management infrastructure
>>
>> * Reduce / eliminate problems due to API changes (where we currently
>> introduce breakage into our CI workflow when there is a breaking /
>> incompatible change)
>>
>> * Arrow releases would include a coordinated snapshot of the Parquet
>> implementation as it stands
>>
>> Continuing with the status quo has become unsatisfactory to me and as
>> a result I've become less motivated to work on the parquet-cpp
>> codebase.
>>
>> The only Parquet C++ committer who is not an Arrow committer is Deepak
>> Majeti. I think the issue of commit privileges could be resolved
>> without too much difficulty or time.
>>
>> I also think that, if it is deemed truly necessary, the Apache
>> Parquet community could create release scripts to cut a minimal
>> versioned Apache Parquet C++ release.
>>
>> I know that some people are wary of monorepos and megaprojects, but as
>> an example, TensorFlow is at least 10 times as large a project in
>> terms of LOCs and number of different platform components, and it
>> seems to be getting along just fine. I think we should be able to work
>> together as a community to function just as well.
>>
>> Interested in the opinions of others, and any other ideas for
>> practical solutions to the above problems.
>>
>> Thanks,
>> Wes
>>
>
>
> --
> regards,
> Deepak Majeti
