I don't have a direct stake in this beyond wanting to see Parquet be successful, but I thought I'd give my two cents.
For me, the thing that makes the biggest difference in contributing to a new codebase is the number of steps in the workflow for writing, testing, posting, and iterating on a commit, and the number of opportunities for missteps. The size of the repo and build/test times matter, but they are secondary so long as the workflow is simple and reliable. I don't really know what the current state of things is, but it sounds like it's not as simple as check out -> build -> test if you're doing a cross-repo change. Circular dependencies are a real headache.

On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi,
>
> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> > I think the circular dependency can be broken if we build a new
> > library for the platform code. This will also make it easy for other
> > projects such as ORC to use it.
> > I also remember your proposal a while ago of having a separate
> > project for the platform code. That project can live in the arrow
> > repo. However, one has to clone the entire apache arrow repo but can
> > just build the platform code. This will be temporary until we can
> > find a new home for it.
> >
> > The dependency will look like:
> > libarrow (arrow core / bindings) <- libparquet (parquet core) <-
> > libplatform (platform api)
> >
> > The CI workflow will clone the arrow project twice, once for the
> > platform library and once for the arrow-core/bindings library.
>
> This seems like an interesting proposal; the best place to work toward
> this goal (if it is even possible; the build system interactions and
> ASF release management are the hard problems) is to have all of the
> code in a single repository. ORC could already be using Arrow if it
> wanted, but the ORC contributors aren't active in Arrow.
>
> > There is no doubt that the collaborations between the Arrow and
> > Parquet communities so far have been very successful.
> > The reason to maintain this relationship moving forward is to
> > continue to reap the mutual benefits.
> > We should continue to take advantage of sharing code as well.
> > However, I don't see any code sharing opportunities between
> > arrow-core and the parquet-core. Both have different functions.
>
> I think you mean the Arrow columnar format. The Arrow columnar format
> is only one part of a project that has become quite large already
> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>
> > We are at a point where the parquet-cpp public API is pretty stable.
> > We already passed that difficult stage. My take on arrow and parquet
> > is to keep them nimble since we can.
>
> I believe that parquet-core still has progress ahead of it. We have
> done little work in asynchronous IO and concurrency, which would yield
> both improved read and write throughput. This aligns well with other
> concurrency and async-IO work planned in the Arrow platform. I believe
> that more development will happen on parquet-core once the development
> process issues are resolved by having a single codebase, a single
> build system, and a single CI framework.
>
> I have some gripes about design decisions made early in parquet-cpp,
> like the use of C++ exceptions. So while "stability" is a reasonable
> goal, I think we should still be open to making significant changes in
> the interest of long-term progress.
>
> Having now worked on these projects for more than two and a half
> years, and as the most frequent contributor to both codebases, I'm
> sadly far past the "breaking point" and not willing to continue
> contributing in a significant way to parquet-cpp if the projects
> remain structured as they are now. It's hampering progress and not
> serving the community.
>
> - Wes
>
> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> > The current Arrow adaptor code for parquet should live in the
> >> > arrow repo. That will remove a majority of the dependency issues.
> >> > Joshua's work would not have been blocked in parquet-cpp if that
> >> > adapter was in the arrow repo. This will be similar to the ORC
> >> > adaptor.
> >>
> >> This has been suggested before, but I don't see how it would
> >> alleviate any issues, because of the significant dependencies on
> >> other parts of the Arrow codebase. What you are proposing is:
> >>
> >> - (Arrow) arrow platform
> >> - (Parquet) parquet core
> >> - (Arrow) arrow columnar-parquet adapter interface
> >> - (Arrow) Python bindings
> >>
> >> To make this work, somehow Arrow core / libarrow would have to be
> >> built before invoking the Parquet core part of the build system. You
> >> would need to pass dependent targets across different CMake build
> >> systems; I don't know if that's possible (I spent some time looking
> >> into it earlier this year). This is what I meant by the lack of a
> >> "concrete and actionable plan". The only thing that would really
> >> work would be for the Parquet core to be "included" in the Arrow
> >> build system somehow, rather than using ExternalProject. Currently
> >> Parquet builds Arrow using ExternalProject, and Parquet is unknown
> >> to the Arrow build system because it's only depended upon by the
> >> Python bindings.
> >>
> >> And even if a solution could be devised, it would not wholly resolve
> >> the CI workflow issues.
> >>
> >> You could make Parquet completely independent of the Arrow codebase,
> >> but at that point there is little reason to maintain a relationship
> >> between the projects or their communities. We have spent a great
> >> deal of effort refactoring the two projects to enable as much code
> >> sharing as there is now.
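The ExternalProject arrangement Wes describes can be sketched roughly as below. This is an illustrative CMake fragment, not the actual parquet-cpp build files; the version, paths, and target names are invented.

```cmake
# Hypothetical sketch of how a parent project (parquet-cpp) builds a
# dependency (Arrow) via ExternalProject. All names/paths are illustrative.
include(ExternalProject)

ExternalProject_Add(arrow_ep
  URL "https://github.com/apache/arrow/archive/apache-arrow-0.10.0.tar.gz"
  CMAKE_ARGS -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/arrow_install
             -DARROW_BUILD_TESTS=OFF
  BUILD_BYPRODUCTS ${CMAKE_BINARY_DIR}/arrow_install/lib/libarrow.a)

# The external build exports no targets to the parent, so the parent must
# declare an IMPORTED library by hand and wire up the build ordering itself.
add_library(arrow STATIC IMPORTED)
set_target_properties(arrow PROPERTIES
  IMPORTED_LOCATION ${CMAKE_BINARY_DIR}/arrow_install/lib/libarrow.a)
add_dependencies(arrow arrow_ep)
```

Because `arrow` here is an opaque IMPORTED target rather than a real one, usage requirements and dependent targets cannot flow across the two build systems, which is the coupling problem described above.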
> >>
> >> - Wes
> >>
> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >> If you still strongly feel that the only way forward is to clone
> >> >> the parquet-cpp repo and part ways, I will withdraw my concern.
> >> >> Having two parquet-cpp repos is in no way a better approach.
> >> >
> >> > Yes, indeed. In my view, the next best option after a monorepo is
> >> > to fork. That would obviously be a bad outcome for the community.
> >> >
> >> > It doesn't look like I will be able to convince you that a
> >> > monorepo is a good idea; what I would ask instead is that you be
> >> > willing to give it a shot, and if it turns out the way you're
> >> > describing (which I don't think it will), then I suggest that we
> >> > fork at that point.
> >> >
> >> > - Wes
> >> >
> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> >> >> Wes,
> >> >>
> >> >> Unfortunately, I cannot show you any practical fact-based
> >> >> problems of a non-existent Arrow-Parquet mono-repo.
> >> >> Bringing in related Apache community experiences is more
> >> >> meaningful than how mono-repos work at Google and other big
> >> >> organizations.
> >> >> We solely depend on volunteers and cannot hire full-time
> >> >> developers.
> >> >> You are very well aware of how difficult it has been to find more
> >> >> contributors and maintainers for Arrow. parquet-cpp already has a
> >> >> low contribution rate to its core components.
> >> >>
> >> >> We should ensure that new volunteers who want to contribute
> >> >> bug-fixes/features spend the least amount of time figuring out
> >> >> the project repo. We can never come up with an automated build
> >> >> system that caters to every possible environment.
> >> >> My only concern is whether the mono-repo will make it harder for
> >> >> new developers to work on parquet-cpp core just due to the
> >> >> additional code, build, and test dependencies.
> >> >> I am not saying that the Arrow community/committers will be less
> >> >> co-operative.
> >> >> I just don't think the mono-repo structure model will be
> >> >> sustainable in an open source community unless there are
> >> >> long-term vested interests. We can't predict that.
> >> >>
> >> >> The current circular dependency problem between Arrow and Parquet
> >> >> is a major problem for the community, and it is important.
> >> >>
> >> >> The current Arrow adaptor code for parquet should live in the
> >> >> arrow repo. That will remove a majority of the dependency issues.
> >> >> Joshua's work would not have been blocked in parquet-cpp if that
> >> >> adapter was in the arrow repo. This will be similar to the ORC
> >> >> adaptor.
> >> >>
> >> >> The platform API code is pretty stable at this point. Minor
> >> >> future changes to this code should not be the main reason to
> >> >> combine the arrow and parquet repos.
> >> >>
> >> >> "*I question whether it's worth the community's time long term to
> >> >> wear ourselves out defining custom "ports" / virtual interfaces
> >> >> in each library to plug components together rather than utilizing
> >> >> common platform APIs.*"
> >> >>
> >> >> My answer to your question below would be "Yes".
> >> >> Modularity/separation is very important in an open source
> >> >> community where the priorities of contributors are often short
> >> >> term.
> >> >> Retention is low, and therefore the acquisition costs should be
> >> >> low as well. This is the community-over-code approach, in my
> >> >> view. Minor code duplication is not a deal breaker.
> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
> >> >> data space, serving their own functions.
> >> >>
> >> >> If you still strongly feel that the only way forward is to clone
> >> >> the parquet-cpp repo and part ways, I will withdraw my concern.
> >> >> Having two parquet-cpp repos is in no way a better approach.
> >> >>
> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>
> >> >>> @Antoine
> >> >>>
> >> >>> > By the way, one concern with the monorepo approach: it would
> >> >>> > slightly increase Arrow CI times (which are already too large).
> >> >>>
> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>>
> >> >>> A Parquet run takes about 28 minutes:
> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>>
> >> >>> Inevitably we will need to create some kind of bot to run
> >> >>> certain builds on demand based on commit / PR metadata or on
> >> >>> request.
> >> >>>
> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >> >>> made substantially shorter by moving some of the slower parts
> >> >>> (like the Python ASV benchmarks) from being tested every commit
> >> >>> to nightly or on demand. Using ASAN instead of valgrind in
> >> >>> Travis would also improve build times (the valgrind build could
> >> >>> be moved to a nightly exhaustive test run).
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >> I would like to point out that arrow's use of orc is a great
> >> >>> >> example of how it would be possible to manage parquet-cpp as
> >> >>> >> a separate codebase. That gives me hope that the projects
> >> >>> >> could be managed separately some day.
> >> >>> >
> >> >>> > Well, I don't know that ORC is the best example.
> >> >>> > The ORC C++ codebase features several areas of duplicated
> >> >>> > logic which could be replaced by components from the Arrow
> >> >>> > platform for better platform-wide interoperability:
> >> >>> >
> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >>> >
> >> >>> > ORC's use of symbols from Protocol Buffers was actually a
> >> >>> > cause of bugs that we had to fix in Arrow's build system to
> >> >>> > prevent them from leaking to third-party linkers when
> >> >>> > statically linked (ORC is only available for static linking at
> >> >>> > the moment AFAIK).
> >> >>> >
> >> >>> > I question whether it's worth the community's time long term
> >> >>> > to wear ourselves out defining custom "ports" / virtual
> >> >>> > interfaces in each library to plug components together rather
> >> >>> > than utilizing common platform APIs.
> >> >>> >
> >> >>> > - Wes
> >> >>> >
> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com> wrote:
> >> >>> >> Your point about the constraints of the ASF release process
> >> >>> >> is well taken, and as a developer who's trying to work in the
> >> >>> >> current environment, I would be much happier if the codebases
> >> >>> >> were merged. The main issues I worry about when you put
> >> >>> >> codebases like these together are:
> >> >>> >>
> >> >>> >> 1. The delineation of APIs becomes blurred and the code
> >> >>> >> becomes too coupled
> >> >>> >> 2. Release of artifacts that are lower in the dependency tree
> >> >>> >> is delayed by artifacts higher in the dependency tree
> >> >>> >>
> >> >>> >> If the project/release management is structured well and
> >> >>> >> someone keeps an eye on the coupling, then I don't have any
> >> >>> >> concerns.
> >> >>> >>
> >> >>> >> I would like to point out that arrow's use of orc is a great
> >> >>> >> example of how it would be possible to manage parquet-cpp as
> >> >>> >> a separate codebase. That gives me hope that the projects
> >> >>> >> could be managed separately some day.
> >> >>> >>
> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >>
> >> >>> >>> hi Josh,
> >> >>> >>>
> >> >>> >>> > I can imagine use cases for parquet that don't involve
> >> >>> >>> > arrow, and tying them together seems like the wrong
> >> >>> >>> > choice.
> >> >>> >>>
> >> >>> >>> Apache is "Community over Code"; right now it's the same
> >> >>> >>> people building these projects -- my argument (which I think
> >> >>> >>> you agree with?) is that we should work more closely
> >> >>> >>> together until the community grows large enough to support a
> >> >>> >>> larger-scope process than we have now. As you've seen, our
> >> >>> >>> process isn't serving developers of these projects.
> >> >>> >>>
> >> >>> >>> > I also think build tooling should be pulled into its own
> >> >>> >>> > codebase.
> >> >>> >>>
> >> >>> >>> I don't see how this can possibly be practical, taking into
> >> >>> >>> consideration the constraints imposed by the combination of
> >> >>> >>> the GitHub platform and the ASF release process. I'm all for
> >> >>> >>> being idealistic, but right now we need to be practical.
> >> >>> >>> Unless we can devise a practical procedure that can
> >> >>> >>> accommodate at least one patch per day which may touch both
> >> >>> >>> code and build system simultaneously, without being a
> >> >>> >>> hindrance to contributors or maintainers, I don't see how we
> >> >>> >>> can move forward.
> >> >>> >>>
> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >>> >>> > codebases in the short term with the express purpose of
> >> >>> >>> > separating them in the near term.
> >> >>> >>>
> >> >>> >>> I would agree, but only if separation can be demonstrated to
> >> >>> >>> be practical and to result in net improvements in
> >> >>> >>> productivity and community growth. I think experience has
> >> >>> >>> clearly demonstrated that the current separation is
> >> >>> >>> impractical and is causing problems.
> >> >>> >>>
> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >> >>> >>> development process and ASF releases separately. My argument
> >> >>> >>> is as follows:
> >> >>> >>>
> >> >>> >>> * Monorepo for development (for practicality)
> >> >>> >>> * Releases structured according to the desires of the PMCs
> >> >>> >>>
> >> >>> >>> - Wes
> >> >>> >>>
> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com> wrote:
> >> >>> >>> > I recently worked on an issue that had to be implemented
> >> >>> >>> > in parquet-cpp (ARROW-1644, ARROW-1599) but required
> >> >>> >>> > changes in arrow (ARROW-2585, ARROW-2586). I found the
> >> >>> >>> > circular dependencies confusing and hard to work with. For
> >> >>> >>> > example, I still have a PR open in parquet-cpp (created on
> >> >>> >>> > May 10) because of a PR that it depended on in arrow that
> >> >>> >>> > was only recently merged. I couldn't even address any CI
> >> >>> >>> > issues in the PR because the change in arrow was not yet
> >> >>> >>> > in master.
> >> >>> >>> > In a separate PR, I changed the run_clang_format.py
> >> >>> >>> > script in the arrow project, only to find out later that
> >> >>> >>> > there was an exact copy of it in parquet-cpp.
> >> >>> >>> >
> >> >>> >>> > However, I don't think merging the codebases makes sense
> >> >>> >>> > in the long term. I can imagine use cases for parquet that
> >> >>> >>> > don't involve arrow, and tying them together seems like
> >> >>> >>> > the wrong choice. There will be other formats that arrow
> >> >>> >>> > needs to support that will be kept separate (e.g., ORC),
> >> >>> >>> > so I don't see why parquet should be special. I also think
> >> >>> >>> > build tooling should be pulled into its own codebase. GNU
> >> >>> >>> > has had a long history of developing open source C/C++
> >> >>> >>> > projects that way, and made projects like
> >> >>> >>> > autoconf/automake/make to support them. I don't think CI
> >> >>> >>> > is a good counter-example, since there have been lots of
> >> >>> >>> > successful open source projects that have used nightly
> >> >>> >>> > build systems that pinned versions of dependent software.
> >> >>> >>> >
> >> >>> >>> > That being said, I think it makes sense to merge the
> >> >>> >>> > codebases in the short term with the express purpose of
> >> >>> >>> > separating them in the near term. My reasoning is as
> >> >>> >>> > follows. By putting the codebases together, you can more
> >> >>> >>> > easily delineate the boundaries between the APIs with a
> >> >>> >>> > single PR. Second, it will force the build tooling to
> >> >>> >>> > converge instead of diverge, which has already happened.
> >> >>> >>> > Once the boundaries and tooling have been sorted out, it
> >> >>> >>> > should be easy to separate them back into their own
> >> >>> >>> > codebases.
> >> >>> >>> >
> >> >>> >>> > If the codebases are merged, I would ask that the C++
> >> >>> >>> > codebase for arrow be separated from the other languages.
> >> >>> >>> > Looking at it from the perspective of a parquet-cpp
> >> >>> >>> > library user, having a dependency on Java is a large tax
> >> >>> >>> > to pay if you don't need it. For example, there were 25
> >> >>> >>> > JIRAs in the 0.10.0 release of arrow, many of which were
> >> >>> >>> > holding up the release. I hope that seems like a
> >> >>> >>> > reasonable compromise, and I think it will help reduce the
> >> >>> >>> > complexity of the build/release tooling.
> >> >>> >>> >
> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >>> >>> >
> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >>> >>
> >> >>> >>> >> > > The community will be less willing to accept large
> >> >>> >>> >> > > changes that require multiple rounds of patches for
> >> >>> >>> >> > > stability and API convergence. Our contributions to
> >> >>> >>> >> > > Libhdfs++ in the HDFS community took a significantly
> >> >>> >>> >> > > long time for the very same reason.
> >> >>> >>> >> >
> >> >>> >>> >> > Please don't use bad experiences from another open
> >> >>> >>> >> > source community as leverage in this discussion. I'm
> >> >>> >>> >> > sorry that things didn't go the way you wanted in
> >> >>> >>> >> > Apache Hadoop, but this is a distinct community which
> >> >>> >>> >> > happens to operate under a similar open governance
> >> >>> >>> >> > model.
> >> >>> >>> >>
> >> >>> >>> >> There are some more radical and community-building
> >> >>> >>> >> options as well. Take the subversion project as a
> >> >>> >>> >> precedent.
> >> >>> >>> >> With subversion, any Apache committer can request and
> >> >>> >>> >> receive a commit bit on some large fraction of
> >> >>> >>> >> subversion.
> >> >>> >>> >>
> >> >>> >>> >> So why not take this a bit further and give every parquet
> >> >>> >>> >> committer a commit bit in Arrow? Or even make them
> >> >>> >>> >> first-class committers in Arrow? Possibly even make it
> >> >>> >>> >> policy that every Parquet committer who asks will be
> >> >>> >>> >> given committer status in Arrow.
> >> >>> >>> >>
> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
> >> >>> >>> >> committers can't be worried at that point whether their
> >> >>> >>> >> patches will get merged; they can just merge them. Arrow
> >> >>> >>> >> shouldn't worry much about inviting in the Parquet
> >> >>> >>> >> committers. After all, Arrow already depends a lot on
> >> >>> >>> >> parquet, so why not invite them in?
> >> >>
> >> >> --
> >> >> regards,
> >> >> Deepak Majeti
> >
> > --
> > regards,
> > Deepak Majeti
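The layering proposed earlier in the thread (libplatform <- libparquet <- libarrow) becomes simple if everything lives in one build system. A hedged CMake sketch, with invented target and source names, of how those edges could look as ordinary in-tree targets:

```cmake
# Hypothetical single-build layering of the libraries discussed in the thread:
# libplatform (platform api) <- libparquet (parquet core) <- libarrow (bindings).
# Target names and source files are illustrative only.
add_library(platform src/platform/memory_pool.cc src/platform/status.cc)

add_library(parquet src/parquet/reader.cc src/parquet/writer.cc)
target_link_libraries(parquet PUBLIC platform)  # parquet core needs only the platform layer

add_library(arrow src/arrow/array.cc src/arrow/columnar_adapter.cc)
target_link_libraries(arrow PUBLIC parquet)     # adapter/bindings sit at the top

# Inside one build, the dependency edges are plain target links and usage
# requirements propagate automatically; split across two repositories, the
# same edges must cross an ExternalProject boundary by hand.
```

This is only a sketch of the proposal's shape, not a claim about how the actual merge was implemented.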