> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
Yes, indeed. In my view, the next best option after a monorepo is to
fork. That would obviously be a bad outcome for the community. It
doesn't look like I will be able to convince you that a monorepo is a
good idea; what I would ask instead is that you be willing to give it
a shot, and if it turns out the way you're describing (which I don't
think it will), then I suggest that we fork at that point.

- Wes

On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
> Wes,
>
> Unfortunately, I cannot show you any practical, fact-based problems with a
> non-existent Arrow-Parquet monorepo.
> Bringing in related Apache community experiences is more meaningful than
> how monorepos work at Google and other big organizations.
> We solely depend on volunteers and cannot hire full-time developers.
> You are very well aware of how difficult it has been to find more
> contributors and maintainers for Arrow. parquet-cpp already has a low
> contribution rate to its core components.
>
> We should ensure that new volunteers who want to contribute bug
> fixes/features spend the least amount of time figuring out the
> project repo. We can never come up with an automated build system that
> caters to every possible environment.
> My only concern is whether the monorepo will make it harder for new
> developers to work on the parquet-cpp core just due to the additional
> code, build, and test dependencies.
> I am not saying that the Arrow community/committers will be less
> cooperative.
> I just don't think the monorepo structure will be sustainable in an
> open source community unless there are long-term vested interests. We
> can't predict that.
>
> The current circular dependency problems between Arrow and Parquet are a
> major problem for the community, and solving them is important.
>
> The current Arrow adapter code for Parquet should live in the arrow repo.
> That will remove a majority of the dependency issues.
> Joshua's work would not have been blocked in parquet-cpp if that adapter
> was in the arrow repo. This will be similar to the ORC adapter.
>
> The platform API code is pretty stable at this point. Minor future
> changes to this code should not be the main reason to combine the arrow
> and parquet repos.
>
> "*I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.*"
>
> My answer to your question above would be "yes". Modularity/separation is
> very important in an open source community where the priorities of
> contributors are often short term.
> Retention is low, and therefore the acquisition costs should be low as
> well. That is the "community over code" approach, in my view. Minor code
> duplication is not a deal breaker.
> ORC, Parquet, Arrow, etc. are all different components in the big data
> space serving their own functions.
>
> If you still strongly feel that the only way forward is to clone the
> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> parquet-cpp repos is in no way a better approach.
>
> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> @Antoine
>>
>> > By the way, one concern with the monorepo approach: it would slightly
>> > increase Arrow CI times (which are already too large).
>>
>> A typical CI run in Arrow takes about 45 minutes:
>> https://travis-ci.org/apache/arrow/builds/410119750
>>
>> A Parquet run takes about 28 minutes:
>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>>
>> Inevitably we will need to create some kind of bot to run certain
>> builds on demand based on commit / PR metadata or on request.
>>
>> The slowest build in Arrow (the Arrow C++/Python one) could be
>> made substantially shorter by moving some of the slower parts (like
>> the Python ASV benchmarks) from being tested every commit to nightly
>> or on demand. Using ASAN instead of valgrind in Travis would also
>> improve build times (the valgrind build could be moved to a nightly,
>> exhaustive test run).
>>
>> - Wes
>>
>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >> I would like to point out that arrow's use of orc is a great example
>> >> of how it would be possible to manage parquet-cpp as a separate
>> >> codebase. That gives me hope that the projects could be managed
>> >> separately some day.
>> >
>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>> > features several areas of duplicated logic which could be replaced by
>> > components from the Arrow platform for better platform-wide
>> > interoperability:
>> >
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >
>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> > bugs that we had to fix in Arrow's build system to prevent them from
>> > leaking to third-party linkers when statically linked (ORC is only
>> > available for static linking at the moment, AFAIK).
>> >
>> > I question whether it's worth the community's time long term to wear
>> > ourselves out defining custom "ports" / virtual interfaces in each
>> > library to plug components together rather than utilizing common
>> > platform APIs.
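[Editor's note: the per-commit vs. nightly split proposed above can be
sketched as a Travis CI config fragment. This is an illustration only;
the `ARROW_TRAVIS_ASAN` / `ARROW_TRAVIS_VALGRIND` variable names are
hypothetical, not Arrow's actual build configuration.]

```yaml
# Sketch (assumed variable names): run the fast ASAN build on every
# push/PR, and move the slow valgrind build to Travis's cron trigger.
matrix:
  include:
    - name: "C++ / Python with ASAN (every commit)"
      env: ARROW_TRAVIS_ASAN=1
    - name: "C++ with valgrind (nightly cron only)"
      env: ARROW_TRAVIS_VALGRIND=1
      if: type = cron  # conditional build: only cron-scheduled runs
```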
>> >
>> > - Wes
>> >
>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck
>> > <joshuasto...@gmail.com> wrote:
>> >> Your point about the constraints of the ASF release process is well
>> >> taken, and as a developer who's trying to work in the current
>> >> environment I would be much happier if the codebases were merged. The
>> >> main issues I worry about when you put codebases like these together
>> >> are:
>> >>
>> >> 1. The delineation of APIs becomes blurred and the code becomes too
>> >> coupled
>> >> 2. Release of artifacts that are lower in the dependency tree is
>> >> delayed by artifacts higher in the dependency tree
>> >>
>> >> If the project/release management is structured well and someone
>> >> keeps an eye on the coupling, then I don't have any concerns.
>> >>
>> >> I would like to point out that arrow's use of orc is a great example
>> >> of how it would be possible to manage parquet-cpp as a separate
>> >> codebase. That gives me hope that the projects could be managed
>> >> separately some day.
>> >>
>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com>
>> >> wrote:
>> >>
>> >>> hi Josh,
>> >>>
>> >>> > I can imagine use cases for parquet that don't involve arrow and
>> >>> > tying them together seems like the wrong choice.
>> >>>
>> >>> Apache is "Community over Code"; right now it's the same people
>> >>> building these projects -- my argument (which I think you agree with?)
>> >>> is that we should work more closely together until the community grows
>> >>> large enough to support a larger-scope process than we have now. As
>> >>> you've seen, our process isn't serving developers of these projects.
>> >>>
>> >>> > I also think build tooling should be pulled into its own codebase.
>> >>>
>> >>> I don't see how this can possibly be practical taking into
>> >>> consideration the constraints imposed by the combination of the GitHub
>> >>> platform and the ASF release process.
>> >>> I'm all for being idealistic,
>> >>> but right now we need to be practical. Unless we can devise a
>> >>> practical procedure that can accommodate at least 1 patch per day
>> >>> which may touch both code and build system simultaneously without
>> >>> being a hindrance to contributor or maintainer, I don't see how we can
>> >>> move forward.
>> >>>
>> >>> > That being said, I think it makes sense to merge the codebases in
>> >>> > the short term with the express purpose of separating them in the
>> >>> > near term.
>> >>>
>> >>> I would agree, but only if separation can be demonstrated to be
>> >>> practical and to result in net improvements in productivity and
>> >>> community growth. I think experience has clearly demonstrated that
>> >>> the current separation is impractical and is causing problems.
>> >>>
>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >>> development process and ASF releases separately. My argument is as
>> >>> follows:
>> >>>
>> >>> * Monorepo for development (for practicality)
>> >>> * Releases structured according to the desires of the PMCs
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck
>> >>> <joshuasto...@gmail.com> wrote:
>> >>> > I recently worked on an issue that had to be implemented in
>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow
>> >>> > (ARROW-2585, ARROW-2586). I found the circular dependencies
>> >>> > confusing and hard to work with. For example, I still have a PR
>> >>> > open in parquet-cpp (created on May 10) because of a PR that it
>> >>> > depended on in arrow that was only recently merged. I couldn't even
>> >>> > address any CI issues in the PR because the change in arrow was not
>> >>> > yet in master. In a separate PR, I changed the run_clang_format.py
>> >>> > script in the arrow project only to find out later that there was
>> >>> > an exact copy of it in parquet-cpp.
>> >>> >
>> >>> > However, I don't think merging the codebases makes sense in the
>> >>> > long term. I can imagine use cases for parquet that don't involve
>> >>> > arrow, and tying them together seems like the wrong choice. There
>> >>> > will be other formats that arrow needs to support that will be kept
>> >>> > separate (e.g., ORC), so I don't see why parquet should be special.
>> >>> > I also think build tooling should be pulled into its own codebase.
>> >>> > GNU has had a long history of developing open source C/C++ projects
>> >>> > that way and created projects like autoconf/automake/make to
>> >>> > support them. I don't think CI is a good counter-example since
>> >>> > there have been lots of successful open source projects that have
>> >>> > used nightly build systems that pinned versions of dependent
>> >>> > software.
>> >>> >
>> >>> > That being said, I think it makes sense to merge the codebases in
>> >>> > the short term with the express purpose of separating them in the
>> >>> > near term. My reasoning is as follows. By putting the codebases
>> >>> > together, you can more easily delineate the boundaries between the
>> >>> > APIs with a single PR. Second, it will force the build tooling to
>> >>> > converge instead of diverge, which has already happened. Once the
>> >>> > boundaries and tooling have been sorted out, it should be easy to
>> >>> > separate them back into their own codebases.
>> >>> >
>> >>> > If the codebases are merged, I would ask that the C++ codebase for
>> >>> > arrow be separated from the other languages. Looking at it from the
>> >>> > perspective of a parquet-cpp library user, having a dependency on
>> >>> > Java is a large tax to pay if you don't need it. For example, there
>> >>> > were 25 JIRAs in the 0.10.0 release of arrow, many of which were
>> >>> > holding up the release.
>> >>> > I hope that
>> >>> > seems like a reasonable compromise, and I think it will help reduce
>> >>> > the complexity of the build/release tooling.
>> >>> >
>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com>
>> >>> > wrote:
>> >>> >
>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > > The community will be less willing to accept large
>> >>> >> > > changes that require multiple rounds of patches for stability
>> >>> >> > > and API convergence. Our contributions to Libhdfs++ in the
>> >>> >> > > HDFS community took a significantly long time for the very
>> >>> >> > > same reason.
>> >>> >> >
>> >>> >> > Please don't use bad experiences from another open source
>> >>> >> > community as leverage in this discussion. I'm sorry that things
>> >>> >> > didn't go the way you wanted in Apache Hadoop, but this is a
>> >>> >> > distinct community which happens to operate under a similar open
>> >>> >> > governance model.
>> >>> >>
>> >>> >> There are some more radical, community-building options as well.
>> >>> >> Take the Subversion project as a precedent. With Subversion, any
>> >>> >> Apache committer can request and receive a commit bit on some
>> >>> >> large fraction of Subversion.
>> >>> >>
>> >>> >> So why not take this a bit further and give every Parquet
>> >>> >> committer a commit bit in Arrow? Or even make them first-class
>> >>> >> committers in Arrow? Possibly even make it policy that every
>> >>> >> Parquet committer who asks will be given committer status in
>> >>> >> Arrow.
>> >>> >>
>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
>> >>> >> don't have to worry at that point whether their patches will get
>> >>> >> merged; they can just merge them. Arrow shouldn't worry much about
>> >>> >> inviting in the Parquet committers.
After all, Arrow already depends a lot on parquet so >> why not >> >>> >> invite them in? >> >>> >> >> >>> >> > > > -- > regards, > Deepak Majeti