I don't have an opinion here, but could someone send a summary of what is decided to the dev list once there is consensus? This is a long thread for parts of the project I don't work on, so I haven't followed it very closely.
On Tue, Aug 7, 2018 at 8:22 AM Wes McKinney <wesmck...@gmail.com> wrote:

> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
>
> I think we would use the following policy:
>
> * use PARQUET-XXX for issues relating to Parquet core
> * use ARROW-XXX for issues relating to Arrow's consumption of Parquet
>   core (e.g. changes that are in parquet/arrow right now)
>
> We've already been dealing with annoyances relating to issues
> straddling the two projects (debugging an issue on Arrow side to find
> that it has to be fixed on Parquet side); this would make things
> simpler for us
>
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history. I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
>
> This seems rather onerous for both contributors and maintainers and
> not in line with the goal of improving productivity. In the event that
> we fork I see it as a traumatic event for the community. If it does
> happen, then we can write a script (using git filter-branch and other
> such tools) to extract commits related to the forked code.
>
> - Wes
>
> On Tue, Aug 7, 2018 at 10:37 AM, Deepak Majeti <majeti.dee...@gmail.com>
> wrote:
> > I have a few more logistical questions to add.
> >
> > It will be difficult to track parquet-cpp changes if they get mixed with
> > Arrow changes. Will we establish some guidelines for filing Parquet JIRAs?
> > Can we enforce that parquet-cpp changes will not be committed without a
> > corresponding Parquet JIRA?
> >
> > I would also like to keep changes to parquet-cpp on a separate commit to
> > simplify forking later (if needed) and be able to maintain the commit
> > history.
> > I don't know if it's possible to squash parquet-cpp commits and
> > arrow commits separately before merging.
> >
> > On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Do other people have opinions? I would like to undertake this work in
> >> the near future (the next 8-10 weeks); I would be OK with taking
> >> responsibility for the primary codebase surgery.
> >>
> >> Some logistical questions:
> >>
> >> * We have a handful of pull requests in flight in parquet-cpp that
> >>   would need to be resolved / merged
> >> * We should probably cut a status-quo cpp-1.5.0 release, with future
> >>   releases cut out of the new structure
> >> * Management of shared commit rights (I can discuss with the Arrow
> >>   PMC; I believe that approving any committer who has actively
> >>   maintained parquet-cpp should be a reasonable approach per Ted's
> >>   comments)
> >>
> >> If working more closely together proves to not be working out after
> >> some period of time, I will be fully supportive of a fork or something
> >> like it
> >>
> >> Thanks,
> >> Wes
> >>
> >> On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> > Thanks Tim.
> >> >
> >> > Indeed, it's not very simple. Just today Antoine cleaned up some
> >> > platform code intending to improve the performance of bit-packing in
> >> > Parquet writes, and we ended up with 2 interdependent PRs
> >> >
> >> > * https://github.com/apache/parquet-cpp/pull/483
> >> > * https://github.com/apache/arrow/pull/2355
> >> >
> >> > Changes that impact the Python interface to Parquet are even more
> >> > complex.
> >> >
> >> > Adding options to Arrow's CMake build system to only build
> >> > Parquet-related code and dependencies (in a monorepo framework) would
> >> > not be difficult, and would amount to writing "make parquet".
> >> >
> >> > See e.g. https://stackoverflow.com/a/17201375.
> >> > The desired commands to
> >> > build and install the Parquet core libraries and their dependencies
> >> > would be:
> >> >
> >> > ninja parquet && ninja install
> >> >
> >> > - Wes
> >> >
> >> > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
> >> > <tarmstr...@cloudera.com.invalid> wrote:
> >> >> I don't have a direct stake in this beyond wanting to see Parquet be
> >> >> successful, but I thought I'd give my two cents.
> >> >>
> >> >> For me, the thing that makes the biggest difference in contributing to a
> >> >> new codebase is the number of steps in the workflow for writing, testing,
> >> >> posting and iterating on a commit and also the number of opportunities for
> >> >> missteps. The size of the repo and build/test times matter but are
> >> >> secondary so long as the workflow is simple and reliable.
> >> >>
> >> >> I don't really know what the current state of things is, but it sounds like
> >> >> it's not as simple as check out -> build -> test if you're doing a
> >> >> cross-repo change. Circular dependencies are a real headache.
> >> >>
> >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>
> >> >>> hi,
> >> >>>
> >> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.dee...@gmail.com>
> >> >>> wrote:
> >> >>> > I think the circular dependency can be broken if we build a new library for
> >> >>> > the platform code. This will also make it easy for other projects such as
> >> >>> > ORC to use it.
> >> >>> > I also remember your proposal a while ago of having a separate project for
> >> >>> > the platform code. That project can live in the arrow repo. However, one
> >> >>> > has to clone the entire apache arrow repo but can just build the platform
> >> >>> > code. This will be temporary until we can find a new home for it.
> >> >>> >
> >> >>> > The dependency will look like:
> >> >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
> >> >>> > libplatform(platform api)
> >> >>> >
> >> >>> > CI workflow will clone the arrow project twice, once for the platform
> >> >>> > library and once for the arrow-core/bindings library.
> >> >>>
> >> >>> This seems like an interesting proposal; the best place to work toward
> >> >>> this goal (if it is even possible; the build system interactions and
> >> >>> ASF release management are the hard problems) is to have all of the
> >> >>> code in a single repository. ORC could already be using Arrow if it
> >> >>> wanted, but the ORC contributors aren't active in Arrow.
> >> >>>
> >> >>> >
> >> >>> > There is no doubt that the collaborations between the Arrow and Parquet
> >> >>> > communities so far have been very successful.
> >> >>> > The reason to maintain this relationship moving forward is to continue to
> >> >>> > reap the mutual benefits.
> >> >>> > We should continue to take advantage of sharing code as well. However, I
> >> >>> > don't see any code sharing opportunities between arrow-core and the
> >> >>> > parquet-core. Both have different functions.
> >> >>>
> >> >>> I think you mean the Arrow columnar format. The Arrow columnar format
> >> >>> is only one part of a project that has become quite large already
> >> >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
> >> >>>
> >> >>> >
> >> >>> > We are at a point where the parquet-cpp public API is pretty stable. We
> >> >>> > already passed that difficult stage. My take on arrow and parquet is to
> >> >>> > keep them nimble since we can.
> >> >>>
> >> >>> I believe that parquet-core has progress to make yet ahead of it. We
> >> >>> have done little work in asynchronous IO and concurrency which would
> >> >>> yield both improved read and write throughput.
> >> >>> This aligns well with
> >> >>> other concurrency and async-IO work planned in the Arrow platform. I
> >> >>> believe that more development will happen on parquet-core once the
> >> >>> development process issues are resolved by having a single codebase,
> >> >>> single build system, and a single CI framework.
> >> >>>
> >> >>> I have some gripes about design decisions made early in parquet-cpp,
> >> >>> like the use of C++ exceptions. So while "stability" is a reasonable
> >> >>> goal I think we should still be open to making significant changes in
> >> >>> the interest of long term progress.
> >> >>>
> >> >>> Having now worked on these projects for more than 2 and a half years,
> >> >>> and as the most frequent contributor to both codebases, I'm sadly far
> >> >>> past the "breaking point" and not willing to continue contributing in
> >> >>> a significant way to parquet-cpp if the projects remain structured
> >> >>> as they are now. It's hampering progress and not serving the
> >> >>> community.
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> >
> >> >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >
> >> >>> >> > The current Arrow adaptor code for parquet should live in the arrow
> >> >>> >> > repo. That will remove a majority of the dependency issues. Joshua's
> >> >>> >> > work would not have been blocked in parquet-cpp if that adapter was in
> >> >>> >> > the arrow repo. This will be similar to the ORC adaptor.
> >> >>> >>
> >> >>> >> This has been suggested before, but I don't see how it would alleviate
> >> >>> >> any issues because of the significant dependencies on other parts of
> >> >>> >> the Arrow codebase.
> >> >>> >> What you are proposing is:
> >> >>> >>
> >> >>> >> - (Arrow) arrow platform
> >> >>> >> - (Parquet) parquet core
> >> >>> >> - (Arrow) arrow columnar-parquet adapter interface
> >> >>> >> - (Arrow) Python bindings
> >> >>> >>
> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be
> >> >>> >> built before invoking the Parquet core part of the build system. You
> >> >>> >> would need to pass dependent targets across different CMake build
> >> >>> >> systems; I don't know if it's possible (I spent some time looking into
> >> >>> >> it earlier this year). This is what I meant by the lack of a "concrete
> >> >>> >> and actionable plan". The only thing that would really work would be
> >> >>> >> for the Parquet core to be "included" in the Arrow build system
> >> >>> >> somehow rather than using ExternalProject. Currently Parquet builds
> >> >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
> >> >>> >> system because it's only depended upon by the Python bindings.
> >> >>> >>
> >> >>> >> And even if a solution could be devised, it would not wholly resolve
> >> >>> >> the CI workflow issues.
> >> >>> >>
> >> >>> >> You could make Parquet completely independent of the Arrow codebase,
> >> >>> >> but at that point there is little reason to maintain a relationship
> >> >>> >> between the projects or their communities. We have spent a great deal
> >> >>> >> of effort refactoring the two projects to enable as much code sharing
> >> >>> >> as there is now.
> >> >>> >>
> >> >>> >> - Wes
> >> >>> >>
> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >> >> If you still strongly feel that the only way forward is to clone the
> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> >> >>> >> >> parquet-cpp repos is in no way a better approach.
> >> >>> >> >
> >> >>> >> > Yes, indeed.
> >> >>> >> > In my view, the next best option after a monorepo is to
> >> >>> >> > fork. That would obviously be a bad outcome for the community.
> >> >>> >> >
> >> >>> >> > It doesn't look like I will be able to convince you that a monorepo is
> >> >>> >> > a good idea; what I would ask instead is that you be willing to give
> >> >>> >> > it a shot, and if it turns out in the way you're describing (which I
> >> >>> >> > don't think it will) then I suggest that we fork at that point.
> >> >>> >> >
> >> >>> >> > - Wes
> >> >>> >> >
> >> >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <majeti.dee...@gmail.com>
> >> >>> >> > wrote:
> >> >>> >> >> Wes,
> >> >>> >> >>
> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based problems of a
> >> >>> >> >> non-existent Arrow-Parquet mono-repo.
> >> >>> >> >> Bringing in related Apache community experiences is more meaningful than
> >> >>> >> >> how mono-repos work at Google and other big organizations.
> >> >>> >> >> We solely depend on volunteers and cannot hire full-time developers.
> >> >>> >> >> You are very well aware of how difficult it has been to find more
> >> >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
> >> >>> >> >> contribution rate to its core components.
> >> >>> >> >>
> >> >>> >> >> We should aim to ensure that new volunteers who want to contribute
> >> >>> >> >> bug-fixes/features spend the least amount of time figuring out
> >> >>> >> >> the project repo. We can never come up with an automated build system that
> >> >>> >> >> caters to every possible environment.
> >> >>> >> >> My only concern is whether the mono-repo will make it harder for new
> >> >>> >> >> developers to work on parquet-cpp core just due to the additional code,
> >> >>> >> >> build and test dependencies.
> >> >>> >> >> I am not saying that the Arrow community/committers will be less
> >> >>> >> >> co-operative.
> >> >>> >> >> I just don't think the mono-repo structure model will be sustainable in an
> >> >>> >> >> open source community unless there are long-term vested interests. We can't
> >> >>> >> >> predict that.
> >> >>> >> >>
> >> >>> >> >> The current circular dependency problems between Arrow and Parquet are a
> >> >>> >> >> major problem for the community and it is important.
> >> >>> >> >>
> >> >>> >> >> The current Arrow adaptor code for parquet should live in the arrow repo.
> >> >>> >> >> That will remove a majority of the dependency issues.
> >> >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that adapter
> >> >>> >> >> was in the arrow repo. This will be similar to the ORC adaptor.
> >> >>> >> >>
> >> >>> >> >> The platform API code is pretty stable at this point. Minor changes in the
> >> >>> >> >> future to this code should not be the main reason to combine the arrow and
> >> >>> >> >> parquet repos.
> >> >>> >> >>
> >> >>> >> >> "*I question whether it's worth the community's time long term to wear
> >> >>> >> >> ourselves out defining custom "ports" / virtual interfaces in each library
> >> >>> >> >> to plug components together rather than utilizing common platform APIs.*"
> >> >>> >> >>
> >> >>> >> >> My answer to your question below would be "Yes". Modularity/separation is
> >> >>> >> >> very important in an open source community where priorities of contributors
> >> >>> >> >> are often short term.
> >> >>> >> >> The retention is low and therefore the acquisition costs should be low as
> >> >>> >> >> well. This is the community over code approach according to me.
> >> >>> >> >> Minor code
> >> >>> >> >> duplication is not a deal breaker.
> >> >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big data
> >> >>> >> >> space serving their own functions.
> >> >>> >> >>
> >> >>> >> >> If you still strongly feel that the only way forward is to clone the
> >> >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
> >> >>> >> >> parquet-cpp repos is in no way a better approach.
> >> >>> >> >>
> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> >>> >> >>
> >> >>> >> >>> @Antoine
> >> >>> >> >>>
> >> >>> >> >>> > By the way, one concern with the monorepo approach: it would slightly
> >> >>> >> >>> > increase Arrow CI times (which are already too large).
> >> >>> >> >>>
> >> >>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
> >> >>> >> >>>
> >> >>> >> >>> A Parquet run takes about 28 minutes:
> >> >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
> >> >>> >> >>>
> >> >>> >> >>> Inevitably we will need to create some kind of bot to run certain
> >> >>> >> >>> builds on-demand based on commit / PR metadata or on request.
> >> >>> >> >>>
> >> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
> >> >>> >> >>> made substantially shorter by moving some of the slower parts (like
> >> >>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
> >> >>> >> >>> or on demand.
> >> >>> >> >>> Using ASAN instead of valgrind in Travis would also
> >> >>> >> >>> improve build times (valgrind build could be moved to a nightly
> >> >>> >> >>> exhaustive test run)
> >> >>> >> >>>
> >> >>> >> >>> - Wes
> >> >>> >> >>>
> >> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com>
> >> >>> >> >>> wrote:
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great
> >> >>> >> >>> >> example of how it would be possible to manage parquet-cpp as a
> >> >>> >> >>> >> separate codebase. That gives me hope that the projects could be
> >> >>> >> >>> >> managed separately some day.
> >> >>> >> >>> >
> >> >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
> >> >>> >> >>> > features several areas of duplicated logic which could be replaced by
> >> >>> >> >>> > components from the Arrow platform for better platform-wide
> >> >>> >> >>> > interoperability:
> >> >>> >> >>> >
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> >> >>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
> >> >>> >> >>> >
> >> >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
> >> >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them from
> >> >>> >> >>> > leaking to third party linkers when statically linked (ORC is only
> >> >>> >> >>> > available for static linking at the moment AFAIK).
> >> >>> >> >>> >
> >> >>> >> >>> > I question whether it's worth the community's time long term to wear
> >> >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
> >> >>> >> >>> > library to plug components together rather than utilizing common
> >> >>> >> >>> > platform APIs.
> >> >>> >> >>> >
> >> >>> >> >>> > - Wes
> >> >>> >> >>> >
> >> >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com>
> >> >>> >> >>> > wrote:
> >> >>> >> >>> >> Your point about the constraints of the ASF release process is well
> >> >>> >> >>> >> taken, and as a developer who's trying to work in the current
> >> >>> >> >>> >> environment I would be much happier if the codebases were merged. The
> >> >>> >> >>> >> main issues I worry about when you put codebases like these together are:
> >> >>> >> >>> >>
> >> >>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes too
> >> >>> >> >>> >>    coupled
> >> >>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree is
> >> >>> >> >>> >>    delayed by artifacts higher in the dependency tree
> >> >>> >> >>> >>
> >> >>> >> >>> >> If the project/release management is structured well and someone keeps
> >> >>> >> >>> >> an eye on the coupling, then I don't have any concerns.
> >> >>> >> >>> >>
> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great example of
> >> >>> >> >>> >> how it would be possible to manage parquet-cpp as a separate codebase.
> >> >>> >> >>> >> That gives me hope that the projects could be managed separately some
> >> >>> >> >>> >> day.
> >> >>> >> >>> >>
> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com>
> >> >>> >> >>> >> wrote:
> >> >>> >> >>> >>
> >> >>> >> >>> >>> hi Josh,
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow and
> >> >>> >> >>> >>> > tying them together seems like the wrong choice.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
> >> >>> >> >>> >>> building these projects -- my argument (which I think you agree with?)
> >> >>> >> >>> >>> is that we should work more closely together until the community grows
> >> >>> >> >>> >>> large enough to support larger-scope process than we have now. As
> >> >>> >> >>> >>> you've seen, our process isn't serving developers of these projects.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > I also think build tooling should be pulled into its own codebase.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I don't see how this can possibly be practical taking into
> >> >>> >> >>> >>> consideration the constraints imposed by the combination of the GitHub
> >> >>> >> >>> >>> platform and the ASF release process. I'm all for being idealistic,
> >> >>> >> >>> >>> but right now we need to be practical. Unless we can devise a
> >> >>> >> >>> >>> practical procedure that can accommodate at least 1 patch per day
> >> >>> >> >>> >>> which may touch both code and build system simultaneously without
> >> >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how we can
> >> >>> >> >>> >>> move forward.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases in
> >> >>> >> >>> >>> > the short term with the express purpose of separating them in the
> >> >>> >> >>> >>> > near term.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> I would agree but only if separation can be demonstrated to be
> >> >>> >> >>> >>> practical and result in net improvements in productivity and community
> >> >>> >> >>> >>> growth. I think experience has clearly demonstrated that the current
> >> >>> >> >>> >>> separation is impractical, and is causing problems.
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
> >> >>> >> >>> >>> development process and ASF releases separately. My argument is as
> >> >>> >> >>> >>> follows:
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> * Monorepo for development (for practicality)
> >> >>> >> >>> >>> * Releases structured according to the desires of the PMCs
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> - Wes
> >> >>> >> >>> >>>
> >> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com>
> >> >>> >> >>> >>> wrote:
> >> >>> >> >>> >>> > I recently worked on an issue that had to be implemented in
> >> >>> >> >>> >>> > parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow
> >> >>> >> >>> >>> > (ARROW-2585, ARROW-2586). I found the circular dependencies confusing
> >> >>> >> >>> >>> > and hard to work with. For example, I still have a PR open in
> >> >>> >> >>> >>> > parquet-cpp (created on May 10) because of a PR that it depended on
> >> >>> >> >>> >>> > in arrow that was recently merged.
> >> >>> >> >>> >>> > I couldn't even address any CI issues in the PR because the change in
> >> >>> >> >>> >>> > arrow was not yet in master. In a separate PR, I changed the
> >> >>> >> >>> >>> > run_clang_format.py script in the arrow project only to find out
> >> >>> >> >>> >>> > later that there was an exact copy of it in parquet-cpp.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > However, I don't think merging the codebases makes sense in the long
> >> >>> >> >>> >>> > term. I can imagine use cases for parquet that don't involve arrow and
> >> >>> >> >>> >>> > tying them together seems like the wrong choice. There will be other
> >> >>> >> >>> >>> > formats that arrow needs to support that will be kept separate
> >> >>> >> >>> >>> > (e.g. - Orc), so I don't see why parquet should be special. I also
> >> >>> >> >>> >>> > think build tooling should be pulled into its own codebase. GNU has
> >> >>> >> >>> >>> > had a long history of developing open source C/C++ projects that way
> >> >>> >> >>> >>> > and made projects like autoconf/automake/make to support them. I
> >> >>> >> >>> >>> > don't think CI is a good counter-example since there have been lots
> >> >>> >> >>> >>> > of successful open source projects that have used nightly build
> >> >>> >> >>> >>> > systems that pinned versions of dependent software.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases in the
> >> >>> >> >>> >>> > short term with the express purpose of separating them in the near
> >> >>> >> >>> >>> > term.
> >> >>> >> >>> >>> > My
> >> >>> >> >>> >>> > reasoning is as follows. By putting the codebases together, you can
> >> >>> >> >>> >>> > more easily delineate the boundaries between the APIs with a single
> >> >>> >> >>> >>> > PR. Second, it will force the build tooling to converge instead of
> >> >>> >> >>> >>> > diverge, which has already happened. Once the boundaries and tooling
> >> >>> >> >>> >>> > have been sorted out, it should be easy to separate them back into
> >> >>> >> >>> >>> > their own codebases.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > If the codebases are merged, I would ask that the C++ codebases for
> >> >>> >> >>> >>> > arrow be separated from other languages. Looking at it from the
> >> >>> >> >>> >>> > perspective of a parquet-cpp library user, having a dependency on
> >> >>> >> >>> >>> > Java is a large tax to pay if you don't need it. For example, there
> >> >>> >> >>> >>> > were 25 JIRAs in the 0.10.0 release of arrow, many of which were
> >> >>> >> >>> >>> > holding up the release. I hope that seems like a reasonable
> >> >>> >> >>> >>> > compromise, and I think it will help reduce the complexity of the
> >> >>> >> >>> >>> > build/release tooling.
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com>
> >> >>> >> >>> >>> > wrote:
> >> >>> >> >>> >>> >
> >> >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com>
> >> >>> >> >>> >>> >> wrote:
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> > > The community will be less willing to accept large
> >> >>> >> >>> >>> >> > > changes that require multiple rounds of patches for stability
> >> >>> >> >>> >>> >> > > and API convergence. Our contributions to Libhdfs++ in the HDFS
> >> >>> >> >>> >>> >> > > community took a significantly long time for the very same
> >> >>> >> >>> >>> >> > > reason.
> >> >>> >> >>> >>> >> >
> >> >>> >> >>> >>> >> > Please don't use bad experiences from another open source
> >> >>> >> >>> >>> >> > community as leverage in this discussion. I'm sorry that things
> >> >>> >> >>> >>> >> > didn't go the way you wanted in Apache Hadoop but this is a
> >> >>> >> >>> >>> >> > distinct community which happens to operate under a similar open
> >> >>> >> >>> >>> >> > governance model.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> There are some more radical and community building options as well.
> >> >>> >> >>> >>> >> Take the subversion project as a precedent. With subversion, any
> >> >>> >> >>> >>> >> Apache committer can request and receive a commit bit on some large
> >> >>> >> >>> >>> >> fraction of subversion.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> So why not take this a bit further and give every parquet committer
> >> >>> >> >>> >>> >> a commit bit in Arrow?
> >> >>> >> >>> >>> >> Or even make them be first class committers in Arrow?
> >> >>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who asks
> >> >>> >> >>> >>> >> will be given committer status in Arrow.
> >> >>> >> >>> >>> >>
> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet committers
> >> >>> >> >>> >>> >> can't be worried at that point whether their patches will get
> >> >>> >> >>> >>> >> merged; they can just merge them. Arrow shouldn't worry much about
> >> >>> >> >>> >>> >> inviting in the Parquet committers. After all, Arrow already depends
> >> >>> >> >>> >>> >> a lot on parquet so why not invite them in?
> >> >>> >> >>> >>> >>
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> regards,
> >> >>> >> >> Deepak Majeti
> >> >>> >>
> >> >>> >
> >> >>> > --
> >> >>> > regards,
> >> >>> > Deepak Majeti
> >> >>>
> >
> > --
> > regards,
> > Deepak Majeti
>

--
Ryan Blue
Software Engineer
Netflix