Do other people have opinions? I would like to undertake this work in the near future (the next 8-10 weeks); I would be OK with taking responsibility for the primary codebase surgery.
Some logistical questions:

* We have a handful of pull requests in flight in parquet-cpp that would need to be resolved / merged
* We should probably cut a status-quo cpp-1.5.0 release, with future releases cut out of the new structure
* Management of shared commit rights (I can discuss with the Arrow PMC; I believe that approving any committer who has actively maintained parquet-cpp should be a reasonable approach per Ted's comments)

If working more closely together proves not to work out after some period of time, I will be fully supportive of a fork or something like it.

Thanks,
Wes

On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com> wrote: > Thanks Tim. > > Indeed, it's not very simple. Just today Antoine cleaned up some > platform code intending to improve the performance of bit-packing in > Parquet writes, and we ended up with 2 interdependent PRs > > * https://github.com/apache/parquet-cpp/pull/483 > * https://github.com/apache/arrow/pull/2355 > > Changes that impact the Python interface to Parquet are even more complex. > > Adding options to Arrow's CMake build system to only build > Parquet-related code and dependencies (in a monorepo framework) would > not be difficult, and would amount to writing "make parquet". > > See e.g. https://stackoverflow.com/a/17201375. The desired commands to > build and install the Parquet core libraries and their dependencies > would be: > > ninja parquet && ninja install > > - Wes > > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong > <tarmstr...@cloudera.com.invalid> wrote: >> I don't have a direct stake in this beyond wanting to see Parquet be >> successful, but I thought I'd give my two cents. >> >> For me, the thing that makes the biggest difference in contributing to a >> new codebase is the number of steps in the workflow for writing, testing, >> posting and iterating on a commit, and also the number of opportunities for >> missteps.
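The "ninja parquet" workflow mentioned earlier in the thread could be wired up with an umbrella target in Arrow's CMake. A minimal sketch (the `ARROW_PARQUET` option and the `parquet_shared`/`parquet_static` target names are illustrative assumptions, not the actual build files):

```cmake
# Opt-in flag: build only the Parquet core and its platform dependencies.
option(ARROW_PARQUET "Build the Parquet core libraries" OFF)

if(ARROW_PARQUET)
  # Parquet sources live in the monorepo alongside the Arrow platform code.
  add_subdirectory(src/parquet)

  # Umbrella target so that `ninja parquet` builds the Parquet libraries
  # (and, transitively, the platform code they link against) and nothing else.
  add_custom_target(parquet)
  add_dependencies(parquet parquet_shared parquet_static)
endif()
```

With something along these lines, `cmake -GNinja -DARROW_PARQUET=ON .. && ninja parquet` would be the entire build step for a Parquet-only contributor.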
The size of the repo and build/test times matter but are >> secondary so long as the workflow is simple and reliable. >> >> I don't really know what the current state of things is, but it sounds like >> it's not as simple as check out -> build -> test if you're doing a >> cross-repo change. Circular dependencies are a real headache. >> >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> >>> hi, >>> >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.dee...@gmail.com> >>> wrote: >>> > I think the circular dependency can be broken if we build a new library >>> for >>> > the platform code. This will also make it easy for other projects such as >>> > ORC to use it. >>> > I also remember your proposal a while ago of having a separate project >>> for >>> > the platform code. That project can live in the arrow repo. However, one >>> > has to clone the entire apache arrow repo but can just build the platform >>> > code. This will be temporary until we can find a new home for it. >>> > >>> > The dependency will look like: >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <- >>> > libplatform(platform api) >>> > >>> > CI workflow will clone the arrow project twice, once for the platform >>> > library and once for the arrow-core/bindings library. >>> >>> This seems like an interesting proposal; the best place to work toward >>> this goal (if it is even possible; the build system interactions and >>> ASF release management are the hard problems) is to have all of the >>> code in a single repository. ORC could already be using Arrow if it >>> wanted, but the ORC contributors aren't active in Arrow. >>> >>> > >>> > There is no doubt that the collaborations between the Arrow and Parquet >>> > communities so far have been very successful. >>> > The reason to maintain this relationship moving forward is to continue to >>> > reap the mutual benefits. >>> > We should continue to take advantage of sharing code as well. 
However, I >>> > don't see any code sharing opportunities between arrow-core and the >>> > parquet-core. Both have different functions. >>> >>> I think you mean the Arrow columnar format. The Arrow columnar format >>> is only one part of a project that has become quite large already >>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development- >>> platform-for-inmemory-data-105427919). >>> >>> > >>> > We are at a point where the parquet-cpp public API is pretty stable. We >>> > already passed that difficult stage. My take on arrow and parquet is to >>> > keep them nimble while we can. >>> >>> I believe that parquet-core still has significant progress ahead of it. We >>> have done little work on asynchronous IO and concurrency, which would >>> yield both improved read and write throughput. This aligns well with >>> other concurrency and async-IO work planned in the Arrow platform. I >>> believe that more development will happen on parquet-core once the >>> development process issues are resolved by having a single codebase, >>> single build system, and a single CI framework. >>> >>> I have some gripes about design decisions made early in parquet-cpp, >>> like the use of C++ exceptions. So while "stability" is a reasonable >>> goal, I think we should still be open to making significant changes in >>> the interest of long-term progress. >>> >>> Having now worked on these projects for more than 2 and a half years, >>> and having been the most frequent contributor to both codebases, I'm sadly far >>> past the "breaking point" and not willing to continue contributing in >>> a significant way to parquet-cpp if the projects remain structured >>> as they are now. It's hampering progress and not serving the >>> community. >>> >>> - Wes >>> >>> > >>> > >>> > >>> > >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> >>> wrote: >>> > >>> >> > The current Arrow adaptor code for parquet should live in the arrow
That will remove a majority of the dependency issues. Joshua's >>> work >>> >> would not have been blocked in parquet-cpp if that adapter was in the >>> arrow >>> >> repo. This will be similar to the ORC adaptor. >>> >> >>> >> This has been suggested before, but I don't see how it would alleviate >>> >> any issues because of the significant dependencies on other parts of >>> >> the Arrow codebase. What you are proposing is: >>> >> >>> >> - (Arrow) arrow platform >>> >> - (Parquet) parquet core >>> >> - (Arrow) arrow columnar-parquet adapter interface >>> >> - (Arrow) Python bindings >>> >> >>> >> To make this work, somehow Arrow core / libarrow would have to be >>> >> built before invoking the Parquet core part of the build system. You >>> >> would need to pass dependent targets across different CMake build >>> >> systems; I don't know if it's possible (I spent some time looking into >>> >> it earlier this year). This is what I meant by the lack of a "concrete >>> >> and actionable plan". The only thing that would really work would be >>> >> for the Parquet core to be "included" in the Arrow build system >>> >> somehow rather than using ExternalProject. Currently Parquet builds >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build >>> >> system because it's only depended upon by the Python bindings. >>> >> >>> >> And even if a solution could be devised, it would not wholly resolve >>> >> the CI workflow issues. >>> >> >>> >> You could make Parquet completely independent of the Arrow codebase, >>> >> but at that point there is little reason to maintain a relationship >>> >> between the projects or their communities. We have spent a great deal >>> >> of effort refactoring the two projects to enable as much code sharing >>> >> as there is now. 
>>> >> >>> >> - Wes >>> >> >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> >>> wrote: >>> >> >> If you still strongly feel that the only way forward is to clone the >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >>> >> parquet-cpp repos is in no way a better approach. >>> >> > >>> >> > Yes, indeed. In my view, the next best option after a monorepo is to >>> >> > fork. That would obviously be a bad outcome for the community. >>> >> > >>> >> > It doesn't look like I will be able to convince you that a monorepo is >>> >> > a good idea; what I would ask instead is that you be willing to give >>> >> > it a shot, and if it turns out the way you're describing (which I >>> >> > don't think it will) then I suggest that we fork at that point. >>> >> > >>> >> > - Wes >>> >> > >>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti < >>> majeti.dee...@gmail.com> >>> >> wrote: >>> >> >> Wes, >>> >> >> >>> >> >> Unfortunately, I cannot show you any practical fact-based problems >>> of a >>> >> >> non-existent Arrow-Parquet mono-repo. >>> >> >> Bringing in related Apache community experiences is more meaningful >>> >> than >>> >> >> how mono-repos work at Google and other big organizations. >>> >> >> We depend solely on volunteers and cannot hire full-time developers. >>> >> >> You are very well aware of how difficult it has been to find more >>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low >>> >> >> contribution rate to its core components. >>> >> >> >>> >> >> We should aim to ensure that new volunteers who want to contribute >>> >> >> bug-fixes/features spend the least amount of time figuring >>> out >>> >> >> the project repo. We can never come up with an automated build system >>> >> that >>> >> >> caters to every possible environment.
>>> >> >> My only concern is that the mono-repo will make it harder for new >>> >> developers >>> >> >> to work on the parquet-cpp core just due to the additional code, build >>> and >>> >> test >>> >> >> dependencies. >>> >> >> I am not saying that the Arrow community/committers will be less >>> >> >> co-operative. >>> >> >> I just don't think the mono-repo structure will be sustainable >>> in >>> >> an >>> >> >> open source community unless there are long-term vested interests. We >>> >> can't >>> >> >> predict that. >>> >> >> >>> >> >> The current circular dependency problems between Arrow and Parquet >>> are a >>> >> >> major problem for the community, and fixing them is important. >>> >> >> >>> >> >> The current Arrow adaptor code for parquet should live in the arrow >>> >> repo. >>> >> >> That will remove a majority of the dependency issues. >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that >>> adapter >>> >> >> was in the arrow repo. This will be similar to the ORC adaptor. >>> >> >> >>> >> >> The platform API code is pretty stable at this point. Minor changes >>> in >>> >> the >>> >> >> future to this code should not be the main reason to combine the >>> arrow >>> >> and >>> >> >> parquet repos. >>> >> >> >>> >> >> " >>> >> >> *I question whether it's worth the community's time long term to >>> wear* >>> >> >> >>> >> >> >>> >> >> *ourselves out defining custom "ports" / virtual interfaces in >>> >> each library >>> >> >> to plug components together rather than utilizing common platform >>> APIs.*" >>> >> >> >>> >> >> My answer to your question below would be "Yes". >>> Modularity/separation >>> >> is >>> >> >> very important in an open source community where priorities of >>> >> contributors >>> >> >> are often short-term. >>> >> >> Retention is low and therefore the acquisition costs should be >>> low >>> >> as >>> >> >> well. To me, this is the community-over-code approach. Minor >>> >> code >>> >> >> duplication is not a deal breaker.
>>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big >>> data >>> >> >> space serving their own functions. >>> >> >> >>> >> >> If you still strongly feel that the only way forward is to clone the >>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having >>> two >>> >> >> parquet-cpp repos is in no way a better approach. >>> >> >> >>> >> >> >>> >> >> >>> >> >> >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> >>> >> wrote: >>> >> >> >>> >> >>> @Antoine >>> >> >>> >>> >> >>> > By the way, one concern with the monorepo approach: it would >>> slightly >>> >> >>> increase Arrow CI times (which are already too long). >>> >> >>> >>> >> >>> A typical CI run in Arrow takes about 45 minutes: >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750 >>> >> >>> >>> >> >>> A Parquet run takes about 28 minutes: >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208 >>> >> >>> >>> >> >>> Inevitably we will need to create some kind of bot to run certain >>> >> >>> builds on demand based on commit / PR metadata or on request. >>> >> >>> >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be >>> >> >>> made substantially shorter by moving some of the slower parts (like >>> >> >>> the Python ASV benchmarks) from being tested on every commit to nightly >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also >>> >> >>> improve build times (the valgrind build could be moved to a nightly >>> >> >>> exhaustive test run). >>> >> >>> >>> >> >>> - Wes >>> >> >>> >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com >>> > >>> >> >>> wrote: >>> >> >>> >> I would like to point out that arrow's use of orc is a great >>> >> example of >>> >> >>> how it would be possible to manage parquet-cpp as a separate >>> codebase. >>> >> That >>> >> >>> gives me hope that the projects could be managed separately some >>> day.
>>> >> >>> > >>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++ >>> codebase >>> >> >>> > features several areas of duplicated logic which could be >>> replaced by >>> >> >>> > components from the Arrow platform for better platform-wide >>> >> >>> > interoperability: >>> >> >>> > >>> >> >>> > >>> >> >>> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/ >>> orc/OrcFile.hh#L37 >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh >>> >> >>> > >>> >> >>> >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/include/ >>> orc/MemoryPool.hh >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh >>> >> >>> > >>> >> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/ >>> OutputStream.hh >>> >> >>> > >>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of >>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them >>> from >>> >> >>> > leaking to third-party linkers when statically linked (ORC is only >>> >> >>> > available for static linking at the moment AFAIK). >>> >> >>> > >>> >> >>> > I question whether it's worth the community's time long term to >>> wear >>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each >>> >> >>> > library to plug components together rather than utilizing common >>> >> >>> > platform APIs. >>> >> >>> > >>> >> >>> > - Wes >>> >> >>> > >>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck < >>> >> joshuasto...@gmail.com> >>> >> >>> wrote: >>> >> >>> >> Your point about the constraints of the ASF release process is >>> well >>> >> >>> >> taken, and as a developer who's trying to work in the current >>> >> >>> environment I >>> >> >>> >> would be much happier if the codebases were merged. The main >>> issues >>> >> I >>> >> >>> worry >>> >> >>> >> about when you put codebases like these together are: >>> >> >>> >> >>> >> >>> >> 1.
The delineation of APIs becomes blurred and the code becomes >>> too >>> >> >>> coupled >>> >> >>> >> 2. Releases of artifacts that are lower in the dependency tree are >>> >> >>> delayed >>> >> >>> >> by artifacts higher in the dependency tree >>> >> >>> >> >>> >> >>> >> If the project/release management is structured well and someone >>> >> keeps >>> >> >>> an >>> >> >>> >> eye on the coupling, then I don't have any concerns. >>> >> >>> >> >>> >> >>> >> I would like to point out that arrow's use of orc is a great >>> >> example of >>> >> >>> how >>> >> >>> >> it would be possible to manage parquet-cpp as a separate >>> codebase. >>> >> That >>> >> >>> >> gives me hope that the projects could be managed separately some >>> >> day. >>> >> >>> >> >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney < >>> wesmck...@gmail.com> >>> >> >>> wrote: >>> >> >>> >> >>> >> >>> >>> hi Josh, >>> >> >>> >>> >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow >>> and >>> >> >>> tying >>> >> >>> >>> them together seems like the wrong choice. >>> >> >>> >>> >>> >> >>> >>> Apache is "Community over Code"; right now it's the same people >>> >> >>> >>> building these projects -- my argument (which I think you agree >>> >> with?) >>> >> >>> >>> is that we should work more closely together until the community >>> >> grows >>> >> >>> >>> large enough to support a larger-scope process than we have now. >>> As >>> >> >>> >>> you've seen, our process isn't serving developers of these >>> >> projects. >>> >> >>> >>> >>> >> >>> >>> > I also think build tooling should be pulled into its own >>> >> codebase. >>> >> >>> >>> >>> >> >>> >>> I don't see how this can possibly be practical taking into >>> >> >>> >>> consideration the constraints imposed by the combination of the >>> >> GitHub >>> >> >>> >>> platform and the ASF release process. I'm all for being >>> idealistic, >>> >> >>> >>> but right now we need to be practical.
Unless we can devise a >>> >> >>> >>> practical procedure that can accommodate at least 1 patch per >>> day >>> >> >>> >>> which may touch both code and build system simultaneously >>> without >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how >>> we >>> >> can >>> >> >>> >>> move forward. >>> >> >>> >>> >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases >>> >> in the >>> >> >>> >>> short term with the express purpose of separating them in the >>> near >>> >> >>> term. >>> >> >>> >>> >>> >> >>> >>> I would agree but only if separation can be demonstrated to be >>> >> >>> >>> practical and result in net improvements in productivity and >>> >> community >>> >> >>> >>> growth. I think experience has clearly demonstrated that the >>> >> current >>> >> >>> >>> separation is impractical, and is causing problems. >>> >> >>> >>> >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider >>> >> >>> >>> development process and ASF releases separately. My argument is >>> as >>> >> >>> >>> follows: >>> >> >>> >>> >>> >> >>> >>> * Monorepo for development (for practicality) >>> >> >>> >>> * Releases structured according to the desires of the PMCs >>> >> >>> >>> >>> >> >>> >>> - Wes >>> >> >>> >>> >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck < >>> >> joshuasto...@gmail.com >>> >> >>> > >>> >> >>> >>> wrote: >>> >> >>> >>> > I recently worked on an issue that had to be implemented in >>> >> >>> parquet-cpp >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow >>> >> (ARROW-2585, >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and >>> >> hard to >>> >> >>> work >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp >>> >> (created on >>> >> >>> May >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was >>> >> recently >>> >> >>> >>> merged. 
>>> >> >>> >>> > I couldn't even address any CI issues in the PR because the >>> >> change in >>> >> >>> >>> arrow >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the >>> >> >>> >>> run_clang_format.py >>> >> >>> >>> > script in the arrow project only to find out later that there >>> >> was an >>> >> >>> >>> exact >>> >> >>> >>> > copy of it in parquet-cpp. >>> >> >>> >>> > >>> >> >>> >>> > However, I don't think merging the codebases makes sense in >>> the >>> >> long >>> >> >>> >>> term. >>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow >>> and >>> >> >>> tying >>> >> >>> >>> them >>> >> >>> >>> > together seems like the wrong choice. There will be other >>> formats >>> >> >>> that >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. - >>> Orc), >>> >> so I >>> >> >>> >>> don't >>> >> >>> >>> > see why parquet should be special. I also think build tooling >>> >> should >>> >> >>> be >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of >>> >> >>> developing >>> >> >>> >>> open >>> >> >>> >>> > source C/C++ projects that way and made projects like >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a >>> >> good >>> >> >>> >>> > counter-example since there have been lots of successful open >>> >> source >>> >> >>> >>> > projects that have used nightly build systems that pinned >>> >> versions of >>> >> >>> >>> > dependent software. >>> >> >>> >>> > >>> >> >>> >>> > That being said, I think it makes sense to merge the codebases >>> >> in the >>> >> >>> >>> short >>> >> >>> >>> > term with the express purpose of separating them in the near >>> >> term. >>> >> >>> My >>> >> >>> >>> > reasoning is as follows. By putting the codebases together, >>> you >>> >> can >>> >> >>> more >>> >> >>> >>> > easily delineate the boundaries between the API's with a >>> single >>> >> PR. 
>>> >> >>> >>> Second, >>> >> >>> >>> > it will force the build tooling to converge instead of >>> diverge, >>> >> >>> which has >>> >> >>> >>> > already happened. Once the boundaries and tooling have been >>> >> sorted >>> >> >>> out, >>> >> >>> >>> it >>> >> >>> >>> > should be easy to separate them back into their own codebases. >>> >> >>> >>> > >>> >> >>> >>> > If the codebases are merged, I would ask that the C++ >>> codebases >>> >> for >>> >> >>> arrow >>> >> >>> >>> > be separated from other languages. Looking at it from the >>> >> >>> perspective of >>> >> >>> >>> a >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a >>> large >>> >> tax >>> >> >>> to >>> >> >>> >>> pay >>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the >>> >> 0.10.0 >>> >> >>> >>> > release of arrow, many of which were holding up the release. I >>> >> hope >>> >> >>> that >>> >> >>> >>> > seems like a reasonable compromise, and I think it will help >>> >> reduce >>> >> >>> the >>> >> >>> >>> > complexity of the build/release tooling. >>> >> >>> >>> > >>> >> >>> >>> > >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning < >>> >> ted.dunn...@gmail.com> >>> >> >>> >>> wrote: >>> >> >>> >>> > >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney < >>> >> wesmck...@gmail.com> >>> >> >>> >>> wrote: >>> >> >>> >>> >> >>> >> >>> >>> >> > >>> >> >>> >>> >> > > The community will be less willing to accept large >>> >> >>> >>> >> > > changes that require multiple rounds of patches for >>> >> stability >>> >> >>> and >>> >> >>> >>> API >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS >>> >> >>> community >>> >> >>> >>> took >>> >> >>> >>> >> a >>> >> >>> >>> >> > > significantly long time for the very same reason. >>> >> >>> >>> >> > >>> >> >>> >>> >> > Please don't use bad experiences from another open source >>> >> >>> community as >>> >> >>> >>> >> > leverage in this discussion. 
I'm sorry that things didn't >>> go >>> >> the >>> >> >>> way >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct >>> community >>> >> which >>> >> >>> >>> >> > happens to operate under a similar open governance model. >>> >> >>> >>> >> >>> >> >>> >>> >> >>> >> >>> >>> >> There are some more radical and community building options as >>> >> well. >>> >> >>> Take >>> >> >>> >>> >> the subversion project as a precedent. With subversion, any >>> >> Apache >>> >> >>> >>> >> committer can request and receive a commit bit on some large >>> >> >>> fraction of >>> >> >>> >>> >> subversion. >>> >> >>> >>> >> >>> >> >>> >>> >> So why not take this a bit further and give every parquet >>> >> committer >>> >> >>> a >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class >>> >> committers in >>> >> >>> >>> Arrow? >>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who >>> >> asks >>> >> >>> will >>> >> >>> >>> be >>> >> >>> >>> >> given committer status in Arrow. >>> >> >>> >>> >> >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet >>> >> committers >>> >> >>> >>> can't be >>> >> >>> >>> >> worried at that point whether their patches will get merged; >>> >> they >>> >> >>> can >>> >> >>> >>> just >>> >> >>> >>> >> merge them. Arrow shouldn't worry much about inviting in the >>> >> >>> Parquet >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on >>> parquet so >>> >> >>> why not >>> >> >>> >>> >> invite them in? >>> >> >>> >>> >> >>> >> >>> >>> >>> >> >>> >>> >> >> >>> >> >> >>> >> >> -- >>> >> >> regards, >>> >> >> Deepak Majeti >>> >> >>> > >>> > >>> > -- >>> > regards, >>> > Deepak Majeti >>>