Here is the thread we voted on at the time: https://lists.apache.org/thread/gkvbm6yyly1r4cg3f6xtnqkjz6ogn6o2 and the thread calling the result: https://lists.apache.org/thread/76wzx2lsbwjl363bg066g8kdsocd03rw
This thread calls for giving Parquet committers access to this part of the repo so that they can contribute to this code base, and asks for good collaboration between Parquet and Arrow committers here.

There was, and still is, a lot of overlap between the Parquet and Arrow committers. Access control is tied to repos, which does not make this easy.

At the time, the dependency management in the C++ repos (Parquet and Arrow) and the changing APIs made things difficult, which prompted moving those two into the same repo. Now that those APIs are more stable, I do think splitting the repos would be easier; Arrow is no longer the monorepo it was at the time. Splitting would also clarify things from an access control perspective.

On Thu, May 16, 2024 at 6:41 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> > Warranted or not, there is still a perception among some that parquet
> > is closely tied to the Spark / Hadoop ecosystems,
>
> It certainly doesn't help that https://parquet.apache.org explicitly says
> it is for Hadoop: "Apache Parquet is a columnar storage format available to
> any project in the Hadoop ecosystem, regardless of the choice of data
> processing framework, data model or programming language." right on the
> front page.
>
> Shameless plug for a committer to merge my PR[1] to the site that makes it
> clearer parquet is more general.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-site/pull/59
>
> On Thu, May 16, 2024 at 9:37 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > I can't speak for others' motivations, but for me it is about better
> > communicating parquet as a format specification, with a number of
> > implementations in different languages, as opposed to a specific Java
> > implementation.
> > Perhaps something closer to the approach of arrow, where
> > there is a family of first-party implementations, across a number of
> > different languages, that all work together to ensure interoperability,
> > evolve the specification, etc... Warranted or not, there is still a
> > perception among some that parquet is closely tied to the Spark / Hadoop
> > ecosystems, and only useful as a means of interoperating with said
> > ecosystems.
> >
> > On 16/05/2024 14:11, Rok Mihevc wrote:
> > > What are the benefits of a parquet implementation being part of Apache
> > > Parquet vs another Apache project vs something else entirely?
> > > Being part of Apache org? Branding? Voting rights?
> > > If motivations are clear, solutions might be more readily apparent.
> > >
> > > Rok
> > >
> > > On Thu, May 16, 2024 at 2:36 PM Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.invalid> wrote:
> > >
> > >> I'm curious where the other arrow parquet implementations fit into this,
> > >> if at all? For context, the original Rust implementation was largely the
> > >> work of Chao Sun, who I believe to be a parquet PMC member, but it was
> > >> then donated to the arrow project, and has primarily been developed and
> > >> maintained by individuals affiliated with the arrow project since then,
> > >> myself included. I'm not suggesting all parquet implementations
> > >> necessarily need to be governed by the parquet PMC, but perhaps what
> > >> ever compromise we devise for parquet-cpp might equally be applied to
> > >> the other parquet projects that fall under the arrow umbrella?
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> On 16/05/2024 13:26, Uwe L. Korn wrote:
> > >>> I would actually consider someone who contributes to both communities at
> > >>> the same time to be a worthwhile addition to both projects. In my active
> > >>> years, we have mostly voted people into both projects; the order was not
> > >>> clear, though.
> > >>> Being a committer/PMC means that you want to bring the community around
> > >>> a project forward in the Apache way (with parquet-cpp this is given as it
> > >>> is part of the parquet community and also still in a project that is
> > >>> residing within the Apache org).
> > >>>> he told me that the contribution to
> > >>>> parquet-cpp is no longer considered when promoting committers to
> > >>>> Apache Parquet PMC.
> > >>> As a Parquet PMC, I would strongly object to that and would be
> > >>> supportive of also making them a Parquet committer/PMC.
> > >>> Best
> > >>> Uwe
> > >>>
> > >>> On Thu, May 16, 2024, at 2:19 PM, Gang Wu wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I share the same feeling with Antoine that parquet-cpp seems to be fully
> > >>>> governed by Apache Arrow PMC, not the Apache Parquet PMC. I have
> > >>>> once discussed this with Xinli and he told me that the contribution to
> > >>>> parquet-cpp is no longer considered when promoting committers to
> > >>>> Apache Parquet PMC.
> > >>>>
> > >>>> Best,
> > >>>> Gang
> > >>>>
> > >>>> On Thu, May 16, 2024 at 4:29 PM Antoine Pitrou <anto...@python.org> wrote:
> > >>>>> On Thu, 16 May 2024 10:08:42 +0200
> > >>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
> > >>>>>> On Tue, May 14, 2024, at 6:30 PM, Antoine Pitrou wrote:
> > >>>>>>> AFAIK, the only Parquet implementation under the Apache Parquet project
> > >>>>>>> is parquet-mr :-)
> > >>>>>> This is not true. The parquet-cpp that resides in the arrow repository
> > >>>>>> is still controlled by the Apache Parquet PMC. Back then, we only merged
> > >>>>>> the codebases but kept control of it with the Apache Parquet project. I
> > >>>>>> know, it is hard to understand, but at least I have never seen a vote that
> > >>>>>> would move it out of the Apache Parquet's project "control".
> > >>>>>
> > >>>>> Ahah. Unfortunately, this doesn't match actual community practices.
> > >>>>> For example, when it is decided to give (Arrow) commit rights to a frequent
> > >>>>> Parquet C++ contributor, that decision is made among the Arrow PMC, not
> > >>>>> the Parquet PMC.
> > >>>>>
> > >>>>> Perhaps there would be value in aligning the legal situation on the
> > >>>>> _de facto_ situation?
> > >>>>>
> > >>>>> Regards
> > >>>>>
> > >>>>> Antoine.
> > >>>>>
> > >>>>>
> > >>>>>> Best
> > >>>>>> Uwe
> > >>>>>>> On Tue, 14 May 2024 10:58:58 +0200
> > >>>>>>> Rok Mihevc <rok.mih...@gmail.com> wrote:
> > >>>>>>>> Second Raphael's point.
> > >>>>>>>> Would it be reasonable to say specification change requires implementation
> > >>>>>>>> in two parquet implementations within Apache Parquet project?
> > >>>>>>>>
> > >>>>>>>> Rok
> > >>>>>>>>
> > >>>>>>>> On Tue, May 14, 2024 at 10:50 AM Gang Wu <ustcwg-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > >>>>>>>>> IMHO, it looks more reasonable if a reference implementation is required
> > >>>>>>>>> to support most (not all) elements from the specification.
> > >>>>>>>>>
> > >>>>>>>>> Another question is: should we discuss (and vote for) each candidate
> > >>>>>>>>> one by one? We can start with parquet-mr which is most well-known
> > >>>>>>>>> implementation.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Gang
> > >>>>>>>>>
> > >>>>>>>>> On Tue, May 14, 2024 at 4:41 PM Raphael Taylor-Davies
> > >>>>>>>>> <r.taylordav...@googlemail.com.invalid> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Potentially it would be helpful to flip the question around.
> > >>>>>>>>>> As Andrew articulates, a reference implementation is required to implement all
> > >>>>>>>>>> elements from the specification, and therefore the major consequence of
> > >>>>>>>>>> labeling parquet-mr thusly would be that any specification change would
> > >>>>>>>>>> have to be implemented within parquet-mr as part of the standardisation
> > >>>>>>>>>> process. It would be insufficient for it to be implemented in, for
> > >>>>>>>>>> example, two of the parquet implementations maintained by the arrow
> > >>>>>>>>>> project. I personally think that would be a shame and likely exclude
> > >>>>>>>>>> many people who would otherwise be interested in evolving the parquet
> > >>>>>>>>>> specification, but think that is at the core of this question.
> > >>>>>>>>>>
> > >>>>>>>>>> Kind Regards,
> > >>>>>>>>>>
> > >>>>>>>>>> Raphael
> > >>>>>>>>>>
> > >>>>>>>>>> On 13/05/2024 20:55, Andrew Lamb wrote:
> > >>>>>>>>>>> Question: Should we label parquet-mr or any other parquet implementations
> > >>>>>>>>>>> "reference implementations"?
> > >>>>>>>>>>>
> > >>>>>>>>>>> This came up as part of Vinoo's great PR to list different parquet
> > >>>>>>>>>>> reference implementations[1][2].
> > >>>>>>>>>>>
> > >>>>>>>>>>> The term "reference implementation" often has an official connotation. For
> > >>>>>>>>>>> example the wikipedia definition is "a program that implements all
> > >>>>>>>>>>> requirements from a corresponding specification. The reference
> > >>>>>>>>>>> implementation ...
> > >>>>>>>>>>> should be considered the "correct" behavior of any other
> > >>>>>>>>>>> implementation of it."[3]
> > >>>>>>>>>>>
> > >>>>>>>>>>> Given the close association of parquet-mr to the parquet standard, it is a
> > >>>>>>>>>>> natural candidate to label as "reference implementation." However, it is
> > >>>>>>>>>>> not clear to me if there is consensus that it should be thusly labeled.
> > >>>>>>>>>>> I have a strong opinion that a consensus on this question would be very
> > >>>>>>>>>>> helpful. I don't actually have a strong opinion about the answer
> > >>>>>>>>>>>
> > >>>>>>>>>>> Andrew
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1]: https://github.com/apache/parquet-site/pull/53#discussion_r1582882267
> > >>>>>>>>>>> [2]: https://github.com/apache/parquet-site/pull/53#discussion_r1598283465
> > >>>>>>>>>>> [3]: https://en.wikipedia.org/wiki/Reference_implementation