Re: Recruiting more maintainers for Apache Arrow
For what it's worth, this email thread and your summary writeup, Wes, are a significant call to action on their own. I've been passive, not by choice, but by policy. Given the significance and need of this project, I'll see what I can do on my side. It will be at least a week given the US holiday.

Donald E. Foss
Re: Recruiting more maintainers for Apache Arrow
hi Antoine,

On Sat, Jun 30, 2018 at 2:35 PM, Antoine Pitrou wrote:
>
> Hi Wes,
>
>> I'm not sure what's the best way to address this problem. The quality
>> of our code review has declined at times as we struggle to keep up
>> with the flow of patches -- I don't think this is good. Having the
>> patch queue pile up isn't great either.
>
> I'd like to do more reviews but due to the breadth of topics and
> technologies in our code base I don't feel competent for many of the PRs
> that are being posted.

You are one of the top 3 maintainers (by # of patches merged) in 2018, and the newest committer, so there is no need to apologize for anything.

>
> For example, on a Rust PR I may do a brief review of concepts, APIs or
> general cleanliness, but not much more.
>
>> Personally, I'm having a
>> difficult time balancing project maintenance and patch authoring,
>> particularly in the last 6 months.
>
> I think it's ok to spend most of your time on reviewing and project
> maintenance.

That's what I will do for a while, but honestly it is creating a lot of stress for me because we are not progressing very quickly towards a feature-complete iteration of the columnar format and the ability to do a 1.0 release. If I were able to spend more time writing patches, I feel I could put more pressure on the project to reach that point sooner.

>
>> Any thoughts about how we can grow the maintainership? Somehow we need
>> to reach ~5-6 core maintainers over the next year.
>
> Or more of them, if we want all topics to be covered by at least 1-2
> maintainers.

Agreed. As an example, Kou has done an excellent job maintaining the C/GLib subproject and has been super responsive dealing with debugging and packaging / release management issues.

>
> Regards
>
> Antoine.
Re: Recruiting more maintainers for Apache Arrow
hi Marco,

some comments inline

On Sat, Jun 30, 2018 at 2:15 PM, Marco Neumann wrote:
> Hey,
>
> first of all, thanks a lot for your, Uwe's, the mergers' and contributors'
> work. Now, to the maintainer problem:
>
> # Arrow as "a library"
> One thing that makes Arrow special is that it is not a single, but many
> libraries (one for each language) and many of them are not only a
> binding to a C/C++ lib, but partly a complete re-implementation of the
> protocol, e.g.:
>
> - C++: one core, but also contains Python specialties
> - Java: another core
> - Rust: yet another core
> - Python: a binding to C++ but also a lot more stuff because of Pandas
> ...
>
> And you two are maintaining all of them and I doubt that you have the
> capacities and knowledge to do this at the desired level of quality
> (which is natural, not a personal issue or offense). So this I would
> call "pseudo-maintenance", since you're solely the gatekeeper that does
> some shallow reviewing and has the burden to do the housekeeping and
> the merging. So why accept these language bindings in the first
> place without bringing a core maintainer in place? For example, let's
> say someone proposes a binding to Haskell now. That should not be
> accepted as part of the official Apache implementation without a
> dedicated maintainer (ideally the PR-author would be that person, but
> there may be others who step up).

The most development activity, and where we have the most need of help, is in C++ and Python. The other area is in dev/CI infrastructure and release management. We're falling behind on implementation and design work involving Java-land (I have been trying for about a year to hammer down an improved Interval type), but that's a separate problem.

We are about to reach a point (particularly if Gandiva becomes part of Apache Arrow) where more languages will become dependent on the C++ library. This makes the need for more C++ maintainers even more urgent.

I think the other libraries have done a good job of self-managing their code (e.g. Java, JavaScript), and I frequently merge patches when there is a +1 or some other consensus.

>
> Right now, it might be too late to remove some of the incomplete / WIP
> implementations that don't have a core maintainer though.

Honestly, the incomplete/WIP projects are not causing any maintenance burden. It's the main projects and their development lifecycle that is creating a lot of work.

>
> # GitHub
> Another special thing to consider is that Arrow is (ab)using GitHub as
> a code hosting platform. Even as a contributor, this has obvious bad
> uncool consequences:

I think these issues are red herrings. If maintainers are more motivated by the gamification of their open source contributions rather than the health and success of the project, I really question how valuable of a maintainer they are.

>
> - you have yet another issue hosting system to log in

I strongly dispute the notion that using JIRA is a deterrent to maintainers. If anything, it's a filter for drive-by contributors and unserious maintainers. I say this as the project's primary JIRA gardener.

> - there is yet another information channel to keep track of (this ML
> for example, which has a semi-informative web interface telling you
> can only login using Google but does not tell you how to subscribe to
> the list)
> - links to issues don't work in the known magic way

I think these things might deter passers-by, but I don't see why they would be a problem for someone who is concerned with the health of the project. As the primary maintainer of the project, these things don't impact me in any way.

> - you're merging the PRs by closing them; which is by all means a not
> very nice way because it does not reflect the contributor's work in
> the project overview and personal profiles, but exactly this is a
> large part of the GitHub community (btw: merging PRs without using
> GitHub's merge button IS possible as bors/bors-ng prove)

For each patch you contribute, you get one contribution "point" on GitHub, but it won't show that you have a PR "merged". I don't see why we should have to comply with GitHub's gamified approach to open source.

>
> So as a potential maintainer, this is already a bummer, since I know
> that there are things less comfortable than the system I would get from
> any normal GitHub or Gitlab project.
>
> I'm not really sure how to solve this or if it should be solved (read
> about the laziness aspect in "Contribution VS Maintenance" below)

I don't mean to be too dismissive of these concerns (they are common; people have a difficult time with change) -- I've been long critical of people concerned with their "GitHub High Score". See some writing on this from a while ago:

http://wesmckinney.com/blog/github-open-source-contributions/

>
> # Time / Payment
> Yes, this is indeed a big issue. From what I can tell from the open
> source projects I was involved in is that for large contributor crowds,
> you
Re: Recruiting more maintainers for Apache Arrow
Hey,

first of all, thanks a lot for your, Uwe's, the mergers' and contributors' work. Now, to the maintainer problem:

# Arrow as "a library"
One thing that makes Arrow special is that it is not a single, but many libraries (one for each language) and many of them are not only a binding to a C/C++ lib, but partly a complete re-implementation of the protocol, e.g.:

- C++: one core, but also contains Python specialties
- Java: another core
- Rust: yet another core
- Python: a binding to C++ but also a lot more stuff because of Pandas
...

And you two are maintaining all of them and I doubt that you have the capacities and knowledge to do this at the desired level of quality (which is natural, not a personal issue or offense). So this I would call "pseudo-maintenance", since you're solely the gatekeeper that does some shallow reviewing and has the burden to do the housekeeping and the merging. So why accept these language bindings in the first place without bringing a core maintainer in place? For example, let's say someone proposes a binding to Haskell now. That should not be accepted as part of the official Apache implementation without a dedicated maintainer (ideally the PR-author would be that person, but there may be others who step up).

Right now, it might be too late to remove some of the incomplete / WIP implementations that don't have a core maintainer though.

# GitHub
Another special thing to consider is that Arrow is (ab)using GitHub as a code hosting platform. Even as a contributor, this has obvious bad uncool consequences:

- you have yet another issue hosting system to log in
- links to issues don't work in the known magic way
- you're merging the PRs by closing them; which is by all means a not very nice way because it does not reflect the contributor's work in the project overview and personal profiles, but exactly this is a large part of the GitHub community (btw: merging PRs without using GitHub's merge button IS possible as bors/bors-ng prove)

So as a potential maintainer, this is already a bummer, since I know that there are things less comfortable than the system I would get from any normal GitHub or Gitlab project.

I'm not really sure how to solve this or if it should be solved (read about the laziness aspect in "Contribution VS Maintenance" below)

# Time / Payment
Yes, this is indeed a big issue. From what I can tell from the open source projects I was involved in is that for large contributor crowds, you normally have full/half-time positions in place for the core maintainer (look at the Mozilla projects, the Blender Foundation, Gnome / Red Hat). So at one point I think maintaining isn't a part time / hobby thing anymore (w/o downgrading the hard work of Hobby-contributors, in contrast). I don't have a link at hand, but I recall some discussion about GitHub and its importance for hiring (since it acts as a CV) after MS bought it, and some of the responses are "doing all this work in your free time is a privilege of wealthy, mostly-white men", which without signing this statement in this really bare form already shows a problem of open source world.

# Contribution VS Maintenance
The very "nice" thing about patch/PR contribution is that you do your work and then you can walk away and it's the maintainer's problem to release the artifact, upgrade/migrate your code and ensure that the tests you've written never break. It's comfortable. Being a maintainer means all the opposite things. And in the end, you get blamed for not supporting certain features (see the open source paragraph here https://blog.ghost.org/5/ ) or for security disasters (remember the OpenSSL disaster).

I think together with the previous point this means, we have to get companies to pay for that work, and not just dump their features to an OSS repo.

# Path to Maintainership
So I think (from my narrow point of view!) that many people expect that the path from "outsider" to "maintainer" takes the route over "a lot of patch/PR contributions". If I'm reading your mail right, that is not necessarily the case for Apache projects and I think that's great. The "review PRs" path sounds great, but I think GitHub or any platform I'm aware of doesn't do a good job in getting people to do so. I mean, I see a PR and I can leave a review, but for me it is not really clear what consequences this has (naturally, random people don't have a veto on changes). So I can jump in when I think something is wrong, but I cannot approve a PR. This makes sense, but it poses the question of "how?!". I mean, it is pretty clear on how to become a patch/PR contributor, but it is not clear on how to become a maintainer, at least not in an easy way. (I'm sure it's written down somewhere).

So, overall I think a clear Call for Action at the top of the README could help. Like "Hey, we're looking for maintainers, you could start by reviewing some PRs and after some reviews maintainers will just be the last gatekeeper and after some more time,
Re: Recruiting more maintainers for Apache Arrow
Hi Wes,

> I'm not sure what's the best way to address this problem. The quality
> of our code review has declined at times as we struggle to keep up
> with the flow of patches -- I don't think this is good. Having the
> patch queue pile up isn't great either.

I'd like to do more reviews but due to the breadth of topics and technologies in our code base I don't feel competent for many of the PRs that are being posted.

For example, on a Rust PR I may do a brief review of concepts, APIs or general cleanliness, but not much more.

> Personally, I'm having a
> difficult time balancing project maintenance and patch authoring,
> particularly in the last 6 months.

I think it's ok to spend most of your time on reviewing and project maintenance.

> Any thoughts about how we can grow the maintainership? Somehow we need
> to reach ~5-6 core maintainers over the next year.

Or more of them, if we want all topics to be covered by at least 1-2 maintainers.

Regards

Antoine.
Re: Recruiting more maintainers for Apache Arrow
One of the things I’ve started doing in the Spark project is live code reviews to encourage other folks to get involved in the review process and help it seem more achievable (see https://www.youtube.com/playlist?list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw ). Another that I think has helped us is making it clear one of the steps to becoming a committer (something often valued by corporate employers) is being involved in the review process.

I don’t know how much this applies, but some of the committers have also found our PR dashboard, which gives a view of PRs that are ready to merge and organized by area, to be helpful (see http://spark-prs.appspot.com ).

YMMV of course, but this is a problem which I spend a lot of time thinking about (only sometimes with answers) so really interested to see where the discussion goes. I gave a somewhat related talk (Dealing with Contributor Overload) at FOSS Backstage recently: https://youtu.be/XS8cTLAuHUw

I’m not really all that involved with the Arrow project but if folks would be open to it I’d be happy to add it to my list of projects I do livestream reviews with.

On Sat, Jun 30, 2018 at 7:58 AM Wes McKinney wrote:

> hi folks,
>
> Arrow has grown by leaps and bounds over the last 2.5 years. We are
> approaching our 2000th patch and on track to surpass 200 unique
> contributors by year end.
>
> All this contribution growth is great, but it has a hidden cost: the
> maintenance. The burden of maintaining the project, particularly
> reviewing and merging patches, has fallen on a very small number of
> people. From the commit logs, we can see how many patches each
> committer has merged:
>
> $ git shortlog -csn d5aa7c46692474376a3c31704cfc4783c86338f2..master
>   1289  Wes McKinney
>    268  Uwe L. Korn
>     74  Korn, Uwe
>     54  Antoine Pitrou
>     52  Julien Le Dem
>     39  Philipp Moritz
>     18  Kouhei Sutou
>     18  Steven Phillips
>     13  Bryan Cutler
>     11  Jacques Nadeau
>     10  Phillip Cloud
>      8  Brian Hulette
>      5  Robert Nishihara
>      5  adeneche
>      4  GitHub
>      3  Sidd
>      3  siddharth
>      1  AbdelHakim Deneche
>      1  Your Name Here
>
> So Uwe and I have merged ~84% of the patches in the project so far.
> This isn't a completely accurate reflection of the maintainer burden,
> since many others contribute to code reviews and other aspects of
> patch maintenance, and you have to be a committer to earn a place on
> this list.
>
> I'm not sure what's the best way to address this problem. The quality
> of our code review has declined at times as we struggle to keep up
> with the flow of patches -- I don't think this is good. Having the
> patch queue pile up isn't great either. Personally, I'm having a
> difficult time balancing project maintenance and patch authoring,
> particularly in the last 6 months.
>
> Unfortunately, many people believe that writing patches is the primary
> mode of contribution to an open source project. Apache projects
> explicitly state that non-patch contributions are valued in earning
> karma (committership and PMC membership). We're starting to have more
> corporate contributors come out of the woodwork, and while it's great
> for contributors to be paid to write patches for the project, they are
> rarely given the time and space to contribute meaningfully to
> maintenance.
>
> Any thoughts about how we can grow the maintainership? Somehow we need
> to reach ~5-6 core maintainers over the next year.
>
> Thanks,
> Wes
>

--
Twitter: https://twitter.com/holdenkarau
[jira] [Created] (ARROW-2771) [JS] Add row proxy object accessor
Brian Hulette created ARROW-2771:
------------------------------------

    Summary: [JS] Add row proxy object accessor
    Key: ARROW-2771
    URL: https://issues.apache.org/jira/browse/ARROW-2771
    Project: Apache Arrow
    Issue Type: Improvement
    Components: JavaScript
    Reporter: Brian Hulette
    Assignee: Brian Hulette

The {{Table}} class would be much easier to interact with if it returned familiar JavaScript objects representing a row. As Jeff Heer [demonstrated|https://beta.observablehq.com/@jheer/from-apache-arrow-to-javascript-objects] it's possible to create JS Proxy objects that read directly from Arrow memory. We should generate these types of objects in {{Table.get}} and in the {{Table}} iterator.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
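The row-accessor idea in this issue can be illustrated outside JavaScript as well. The following is a minimal Python analog, not the Arrow JS API: `ColumnarTable` and `RowProxy` are hypothetical names, and plain Python lists stand in for Arrow memory. The proxy keeps only a table reference and a row index, and looks each value up in its column at access time instead of copying the row.

```python
class RowProxy:
    """Lazy view of one row (hypothetical; reads column storage on access)."""

    def __init__(self, table, index):
        self._table = table
        self._index = index

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for column names.
        try:
            column = self._table.columns[name]
        except KeyError:
            raise AttributeError(name) from None
        return column[self._index]


class ColumnarTable:
    """Hypothetical column-oriented table: one list per column."""

    def __init__(self, columns):
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.num_rows = lengths.pop() if lengths else 0

    def get(self, index):
        if not 0 <= index < self.num_rows:
            raise IndexError(index)
        return RowProxy(self, index)

    def __iter__(self):
        # The iterator yields row proxies, mirroring the request in the issue.
        return (RowProxy(self, i) for i in range(self.num_rows))


table = ColumnarTable({"id": [1, 2, 3], "name": ["a", "b", "c"]})
row = table.get(1)
print(row.id, row.name)
```

Because the proxy does not copy anything, a later change to the underlying column is visible through a row object that was created earlier; that is the property that makes this approach cheap for large tables.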
[jira] [Created] (ARROW-2770) [Python] Account for conda-forge compiler migration in conda recipes
Wes McKinney created ARROW-2770:
-----------------------------------

    Summary: [Python] Account for conda-forge compiler migration in conda recipes
    Key: ARROW-2770
    URL: https://issues.apache.org/jira/browse/ARROW-2770
    Project: Apache Arrow
    Issue Type: Bug
    Components: Packaging
    Reporter: Wes McKinney
    Fix For: 0.10.0

See https://github.com/conda-forge/arrow-cpp-feedstock/pull/53
Recruiting more maintainers for Apache Arrow
hi folks,

Arrow has grown by leaps and bounds over the last 2.5 years. We are approaching our 2000th patch and on track to surpass 200 unique contributors by year end.

All this contribution growth is great, but it has a hidden cost: the maintenance. The burden of maintaining the project, particularly reviewing and merging patches, has fallen on a very small number of people. From the commit logs, we can see how many patches each committer has merged:

$ git shortlog -csn d5aa7c46692474376a3c31704cfc4783c86338f2..master
  1289  Wes McKinney
   268  Uwe L. Korn
    74  Korn, Uwe
    54  Antoine Pitrou
    52  Julien Le Dem
    39  Philipp Moritz
    18  Kouhei Sutou
    18  Steven Phillips
    13  Bryan Cutler
    11  Jacques Nadeau
    10  Phillip Cloud
     8  Brian Hulette
     5  Robert Nishihara
     5  adeneche
     4  GitHub
     3  Sidd
     3  siddharth
     1  AbdelHakim Deneche
     1  Your Name Here

So Uwe and I have merged ~84% of the patches in the project so far. This isn't a completely accurate reflection of the maintainer burden, since many others contribute to code reviews and other aspects of patch maintenance, and you have to be a committer to earn a place on this list.

I'm not sure what's the best way to address this problem. The quality of our code review has declined at times as we struggle to keep up with the flow of patches -- I don't think this is good. Having the patch queue pile up isn't great either. Personally, I'm having a difficult time balancing project maintenance and patch authoring, particularly in the last 6 months.

Unfortunately, many people believe that writing patches is the primary mode of contribution to an open source project. Apache projects explicitly state that non-patch contributions are valued in earning karma (committership and PMC membership). We're starting to have more corporate contributors come out of the woodwork, and while it's great for contributors to be paid to write patches for the project, they are rarely given the time and space to contribute meaningfully to maintenance.

Any thoughts about how we can grow the maintainership? Somehow we need to reach ~5-6 core maintainers over the next year.

Thanks,
Wes
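
The merge share quoted in this message can be recomputed from the shortlog figures. Below is a minimal Python sketch, assuming the counts listed in the message are complete and that "Uwe L. Korn" and "Korn, Uwe" are the same committer; the `ALIASES` map is an assumption made for this illustration, not something git knows about (a `.mailmap` file would be the usual way to express it).

```python
from collections import Counter

# Counts copied from the `git shortlog -csn` output in the message.
SHORTLOG = """\
1289 Wes McKinney
268 Uwe L. Korn
74 Korn, Uwe
54 Antoine Pitrou
52 Julien Le Dem
39 Philipp Moritz
18 Kouhei Sutou
18 Steven Phillips
13 Bryan Cutler
11 Jacques Nadeau
10 Phillip Cloud
8 Brian Hulette
5 Robert Nishihara
5 adeneche
4 GitHub
3 Sidd
3 siddharth
1 AbdelHakim Deneche
1 Your Name Here
"""

# Assumption: both spellings refer to one person.
ALIASES = {"Korn, Uwe": "Uwe L. Korn"}


def parse_shortlog(text):
    """Return a Counter mapping committer name to merged-patch count."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        number, name = line.split(None, 1)
        counts[ALIASES.get(name, name)] += int(number)
    return counts


def merge_share(counts, names):
    """Fraction of all merges performed by the given committers."""
    return sum(counts[n] for n in names) / sum(counts.values())


counts = parse_shortlog(SHORTLOG)
share = merge_share(counts, ["Wes McKinney", "Uwe L. Korn"])
print(f"total merges: {sum(counts.values())}")
print(f"top-two share: {share:.1%}")
```

With the two Korn entries merged, the listed figures give roughly 87%; counting only the "Uwe L. Korn" entry gives about 83%. The exact percentage therefore depends on how committer identities are grouped, which is worth keeping in mind when reading the ~84% figure.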
[jira] [Created] (ARROW-2769) [Python] Deprecate and rename add_metadata methods
Krisztian Szucs created ARROW-2769:
--------------------------------------

    Summary: [Python] Deprecate and rename add_metadata methods
    Key: ARROW-2769
    URL: https://issues.apache.org/jira/browse/ARROW-2769
    Project: Apache Arrow
    Issue Type: Improvement
    Reporter: Krisztian Szucs

Deprecate and replace `pyarrow.Field.add_metadata` (and other similarly named methods) with replace_metadata, set_metadata or with_metadata. Knowing Spark's immutable API, I would have chosen with_metadata, but I guess this is probably not the naming the average Python user would expect.
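
The naming question comes down to whether the method reads as mutating or non-mutating. Here is a minimal sketch of the immutable `with_` style, using a hypothetical `Field` dataclass rather than pyarrow's actual class: `with_metadata` returns a new object and leaves the original untouched. Whether the new metadata should replace or be merged into the existing metadata is a separate design choice; this sketch replaces it.

```python
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class Field:
    """Hypothetical immutable field; not pyarrow's implementation."""

    name: str
    type: str
    metadata: Dict[str, str] = field(default_factory=dict)

    def with_metadata(self, metadata):
        # Copy the mapping so later changes by the caller do not leak in,
        # and return a new Field instead of modifying this one.
        return replace(self, metadata=dict(metadata))


plain = Field("ts", "int64")
tagged = plain.with_metadata({"unit": "ms"})
print(plain.metadata, tagged.metadata)
```

A name such as `add_metadata` suggests in-place modification, which is misleading when the object is immutable and a new one is returned; the `with_` prefix signals the copy-on-change behavior.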