Re: [VOTE] Designating maintainers for some Spark components
+1 (binding) Ankur http://www.ankurdave.com/ On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST.
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding) For tickets that span multiple components, will they need approval from all maintainers? For example, I'm working on the Python bindings of GraphX, where code is added to both the Python and GraphX modules. Thanks, -Kushal. On Thu, Nov 6, 2014 at 12:02 AM, Ankur Dave ankurd...@gmail.com wrote: [...]
About implicit rddToPairRDDFunctions
I've seen many people ask how to convert an RDD to a PairRDDFunctions. I would like to ask a question about it. Why not put the following implicit into the package object rdd, or into object rdd?

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null) = {
  new PairRDDFunctions(rdd)
}

If so, the conversion would happen automatically, with no need to import org.apache.spark.SparkContext._ I tried to search for some discussion but found nothing. Best Regards, Shixiong Zhu
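The implicit-scope point in this question can be illustrated without Spark. Below is a minimal, hypothetical sketch in plain Scala (the names PairOps, valuesFor, and seqToPairOps are made up for illustration): a conversion defined in the companion object of the target type is already in implicit scope, so callers get the enrichment methods with no import at all -- which is effectively what moving rddToPairRDDFunctions into a package/companion object would achieve, versus today's situation where it lives in SparkContext and must be imported.

```scala
// Stand-in for PairRDDFunctions-style enrichment: extra methods that only
// make sense on sequences of key/value pairs.
class PairOps[K, V](pairs: Seq[(K, V)]) {
  def valuesFor(key: K): Seq[V] = pairs.collect { case (`key`, v) => v }
}

object PairOps {
  // Because this conversion lives in the companion object of the target type
  // PairOps, it is in the implicit scope of Seq[(K, V)] => PairOps[K, V]:
  // the compiler finds it with no import at the call site.
  implicit def seqToPairOps[K, V](s: Seq[(K, V)]): PairOps[K, V] =
    new PairOps(s)
}

object Demo extends App {
  val data = Seq("a" -> 1, "b" -> 2, "a" -> 3)
  // Compiles without any import, unlike an implicit defined in an unrelated
  // object (the SparkContext situation), which would require one.
  println(data.valuesFor("a"))  // List(1, 3)
}
```

The trade-off is discoverability: with the import, users opt in to the conversion explicitly; with implicit scope, it is always on for the target type.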
JIRA + PR backlog
(Different topic, indulge me one more reply --) Yes, the number of JIRAs/PRs closed is unprecedented too, and that deserves big praise. The project has stuck to making all changes and discussion in this public process, which is so powerful. Adjusted for the sheer inbound volume, Spark is doing a much better job than other projects; I would not hold them up as a benchmark of 'good enough', to be honest. JIRA is usually under-managed, and it's a pet issue of mine. My motive is that core contributor / committer time is very valuable and in short supply. On the one hand, we could use lots more of it to shepherd changes and fix bugs in the core that only the very experienced can. On the other hand, you all deserve time to work on your own changes, build a business, etc. So I harp on JIRA management as a way to save time:
- Merging PRs sooner means less rebasing / retesting
- Bouncing back bad PRs/JIRAs early teaches everyone what's acceptable as a good PR/JIRA and prevents the noise in the first place
- Resolving issues soon prevents duplicates from being filed
- Recording 'WontFix' resolutions early heads off repeated discussion/work on out-of-scope topics
I have more concrete ideas about managing this, but they're not for now. For now, thanks for zapping some old JIRAs this morning and for endorsing the idea of staying on top of the issue list in general. As a long-time fan, I hope I can help from the sidelines by also closing JIRAs I'm all but certain are stale, and reviewing minor PRs to clear the way for maintainers to take on the more important work. On Thu, Nov 6, 2014 at 7:21 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Several people asked about having maintainers review the PR queue for their modules regularly, and I like that idea. We have a new tool now to help with that in https://spark-prs.appspot.com.
In terms of the set of open PRs itself, it is large, but note that there are also 2800 *closed* PRs, which means we close the majority of PRs (and I don't know the exact stats, but I'd guess that 90% of those are accepted and merged). I think one problem is that with GitHub, people often develop something as a PR and have a lot of discussion on there (including whether we even want the feature). I recently updated our 'how to contribute' page to encourage opening a JIRA and having discussions on the dev list first, but I do think we need to be faster with closing ones that we don't have a plan to merge. Note that Hadoop, Hive, HBase, etc. also have about 300 issues each in the 'Patch Available' state, so this is some kind of universal constant :P. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Designating maintainers for some Spark components
+1 overall. Also +1 to Sandy's suggestion of getting build maintainers as well. On Wed, Nov 5, 2014 at 7:57 PM, Sandy Ryza sandy.r...@cloudera.com wrote: This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having consistent oversight over Maven, SBT, and dependencies would allow us to avoid subtle breakages. Component maintainers have come up several times within the Hadoop project, and I think one of the main reasons the proposals have been rejected is that, structurally, its effect is to slow down development. As you mention, this is somewhat mitigated if being a maintainer leads committers to take on more responsibility, but it might be worthwhile to draw up more specific ideas on how to combat this. E.g. do obvious changes, doc fixes, test fixes, etc. always require a maintainer? -Sandy On Wed, Nov 5, 2014 at 5:36 PM, Michael Armbrust mich...@databricks.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more.
In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc., rather than having patches slip between the cracks. We'd like to start in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows:
- Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component.
- Each component with maintainers will have at least 2 maintainers.
- Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus.
- Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API).
To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. If you'd like to see examples for this model, check out the following projects:
- CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
- Subversion: https://subversion.apache.org/docs/community-guide/roles.html
Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add,
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding) On Thu, Nov 6, 2014 at 4:02 PM, Ankur Dave ankurd...@gmail.com wrote: [...]
Re: [VOTE] Designating maintainers for some Spark components
Matei, I saw that you're listed as a maintainer for ~6 different subcomponents, and on over half of those you're only the 2nd person. My concern is that you would be stretched thin and maybe wouldn't be able to serve as a backup on all of those subcomponents. Are you planning on adding more maintainers for each subcomponent? I think it would be good to have 2 regulars + backups for each. RJ On Thu, Nov 6, 2014 at 8:48 AM, Jason Dai jason@gmail.com wrote: [...] -- em rnowl...@gmail.com c 954.496.2314
Re: [VOTE] Designating maintainers for some Spark components
+1. Tom On Wednesday, November 5, 2014 9:21 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: [...] It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g.
I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules. - Spark core public API: Matei, Patrick, Reynold - Job scheduler: Matei, Kay, Patrick - Shuffle and network: Reynold, Aaron, Matei - Block manager: Reynold, Aaron - YARN: Tom, Andrew Or - Python: Josh, Matei - MLlib: Xiangrui, Matei - SQL: Michael, Reynold - Streaming: TD, Matei - GraphX: Ankur, Joey, Reynold I'd like to formally call a [VOTE] on this model, to last 72 hours. The [VOTE] will end on Nov 8, 2014 at 6 PM PST. Matei
Re: [VOTE] Designating maintainers for some Spark components
+1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: [...]
Re: [VOTE] Designating maintainers for some Spark components
+1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: [...]
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: [...]
Re: [VOTE] Designating maintainers for some Spark components
+1 (binding). (our pull request browsing tool is open-source, by the way; contributions welcome: https://github.com/databricks/spark-pr-dashboard) On Thu, Nov 6, 2014 at 9:28 AM, Nick Pentreath nick.pentre...@gmail.com wrote: +1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: +1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. 
IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who’s responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start with in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. 
If you'd like to see examples for this model, check out the following projects: - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide - Subversion: https://subversion.apache.org/docs/community-guide/roles.html Finally, I wanted to list our current proposal for initial components and maintainers. It would be good to get feedback on other components we might add, but please note that personnel discussions (e.g. I don't think Matei should maintain *that* component) should only happen on the private list. The initial components were chosen to include all public APIs and the main core components, and the maintainers were chosen from the most active contributors to those modules. - Spark core public API: Matei, Patrick, Reynold - Job scheduler: Matei, Kay, Patrick - Shuffle and network: Reynold, Aaron, Matei - Block manager: Reynold, Aaron - YARN: Tom, Andrew Or - Python: Josh, Matei - MLlib: Xiangrui, Matei - SQL: Michael, Reynold - Streaming: TD, Matei - GraphX: Ankur, Joey, Reynold I'd like to formally call a [VOTE] on this
Re: [VOTE] Designating maintainers for some Spark components
Hi Matei, Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. But I have some questions about the implementation. Here's my understanding of the proposal: (1) The PMC votes on a list of components and their maintainers. Changes to that list require PMC approval. (2) No committer shall commit changes to a component without a +1 from a maintainer of that component. I see good reasons for #1, to help people navigate the project and identify expertise. For #2, I'd like to understand what problem it's trying to solve. Do we have rogue committers committing to areas that they don't know much about? If that's the case, we should address it directly, instead of adding new processes. To point out the obvious, it completely changes what being a committer means in Spark. Do we have clear promotion criteria from committer to maintainer? Is there a max number of maintainers per area? Currently, as committers gain expertise in new areas, they could start reviewing code in those areas and give +1. This encourages more contributions and cross-component knowledge sharing. Under the new proposal, they now have to be promoted to maintainers first. That reduces our review bandwidth. Again, if there is a quality issue with code reviews, let's talk to those committers and help them do better. There are non-process ways to solve the problem. So I think we shouldn't require maintainer +1. I do like the idea of having explicit maintainers on a volunteer basis. These maintainers should watch their jira and PR traffic, and be very active in design and API discussions. That leads to better consistency and long-term design choices. Cheers, bc On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. 
Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who’s responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. 
The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. If you'd like to see examples for this model, check out the following projects: - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
Re: [VOTE] Designating maintainers for some Spark components
Hi BC, The point is exactly to ensure that the maintainers have looked at each patch to that component and consider it to fit consistently into its architecture. The issue is not about rogue committers, it's about making sure that changes don't accidentally sneak in that we want to roll back, particularly because we have frequent releases and we guarantee API stability. This process is meant to ensure that whichever committer reviews a patch also forwards it to its maintainers. Note that any committer is able to review patches in any component. The maintainer sign-off is just a second requirement for some core components (central parts of the system and public APIs). But I expect that most maintainers will let others do the bulk of the reviewing and focus only on changes to the architecture or API. Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may not realize the implications of changing something on the Python API, etc. This is just a way to make sure that a domain expert has looked at the areas where it is most likely for something to go wrong. Matei On Nov 6, 2014, at 10:53 AM, bc Wong bcwal...@cloudera.com wrote: Hi Matei, Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. But I have some questions about the implementation. Here's my understanding of the proposal: (1) The PMC votes on a list of components and their maintainers. Changes to that list requires PMC approval. (2) No committer shall commit changes to a component without a +1 from a maintainer of that component. I see good reasons for #1, to help people navigate the project and identify expertise. For #2, I'd like to understand what problem it's trying to solve. 
Do we have rogue committers committing to areas that they don't know much about? If that's the case, we should address it directly, instead of adding new processes. To point out the obvious, it completely changes what being a committer means in Spark. Do we have clear promotion criteria from committer to maintainer? Is there a max number of maintainers per area? Currently, as committers gain expertise in new areas, they could start reviewing code in those areas and give +1. This encourages more contributions and cross-component knowledge sharing. Under the new proposal, they now have to be promoted to maintainers first. That reduces our review bandwidth. Again, if there is a quality issue with code reviews, let's talk to those committers and help them do better. There are non-process ways to solve the problem. So I think we shouldn't require maintainer +1. I do like the idea of having explicit maintainers on a volunteer basis. These maintainers should watch their jira and PR traffic, and be very active in design and API discussions. That leads to better consistency and long-term design choices. Cheers, bc On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. 
The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who’s
Implementing TinkerPop on top of GraphX
All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279 The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Implementing TinkerPop on top of GraphX
cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
Re: Implementing TinkerPop on top of GraphX
I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet based graph store. I've been working to finalize some code details and put together better code examples and documentation before I start telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
Re: Implementing TinkerPop on top of GraphX
Great stuff! I've got some thoughts about that, and I was wondering if it would first be interesting to have something like the following for spark-core (let's say): 0/ Core API offering basic (or advanced → HeLP) primitives 1/ catalyst optimizer for a text-based system (SPARQL, Cypher, custom SQL3, whatnot) 2/ adequate DSL layer on top (à la LinQ) my2¢ aℕdy ℙetrella about.me/noootsab On Thu, Nov 6, 2014 at 8:48 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet based graph store. I've been working to finalize some code details and put together better code examples and documentation before I start telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! 
As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
Re: Implementing TinkerPop on top of GraphX
I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that getting the Tinkerpop2 Gremlin pipeline to work on GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations. I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface? It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet based graph store. I've been working to finalize some code details and put together better code examples and documentation before I start telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. 
Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
Re: JIRA + PR backlog
I think better tooling will make it much easier for committers to trim the list of stale JIRA issues and PRs. Convenience enables action. - Spark PR Dashboard https://spark-prs.appspot.com/: Additional filters for stale PRs https://github.com/databricks/spark-pr-dashboard/issues/1 or PRs waiting on committer response would be great. - Stale Spark JIRA issues https://issues.apache.org/jira/issues/?filter=12329614&jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20updated%20%3C%3D%20-90d%20ORDER%20BY%20updated%20ASC: This filter is sorted by the least recently updated issues first and can be filtered additionally by component. There are many, many easy wins in this filter. Nick On Thu, Nov 6, 2014 at 7:13 AM, Sean Owen so...@cloudera.com wrote: (Different topic, indulge me one more reply --) Yes the number of JIRAs/PRs closed is unprecedented too and that deserves big praise. The project has stuck to making all changes and discussion in this public process, which is so powerful. Adjusted for the sheer inbound volume, Spark is doing a much better job than other projects; I would not hold them up as a benchmark of 'good enough', to be honest. JIRA is usually under-managed and it's a pet issue of mine. My motive is that core contributor / committer time is very valuable and in short supply. On the one hand we could use lots more of it to shepherd changes and fix bugs in the core that only the very experienced can. On the other hand, you all deserve time to work on your own changes, build a business, etc. 
So I harp on JIRA management as a way to save time: - Merging PRs sooner means less rebasing / retesting - Bouncing back bad PRs/JIRAs early teaches everyone what's acceptable as a good PR/JIRA and prevents the noise in the first place - Resolving issues soon prevents duplicates from being filed - Recording 'WontFix' resolutions early heads off repeated discussion/work on out of scope topics I have more concrete ideas about managing this but it's not for now. For now, thanks for zapping some old JIRAs this morning and for endorsing the idea of staying on top of the issue list in general. As a long-time fan I hope I can help from the sidelines by also closing JIRAs I'm all but certain are stale, and reviewing minor PRs to clear the way for maintainers to take on the more important work. On Thu, Nov 6, 2014 at 7:21 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Several people asked about having maintainers review the PR queue for their modules regularly, and I like that idea. We have a new tool now to help with that in https://spark-prs.appspot.com. In terms of the set of open PRs itself, it is large but note that there are also 2800 *closed* PRs, which means we close the majority of PRs (and I don't know the exact stats but I'd guess that 90% of those are accepted and merged). I think one problem is that with GitHub, people often develop something as a PR and have a lot of discussion on there (including whether we even want the feature). I recently updated our how to contribute page to encourage opening a JIRA and having discussions on the dev list first, but I do think we need to be faster with closing ones that we don't have a plan to merge. Note that Hadoop, Hive, HBase, etc also have about 300 issues each in the patch available state, so this is some kind of universal constant :P.
Using partitioning to speed up queries in Shark
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday) SELECT id,email,dt_cr, to_date(dt_cr) as createday where dt_cr >= unix_timestamp('2013-01-01 00:00:00') and dt_cr <= unix_timestamp('2013-12-31 23:59:59'); set exec.dynamic.partition=true; set exec.dynamic.partition.mode=nonstrict; however when I run the following basic tests I get this type of performance [localhost:1] shark select count(*) from part_customers_cached where createday >= '2014-08-01' and createday <= '2014-12-06'; 37204 Time taken (including network latency): 3.131 seconds [localhost:1] shark SELECT count(*) from customers_cached where dt_cr >= unix_timestamp('2013-08-01 00:00:00') and dt_cr <= unix_timestamp('2013-12-06 23:59:59'); 37204 Time taken (including network latency): 1.538 seconds I'm running this on a cluster with one master and two slaves and was hoping that the partitioned table would be noticeably faster but it looks as though the partitioning has slowed things down... Is this the case, or is there some additional configuration that I need to do to speed things up? Best Wishes, Gordon
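[Editor's note: a sketch of two things worth checking, using the table and column names from the message above; it assumes a Hive-compatible Shark session. Note that in Hive (and Shark, which reuses Hive's configuration) the dynamic-partition properties are normally spelled hive.exec.dynamic.partition / hive.exec.dynamic.partition.mode, not exec.dynamic.partition, and SET only affects statements issued after it:]

```sql
-- The SET commands must precede the dynamic-partition INSERT, not follow it.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

FROM customers_cached
INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday)
SELECT id, email, dt_cr, to_date(dt_cr) AS createday
WHERE dt_cr >= unix_timestamp('2013-01-01 00:00:00')
  AND dt_cr <= unix_timestamp('2013-12-31 23:59:59');

-- Partition pruning only wins when the predicate skips most partitions.
-- A four-month range still touches roughly 120 of ~365 daily partitions,
-- so the per-partition overhead can outweigh the savings; a narrow range
-- (or a single day) is where the partitioned table should pull ahead:
SELECT count(*) FROM part_customers_cached WHERE createday = '2013-08-01';
```

If range queries over several months are the common case, coarser partitions (e.g. by month) would cut the per-partition overhead.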
Re: Using partitioning to speed up queries in Shark
Did you mean to send this to the user list? This is the dev list, where we discuss things related to development on Spark itself. On Thu, Nov 6, 2014 at 5:01 PM, Gordon Benjamin gordon.benjami...@gmail.com wrote: Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday) SELECT id,email,dt_cr, to_date(dt_cr) as createday where dt_cr >= unix_timestamp('2013-01-01 00:00:00') and dt_cr <= unix_timestamp('2013-12-31 23:59:59'); set exec.dynamic.partition=true; set exec.dynamic.partition.mode=nonstrict; however when I run the following basic tests I get this type of performance [localhost:1] shark select count(*) from part_customers_cached where createday >= '2014-08-01' and createday <= '2014-12-06'; 37204 Time taken (including network latency): 3.131 seconds [localhost:1] shark SELECT count(*) from customers_cached where dt_cr >= unix_timestamp('2013-08-01 00:00:00') and dt_cr <= unix_timestamp('2013-12-06 23:59:59'); 37204 Time taken (including network latency): 1.538 seconds I'm running this on a cluster with one master and two slaves and was hoping that the partitioned table would be noticeably faster but it looks as though the partitioning has slowed things down... Is this the case, or is there some additional configuration that I need to do to speed things up? Best Wishes, Gordon
Wrong temp directory when compressing before sending text file to S3
We have some data that we are exporting from our HDFS cluster to S3 with some help from Spark. The final RDD command we run is: csvData.saveAsTextFile("s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip", classOf[GzipCodec]) We have our 'spark.local.dir' set to our large ephemeral partition on each slave (on EC2), but with compression on, intermediate files seem to be written to /tmp/hadoop-root/s3. Is this a bug in Spark or are we missing a configuration property? It's a problem for us because the root disks on EC2 XLs are small (~5GB).
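[Editor's note: /tmp/hadoop-root/s3 matches Hadoop's S3 output buffering rather than anything governed by spark.local.dir: the s3n filesystem stages each output file locally in fs.s3.buffer.dir, which defaults to ${hadoop.tmp.dir}/s3 (hence /tmp/hadoop-root/s3 when running as root). A sketch of the workaround, assuming the large ephemeral volume is mounted at /mnt (a hypothetical path):]

```scala
import org.apache.hadoop.io.compress.GzipCodec

// fs.s3.buffer.dir is read by Hadoop's s3/s3n filesystems when buffering
// uploads, independently of spark.local.dir, so it has to be set on the
// Hadoop configuration (or in core-site.xml on each slave).
sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/mnt/s3-buffer")

csvData.saveAsTextFile(
  "s3n://data/mess/2014/11/dump-oct-30-to-nov-5-gzip",
  classOf[GzipCodec])
```

Setting the same property in core-site.xml on the slaves achieves the same effect without code changes.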
Re: [VOTE] Designating maintainers for some Spark components
On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote: snip Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may not realize the implications of changing something on the Python API, etc. This is just a way to make sure that a domain expert has looked at the areas where it is most likely for something to go wrong. Hi Matei, I understand where you're coming from. My suggestion is to solve this without adding a new process. In the example above, those algo-whiz committers should realize that they're touching the Python API, and loop in some Python maintainers. Those Python maintainers would then respond and help move the PR along. This is good hygiene and should already be happening. For example, HDFS committers have commit rights to all of Hadoop. But none of them would check in YARN code without getting agreement from the YARN folks. I think the majority of the effort here will be education and building the convention. We have to ask committers to watch out for API changes, know their own limits, and involve the component domain experts. We need that anyways, which btw also seems to solve the problem. It's not clear what the new process would add. It'd be good to know the details, too. What are the exact criteria for a committer to get promoted to be a maintainer? How often does the PMC re-evaluate the list of maintainers? Is there an upper bound on the number of maintainers for a component? Can we have an automatic rule for a maintainer promotion after X patches or Y lines of code in that area? Cheers, bc On Nov 6, 2014, at 10:53 AM, bc Wong bcwal...@cloudera.com wrote: Hi Matei, Good call on scaling the project itself. Identifying domain experts in different areas is a good thing. 
But I have some questions about the implementation. Here's my understanding of the proposal: (1) The PMC votes on a list of components and their maintainers. Changes to that list require PMC approval. (2) No committer shall commit changes to a component without a +1 from a maintainer of that component. I see good reasons for #1, to help people navigate the project and identify expertise. For #2, I'd like to understand what problem it's trying to solve. Do we have rogue committers committing to areas that they don't know much about? If that's the case, we should address it directly, instead of adding new processes. To point out the obvious, it completely changes what being a committer means in Spark. Do we have clear promotion criteria from committer to maintainer? Is there a max number of maintainers per area? Currently, as committers gain expertise in new areas, they could start reviewing code in those areas and give +1. This encourages more contributions and cross-component knowledge sharing. Under the new proposal, they now have to be promoted to maintainers first. That reduces our review bandwidth. Again, if there is a quality issue with code reviews, let's talk to those committers and help them do better. There are non-process ways to solve the problem. So I think we shouldn't require maintainer +1. I do like the idea of having explicit maintainers on a volunteer basis. These maintainers should watch their jira and PR traffic, and be very active in design and API discussions. That leads to better consistency and long-term design choices. Cheers, bc On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. 
Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer.
Re: Python3 and spark 1.1.0
Currently, Spark 1.1.0 works with Python 2.6 or higher, but not Python 3. There does seem to be interest, see also this post (http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-on-python-3-td15706.html). I believe Ariel Rokem (cced) has been trying to get it working and might be working on a PR. It would probably be good to create a JIRA ticket for this. — Jeremy - jeremyfreeman.net @thefreemanlab On Nov 6, 2014, at 6:01 PM, catchmonster skacan...@gmail.com wrote: Hi, I am interested in py3 with spark! Simply everything that I am developing in py is happening on the py3 side. Is there a plan to integrate spark 1.1.0 or up with py3... it seems that is not supported in the current latest version ... -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Python3-and-spark-1-1-0-tp9180.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Implementing TinkerPop on top of GraphX
This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can bake into it the OLAP requirements as well. At first glance I believe we could create an entire OLAP system. If so, I believe we could do this in a set of parallel subtasks, those being the implementation of each of the individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary for gremlin-core. Thoughts? From: Kyle Ellrott kellr...@soe.ucsc.edu Date: Thursday, November 6, 2014 at 12:10 PM To: Kushal Datta kushal.da...@gmail.com Cc: Reynold Xin r...@databricks.com, York, Brennon brennon.y...@capitalone.com, dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that to get the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations. I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface?
It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet-based graph store. I've been working to finalize some code details and putting together better code examples and documentation before I started telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279 The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed.
If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
Re: Implementing TinkerPop on top of GraphX
I think I've already done most of the work for the OLTP objects (Graph, Element, Vertex, Edge, Properties) when implementing Tinkerpop2. Singleton write operations, like addVertex/deleteEdge, were cached locally until a read operation was requested, then the set of build operations were parallelized into an RDD and merged with the existing graph. It's not efficient for large numbers of operations, but it passes unit tests and works for small graph tweaking. OLAP stuff looks completely new, but considering they have a Giraph implementation, it should be pretty straightforward. Kyle On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon brennon.y...@capitalone.com wrote: This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can bake into it the OLAP requirements as well. At first glance I believe we could create an entire OLAP system. If so, I believe we could do this in a set of parallel subtasks, those being the implementation of each of the individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary for gremlin-core. Thoughts? From: Kyle Ellrott kellr...@soe.ucsc.edu Date: Thursday, November 6, 2014 at 12:10 PM To: Kushal Datta kushal.da...@gmail.com Cc: Reynold Xin r...@databricks.com, York, Brennon brennon.y...@capitalone.com, dev@spark.apache.org dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that to get the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations.
I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface? It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet-based graph store. I've been working to finalize some code details and putting together better code examples and documentation before I started telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend? Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea!
As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
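Kyle's buffer-then-merge approach above (queue singleton writes locally, then parallelize and merge them into the graph when a read forces a flush) can be sketched in plain Scala. This is a hypothetical illustration of the pattern, not code from sparkgraph; a local Map stands in for the distributed VertexRDD, and all names here are made up:

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch of the buffer-then-merge pattern: single-vertex writes are
// queued locally and only materialized into the graph on the next read.
class BufferedGraph(initial: Map[Long, String]) {
  private var committed: Map[Long, String] = initial
  private val pending = ArrayBuffer.empty[(Long, String)]

  // Cheap local append; no distributed operation happens here.
  def addVertex(id: Long, label: String): Unit =
    pending += ((id, label))

  // In the real implementation this would be sc.parallelize(pending)
  // joined/merged with the existing VertexRDD in one batch.
  private def flush(): Unit =
    if (pending.nonEmpty) {
      committed = committed ++ pending
      pending.clear()
    }

  // Reads force a flush, so callers always see a consistent graph.
  def vertices: Map[Long, String] = { flush(); committed }
}

val g = new BufferedGraph(Map(1L -> "a"))
g.addVertex(2L, "b")
g.addVertex(3L, "c")
assert(g.vertices.size == 3) // flush happened on the first read
```

The point of batching is that each flush costs one RDD-construction-and-merge regardless of how many writes were queued, which is why Kyle notes the scheme works for small graph tweaking but is inefficient if every write triggered its own merge.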
Re: [VOTE] Designating maintainers for some Spark components
I think new committers might or might not be maintainers (it would depend on the PMC vote). I don't think it would affect what you could merge, you can merge in any part of the source tree, you just need to get sign-off if you want to touch a public API or make major architectural changes. Most projects already require code review from other committers before you commit something, so it's just a version of that where you have specific people appointed to specific components for review. If you look, most large software projects have a maintainer model, both in Apache and outside of it. CloudStack is probably the best example in Apache since they are the second-most active project (roughly) after Spark. They have two levels of maintainers and much stronger language -- in their words: "In general, maintainers only have commit rights on the module for which they are responsible." I'd like us to start with something simpler and lightweight as proposed here. Really the proposal on the table is just to codify the current de-facto process to make sure we stick by it as we scale. If we want to add more formality to it or strictness, we can do it later. - Patrick On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: How would this model work with a new committer who gets voted in? Does it mean that a new committer would be a maintainer for at least one area -- else we could end up having committers who really can't merge anything significant until he becomes a maintainer. Thanks, Hari On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I think you're misunderstanding the idea of process here. The point of process is to make sure something happens automatically, which is useful to ensure a certain level of quality. For example, all our patches go through Jenkins, and nobody will make the mistake of merging them if they fail tests, or RAT checks, or API compatibility checks.
The idea is to get the same kind of automation for design on these components. This is a very common process for large software projects, and it's essentially what we had already, but formalizing it will make clear that this is the process we want. It's important to do it early in order to be able to refine the process as the project grows. In terms of scope, again, the maintainers are *not* going to be the only reviewers for that component, they are just a second level of sign-off required for architecture and API. Being a maintainer is also not a promotion, it's a responsibility. Since we don't have much experience yet with this model, I didn't propose automatic rules beyond that the PMC can add / remove maintainers -- presumably the PMC is in the best position to know what the project needs. I think automatic rules are exactly the kind of process you're arguing against. The process here is about ensuring certain checks are made for every code change, not about automating personnel and development decisions. In any case, I appreciate your input on this, and we're going to evaluate the model to see how it goes. It might be that we decide we don't want it at all. However, from what I've seen of other projects (not Hadoop but projects with an order of magnitude more contributors, like Python or Linux), this is one of the best ways to have consistently great releases with a large contributor base and little room for error. With all due respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive for; in my experience there I've seen patches reverted because of architectural disagreements, new APIs released and abandoned, and generally an experience that's been painful for users. A lot of the decisions we've made in Spark (e.g. time-based release cycle, built-in libraries, API stability rules, etc) were based on lessons learned there, in an attempt to define a better model. 
Matei On Nov 6, 2014, at 2:18 PM, bc Wong bcwal...@cloudera.com wrote: On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote: snip Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may not realize the implications of changing something on the Python API, etc. This is just a way to make sure that a domain expert has looked at the areas where it is most likely for something to go wrong. Hi Matei, I understand where you're coming from. My suggestion is to solve this without adding a new process. In the example above, those algo-whiz committers should realize that they're touching the Python API, and loop in some Python maintainers. Those Python maintainers would then respond and help move the PR along. This is good
Re: Implementing TinkerPop on top of GraphX
Before we dive into the implementation details, what are the high-level thoughts on Gremlin/GraphX? Scala already provides the procedural way to query graphs in GraphX today. So, today I can run g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin offers such as outE, inE, loop, as, dedup, etc. In that case, is mapping Gremlin operators to GraphX APIs a better approach, or should we extend the existing set of transformations/actions that GraphX already offers with the useful operators from Gremlin? For example, we add as(), loop() and dedup() methods in VertexRDD and EdgeRDD. Either way we get a desperately needed graph query interface in GraphX. On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon brennon.y...@capitalone.com wrote: This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can bake into it the OLAP requirements as well. At first glance I believe we could create an entire OLAP system. If so, I believe we could do this in a set of parallel subtasks, those being the implementation of each of the individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary for gremlin-core. Thoughts? From: Kyle Ellrott kellr...@soe.ucsc.edu Date: Thursday, November 6, 2014 at 12:10 PM To: Kushal Datta kushal.da...@gmail.com Cc: Reynold Xin r...@databricks.com, York, Brennon brennon.y...@capitalone.com, dev@spark.apache.org dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that to get the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack.
The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations. I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface? It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of the Gremlin Search Language and a Parquet-based graph store. I've been working to finalize some code details and putting together better code examples and documentation before I started telling people about it. But if you want to start looking at the code, I can answer any questions you have. And if you would like to contribute, I would really appreciate the help. Kyle On Thu, Nov 6, 2014 at 11:42 AM, Reynold Xin r...@databricks.com wrote: cc Matthias In the past we talked with Matthias and there were some discussions about this. On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon brennon.y...@capitalone.com wrote: All, was wondering if there had been any discussion around this topic yet? TinkerPop https://github.com/tinkerpop is a great abstraction for graph databases and has been implemented across various graph database backends / gaining traction. Has anyone thought about integrating the TinkerPop framework with GraphX to enable GraphX as another backend?
Not sure if this has been brought up or not, but would certainly volunteer to spearhead this effort if the community thinks it to be a good idea! As an aside, wasn't sure if this discussion should happen on the board here or on JIRA, but I made a ticket as well for reference: https://issues.apache.org/jira/browse/SPARK-4279
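Kushal's second option -- extending GraphX's own types with Gremlin-style operators -- would naturally use the same implicit-conversion pattern Spark uses for rddToPairRDDFunctions. A hypothetical sketch in plain Scala (a Seq of (id, attribute) pairs stands in for VertexRDD; the `dedup` here is modeled on Gremlin's dedup step, and none of these names come from an actual GraphX or TinkerPop API):

```scala
// Illustrative "pimp my library" enrichment: an implicit class adds a
// Gremlin-style dedup operator to a vertex collection. With a real
// VertexRDD[VD] the body would be a distributed de-duplication instead
// of this in-memory scan.
object GremlinOps {
  implicit class VertexOps[VD](val verts: Seq[(Long, VD)]) {
    // Keep only the first vertex seen for each attribute value,
    // mirroring Gremlin's dedup step.
    def dedup: Seq[(Long, VD)] = {
      val seen = scala.collection.mutable.Set.empty[VD]
      verts.filter { case (_, attr) => seen.add(attr) }
    }
  }
}

import GremlinOps._
val vs = Seq((1L, "person"), (2L, "person"), (3L, "place"))
assert(vs.dedup.map(_._1) == Seq(1L, 3L))
```

The appeal of this approach is that, like PairRDDFunctions, the extra operators live outside the core classes: GraphX's VertexRDD and EdgeRDD stay unchanged, and anyone importing the enrichment gets the Gremlin-flavored vocabulary for free.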
Re: [VOTE] Designating maintainers for some Spark components
In CloudStack, I believe one becomes a maintainer first for a subset of modules, before he/she becomes a proven maintainer who has commit rights on the entire source tree. So would it make sense to go that route, and have committers voted in as maintainers for certain parts of the codebase and then eventually become proven maintainers (though this might have to be honor-code based, since I don’t think git allows per-module commit rights)? Thanks, Hari On Thu, Nov 6, 2014 at 3:45 PM, Patrick Wendell pwend...@gmail.com wrote: I think new committers might or might not be maintainers (it would depend on the PMC vote). I don't think it would affect what you could merge, you can merge in any part of the source tree, you just need to get sign-off if you want to touch a public API or make major architectural changes. Most projects already require code review from other committers before you commit something, so it's just a version of that where you have specific people appointed to specific components for review. If you look, most large software projects have a maintainer model, both in Apache and outside of it. CloudStack is probably the best example in Apache since they are the second-most active project (roughly) after Spark. They have two levels of maintainers and much stronger language -- in their words: "In general, maintainers only have commit rights on the module for which they are responsible." I'd like us to start with something simpler and lightweight as proposed here. Really the proposal on the table is just to codify the current de-facto process to make sure we stick by it as we scale. If we want to add more formality to it or strictness, we can do it later. - Patrick On Thu, Nov 6, 2014 at 3:29 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: How would this model work with a new committer who gets voted in?
Does it mean that a new committer would be a maintainer for at least one area -- else we could end up having committers who really can't merge anything significant until he becomes a maintainer. Thanks, Hari On Thu, Nov 6, 2014 at 3:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I think you're misunderstanding the idea of process here. The point of process is to make sure something happens automatically, which is useful to ensure a certain level of quality. For example, all our patches go through Jenkins, and nobody will make the mistake of merging them if they fail tests, or RAT checks, or API compatibility checks. The idea is to get the same kind of automation for design on these components. This is a very common process for large software projects, and it's essentially what we had already, but formalizing it will make clear that this is the process we want. It's important to do it early in order to be able to refine the process as the project grows. In terms of scope, again, the maintainers are *not* going to be the only reviewers for that component, they are just a second level of sign-off required for architecture and API. Being a maintainer is also not a promotion, it's a responsibility. Since we don't have much experience yet with this model, I didn't propose automatic rules beyond that the PMC can add / remove maintainers -- presumably the PMC is in the best position to know what the project needs. I think automatic rules are exactly the kind of process you're arguing against. The process here is about ensuring certain checks are made for every code change, not about automating personnel and development decisions. In any case, I appreciate your input on this, and we're going to evaluate the model to see how it goes. It might be that we decide we don't want it at all. 
However, from what I've seen of other projects (not Hadoop but projects with an order of magnitude more contributors, like Python or Linux), this is one of the best ways to have consistently great releases with a large contributor base and little room for error. With all due respect to what Hadoop's accomplished, I wouldn't use Hadoop as the best example to strive for; in my experience there I've seen patches reverted because of architectural disagreements, new APIs released and abandoned, and generally an experience that's been painful for users. A lot of the decisions we've made in Spark (e.g. time-based release cycle, built-in libraries, API stability rules, etc) were based on lessons learned there, in an attempt to define a better model. Matei On Nov 6, 2014, at 2:18 PM, bc Wong bcwal...@cloudera.com wrote: On Thu, Nov 6, 2014 at 11:25 AM, Matei Zaharia matei.zaha...@gmail.com wrote: snip Ultimately, the core motivation is that the project has grown to the point where it's hard to expect every committer to have full understanding of every component. Some committers know a ton about systems but little about machine learning, some are algorithmic whizzes but may
Re: [VOTE] Designating maintainers for some Spark components
-1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and if/when absolutely necessary, to vetos. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list.
Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start with this in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful.
The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another maintainer. If you'd like to see examples for this model, check out the following projects: - CloudStack: https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
Re: [VOTE] Designating maintainers for some Spark components
Hey Greg, Regarding Subversion - I think the reference is to partial vs. full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and if/when absolutely necessary, to vetos. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite.
Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. 
We'd like to start with this in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g. they might just sign off on architecture / API). To prevent inactive maintainers from blocking the project, if a maintainer isn't responding in a reasonable time period (say 2 weeks), other committers can merge the patch, and the PMC will want to discuss adding another
Re: [VOTE] Designating maintainers for some Spark components
In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. 
Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 
2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews, etc, rather than having patches slip between the cracks. We'd like to start with this in a light-weight manner, where the model only applies to certain key components (e.g. scheduler, shuffle) and user-facing APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand it if we deem it useful. The specific mechanics would be as follows: - Some components in Spark will have maintainers assigned to them, where one of the maintainers needs to sign off on each patch to the component. - Each component with maintainers will have at least 2 maintainers. - Maintainers will be assigned from the most active and knowledgeable committers on that component by the PMC. The PMC can vote to add / remove maintainers, and maintained components, through consensus. - Maintainers are expected to be active in responding to patches for their components, though they do not need to be the main reviewers for them (e.g.
Re: Implementing TinkerPop on top of GraphX
My personal 2c is that, since GraphX is just beginning to provide a full featured graph API, I think it would be better to align with the TinkerPop group rather than roll our own. In my mind the benefits outweigh the detriments as follows: Benefits: * GraphX gains the ability to become another core tenant within the TinkerPop community allowing a more diverse group of users into the Spark ecosystem. * TinkerPop can continue to maintain and own a solid / feature-rich graph API that has already been accepted by a wide audience, relieving the pressure of “one off” API additions from the GraphX team. * GraphX can demonstrate its ability to be a key player in the GraphDB space sitting inline with other major distributions (Neo4j, Titan, etc.). * Allows for the abstract graph traversal logic (query API) to be owned and maintained by a group already proven on the topic. Drawbacks: * GraphX doesn’t own the API for its graph query capability. This could be seen as good or bad, but it might make GraphX-specific implementation additions more tricky (possibly). Also, GraphX will need to maintain the features described within the TinkerPop API as that might change in the future. From: Kushal Datta kushal.da...@gmail.com Date: Thursday, November 6, 2014 at 4:00 PM To: York, Brennon brennon.y...@capitalone.com Cc: Kyle Ellrott kellr...@soe.ucsc.edu, Reynold Xin r...@databricks.com, dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX Before we dive into the implementation details, what are the high level thoughts on Gremlin/GraphX? Scala already provides the procedural way to query graphs in GraphX today. 
So, today I can run g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin offers such as outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators to GraphX APIs a better approach or should we extend the existing set of transformations/actions that GraphX already offers with the useful operators from Gremlin? For example, we add as(), loop() and dedup() methods in VertexRDD and EdgeRDD. Either way we get a desperately needed graph query interface in GraphX. On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon brennon.y...@capitalone.com wrote: This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can bake into it the OLAP requirements as well. At a first glance I believe we could create an entire OLAP system. If so, I believe we could do this in a set of parallel subtasks, those being the implementation of each of the individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary for gremlin-core. Thoughts? 
From: Kyle Ellrott kellr...@soe.ucsc.edu Date: Thursday, November 6, 2014 at 12:10 PM To: Kushal Datta kushal.da...@gmail.com Cc: Reynold Xin r...@databricks.com, York, Brennon brennon.y...@capitalone.com, dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that getting the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations. I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface? It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've taken a crack at implementing the TinkerPop Blueprints API in GraphX ( https://github.com/kellrott/sparkgraph ). I've also implemented portions of
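As a rough illustration of the operator mapping discussed above, Gremlin-style steps such as out and dedup can be expressed as plain collection transformations in the style of VertexRDD/EdgeRDD operations. The sketch below is pure Scala with no Spark or TinkerPop dependency; the Edge class and helper names are invented for illustration and are not part of either API.

```scala
// Illustrative sketch only: two Gremlin-style steps (out, dedup) written as
// plain collection transformations. Edge and the helpers are invented names,
// not GraphX or TinkerPop APIs.
object GremlinSketch {
  case class Edge(src: Long, dst: Long)

  // Gremlin-style `out`: follow outgoing edges from a set of vertices.
  def out(vertices: Seq[Long], edges: Seq[Edge]): Seq[Long] =
    edges.filter(e => vertices.contains(e.src)).map(_.dst)

  // Gremlin-style `dedup`: drop duplicate vertices, preserving order.
  def dedup(vertices: Seq[Long]): Seq[Long] = vertices.distinct

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1L, 2L), Edge(1L, 3L), Edge(2L, 3L), Edge(3L, 2L))
    // Roughly g.V(1).out().out().dedup() in Gremlin terms.
    println(dedup(out(out(Seq(1L), edges), edges)))  // List(3, 2)
  }
}
```

Whether steps like these should become new methods on VertexRDD/EdgeRDD or live behind a separate gremlin-core binding is exactly the design question the thread raises.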
Re: MatrixFactorizationModel predict(Int, Int) API
I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{ case (user, product) => val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{ x => x.product == product } require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem => if (allRatings.get(elem.user, elem.product) != elem.rating) fail("Prediction APIs don't match") } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head work? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb
Re: MatrixFactorizationModel predict(Int, Int) API
ALS model contains RDDs, so you cannot put `model.recommendProducts` inside an RDD closure like `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{ case (user, product) => val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{ x => x.product == product } require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem => if (allRatings.get(elem.user, elem.product) != elem.rating) fail("Prediction APIs don't match") } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head work? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test-cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code.. 
Do I need to cache the model to make it work ? I think right now default is STORAGE_AND_DISK... Thanks. Deb - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
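For reference, the restructuring Xiangrui's reply points at can be sketched as follows: because the model's factor matrices live in RDDs, recommendProducts has to run on the driver, never inside another RDD's closure. The sketch below stubs the model with plain maps so it runs without Spark; StubModel and its score map are invented stand-ins, not the real MatrixFactorizationModel.

```scala
// Hedged sketch of the driver-side pattern: collect the evaluation users
// first, then call recommendProducts in a plain loop on the driver. StubModel
// is an invented stand-in; a real MatrixFactorizationModel holds its factors
// in RDDs, which is exactly why it cannot be used inside another RDD's map().
object DriverSideRecs {
  case class Rating(user: Int, product: Int, rating: Double)

  // Stand-in for MatrixFactorizationModel.recommendProducts(user, num).
  class StubModel(scores: Map[(Int, Int), Double]) {
    def recommendProducts(user: Int, num: Int): Seq[Rating] =
      scores.collect { case ((u, p), r) if u == user => Rating(u, p, r) }
        .toSeq.sortBy(-_.rating).take(num)
  }

  // Driver-side loop: `users` would come from something like
  // userProductsRDD.keys.distinct().collect(); each call is then a local
  // operation on the model, never a nested distributed job.
  def topRecsForUsers(model: StubModel, users: Seq[Int], num: Int): Map[Int, Seq[Rating]] =
    users.map(u => u -> model.recommendProducts(u, num)).toMap
}
```

Collecting 20% of all users to the driver may of course be expensive, which is the scalability concern raised in the follow-up messages.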
Re: [VOTE] Designating maintainers for some Spark components
+1 (non-binding) [for original process proposal] Greg, the first time I've seen the word ownership on this thread is in your message. The first time the word lead has appeared in this thread is in your message as well. I don't think that was the intent. The PMC and Committers have a responsibility to the community to make sure that their patches are being reviewed and committed. I don't see in Apache's recommended bylaws anywhere that says establishing responsibility on paper for specific areas cannot be taken on by different members of the PMC. What's been proposed looks, to me, to be an empirical process and it looks like it has pretty much a consensus from the side able to give binding votes. I don't at all think this model establishes any form of ownership over anything. I also don't see in the process proposal where it mentions that nobody other than the persons responsible for a module can review or commit code. In fact, I'll go as far as to say that since Apache is a meritocracy, the people who have been aligned to the responsibilities probably were aligned based on some sort of merit, correct? Perhaps we could dig in and find out for sure... I'm still getting familiar with the Spark community myself. On Thu, Nov 6, 2014 at 7:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. 
Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. 
As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two
Re: [VOTE] Designating maintainers for some Spark components
Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for *every* line of code, every PMC member should have complete rights to every line of code. Creating disparity flies in the face of a PMC member's responsibility. If I am a Spark PMC member, then I have responsibility for GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And interposing a barrier inhibits my responsibility to ensure GraphX is designed, maintained, and delivered to the Public. Cheers, -g (and yes, I'm aware of COMMITTERS; I've been changing that file for the past 12 years :-) ) On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. 
The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us, can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. 
In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its maintainers. Most existing large projects do this -- at Apache, some large ones with this model are CloudStack (the second-most active project overall), Subversion, and Kafka, and other examples include Linux and Python. This is also by-and-large how Spark operates today -- most components have a de-facto maintainer. IMO, adopting this model would have two benefits: 1) Consistent oversight of design for that component, especially regarding architecture and API. This process would ensure that the component's maintainers see all proposed changes and consider them to fit together in a good way. 2) More structure for new contributors and committers -- in particular, it would be easy to look up who's responsible for each module and ask them for reviews,
Re: [VOTE] Designating maintainers for some Spark components
So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on https://people.apache.org/committers-by-project.html#subversion, so I assume they are committers. As far as I can see, CloudStack is similar. Matei On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote: Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for *every* line of code, then every PMC member should have complete rights to every line of code. Creating disparity flies in the face of a PMC member's responsibility. If I am a Spark PMC member, then I have responsibility for GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And interposing a barrier inhibits my responsibility to ensure GraphX is designed, maintained, and delivered to the Public. 
Cheers, -g (and yes, I'm aware of COMMITTERS; I've been changing that file for the past 12 years :-) ) On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. 
Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal architecture and public APIs), and to this end I've proposed implementing a maintainer model for some of these components, similar to other large projects. As background on this, Spark has grown a lot since joining Apache. We've had over 80 contributors/month for the past 3 months, which I believe makes us the most active project in contributors/month at Apache, as well as over 500 patches/month. The codebase has also grown significantly, with new libraries for SQL, ML, graphs and more. In this kind of large project, one common way to scale development is to assign maintainers to oversee key components, where each patch to that component needs to get sign-off from at least one of its
Re: MatrixFactorizationModel predict(Int, Int) API
model.recommendProducts can only be called from the master then? I have a set of 20% users on whom I am performing the test... the 20% users are in an RDD... if I have to collect them all to the master node and then call model.recommendProducts, that's an issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs, so you cannot put `model.recommendProducts` inside an RDD closure like `userProductsRDD.map`. -Xiangrui On Thu, Nov 6, 2014 at 4:39 PM, Debasish Das debasish.da...@gmail.com wrote: I reproduced the problem in mllib tests ALSSuite.scala using the following functions: val arrayPredict = userProductsRDD.map{ case (user, product) => val recommendedProducts = model.recommendProducts(user, products) val productScore = recommendedProducts.find{ x => x.product == product } require(productScore != None) productScore.get }.collect arrayPredict.foreach { elem => if (allRatings.get(elem.user, elem.product) != elem.rating) fail("Prediction APIs don't match") } If the usage of model.recommendProducts is correct, the test fails with the same error I sent before... org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 316.0 failed 1 times, most recent failure: Lost task 0.0 in stage 316.0 (TID 79, localhost): scala.MatchError: null org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:825) org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:81) It is a blocker for me and I am debugging it. I will open up a JIRA if this is indeed a bug... Do I have to cache the models to make userFeatures.lookup(user).head work? On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. 
-Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but the code fails on userFeatures.lookup(user).head In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) has been called and in all the test cases that API has been used... I can perhaps refactor my code to do the same but I was wondering whether people test the lookup(user) version of the code... Do I need to cache the model to make it work ? I think right now the default is MEMORY_AND_DISK... Thanks. Deb
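For readers following the API details in this thread: predict(user, product) ultimately reduces to a dot product between the user's and product's latent-factor vectors, and userFeatures.lookup(user).head is the step that fetches the user's vector. Below is a minimal Python sketch of that math only; all names and factor values are invented for illustration, and a plain dict stands in for the factor RDDs:

```python
# Sketch of the math behind MatrixFactorizationModel.predict(user, product):
# the predicted rating is the dot product of two latent-factor vectors.
# Factor values are made up; a dict stands in for the userFeatures RDD.

user_features = {1: [0.5, 1.2, -0.3], 2: [1.0, 0.0, 0.7]}
product_features = {10: [0.8, 0.4, 1.1], 11: [-0.2, 0.9, 0.3]}

def predict(user, product):
    # A missing user raises KeyError here, roughly where the real API
    # surfaces the scala.MatchError / return-NaN question discussed above.
    u = user_features[user]
    p = product_features[product]
    return sum(a * b for a, b in zip(u, p))

print(round(predict(1, 10), 2))  # 0.5*0.8 + 1.2*0.4 + (-0.3)*1.1 = 0.55
```

Seen this way, Xiangrui's suggested guard (return NaN when the user is absent from the model) is cheap: it is just a membership check before the factor lookup.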
Re: MatrixFactorizationModel predict(Int, Int) API
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066 The easiest case is when one side is small. If both sides are large, this is a super-expensive operation. We can do block-wise cross product and then find top-k for each user. Best, Xiangrui
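Xiangrui's block-wise idea can be sketched without Spark: score products block by block and keep only a bounded top-k heap per user, so the full user × product cross product never materializes at once. Everything below (function name, data, block size) is a hypothetical illustration, not MLlib code:

```python
import heapq

def topk_recommend(user_feats, prod_feats, k, block=2):
    """Block-wise top-k: walk product factors in small blocks and keep a
    size-k min-heap of (score, product_id) per user. Illustrative only."""
    prods = list(prod_feats.items())
    best = {u: [] for u in user_feats}
    for start in range(0, len(prods), block):          # one product block at a time
        for u, uf in user_feats.items():
            for pid, pf in prods[start:start + block]:
                score = sum(a * b for a, b in zip(uf, pf))
                heapq.heappush(best[u], (score, pid))
                if len(best[u]) > k:
                    heapq.heappop(best[u])             # evict the current worst
    # sort each user's survivors by descending score
    return {u: [pid for _, pid in sorted(h, reverse=True)] for u, h in best.items()}

users = {1: [1.0, 0.0], 2: [0.0, 1.0]}
products = {10: [0.9, 0.1], 11: [0.2, 0.8], 12: [0.5, 0.5]}
print(topk_recommend(users, products, k=2))  # {1: [10, 12], 2: [11, 12]}
```

In Spark terms, each user-block × product-block pair would be one task, with the per-user heaps merged in a reduce step; the working memory per task is then O(k) per user rather than O(#products).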
Re: [VOTE] Designating maintainers for some Spark components
The PMC [1] is responsible for oversight and does not designate partial or full committers. There are projects where all committers become PMC and others where PMC is reserved for committers with the most merit (and willingness to take on the responsibility of project oversight, releases, etc...). The community maintains the codebase through committers. Committers mentor, roll in patches, and spread the project throughout other communities. Adding someone's name to a list as a maintainer is not a barrier. With a community as large as Spark's, and myself not being a committer on this project, I see it as a welcome opportunity to find a mentor in the areas in which I'm interested in contributing. We'd expect the list of names to grow as more volunteers gain more interest, correct? To me, that seems quite contrary to a barrier. [1] http://www.apache.org/dev/pmc.html On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on https://people.apache.org/committers-by-project.html#subversion, so I assume they are committers. As far as I can see, CloudStack is similar. Matei On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote: Partial committers are people invited to work on a particular area, and they do not require sign-off to work on that area. They can get a sign-off and commit outside that area. That approach doesn't compare to this proposal. Full committers are PMC members. As each PMC member is responsible for *every* line of code, then every PMC member should have complete rights to every line of code. Creating disparity flies in the face of a PMC member's responsibility.
If I am a Spark PMC member, then I have responsibility for GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And interposing a barrier inhibits my responsibility to ensure GraphX is designed, maintained, and delivered to the Public. Cheers, -g (and yes, I'm aware of COMMITTERS; I've been changing that file for the past 12 years :-) ) On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com wrote: In fact, if you look at the Subversion committer list, the majority of people here have commit access only for particular areas of the project: http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Greg, Regarding Subversion - I think the reference is to partial vs full committers here: https://subversion.apache.org/docs/community-guide/roles.html - Patrick On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this.
We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us, can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite. Regards, Greg Stein Director, Vice Chairman Apache Software Foundation On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote: Hi all, I wanted to share a
Re: [VOTE] Designating maintainers for some Spark components
It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving non-committers some power to make changes -Sandy
Re: Implementing TinkerPop on top of GraphX
I think it's best to look to an existing standard rather than try to make your own. Of course small additions would need to be added to make it valuable for the Spark community, like a method similar to Gremlin's 'table' function that produces an RDD instead. But there may be a lot of extra code and data structures that would need to be added to make it work, and those may not be directly applicable to all GraphX users. I think it would be best run as a separate module/project that builds directly on top of GraphX. Kyle On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon brennon.y...@capitalone.com wrote: My personal 2c is that, since GraphX is just beginning to provide a full-featured graph API, I think it would be better to align with the TinkerPop group rather than roll our own. In my mind the benefits outweigh the detriments as follows:
Benefits:
* GraphX gains the ability to become another core tenant within the TinkerPop community, allowing a more diverse group of users into the Spark ecosystem.
* TinkerPop can continue to maintain and own a solid / feature-rich graph API that has already been accepted by a wide audience, relieving the pressure of “one off” API additions from the GraphX team.
* GraphX can demonstrate its ability to be a key player in the GraphDB space, sitting inline with other major distributions (Neo4j, Titan, etc.).
* Allows for the abstract graph traversal logic (query API) to be owned and maintained by a group already proven on the topic.
Drawbacks:
* GraphX doesn’t own the API for its graph query capability. This could be seen as good or bad, but it might make GraphX-specific implementation additions more tricky (possibly). Also, GraphX will need to maintain the features described within the TinkerPop API as that might change in the future.
From: Kushal Datta kushal.da...@gmail.com Date: Thursday, November 6, 2014 at 4:00 PM To: York, Brennon brennon.y...@capitalone.com Cc: Kyle Ellrott kellr...@soe.ucsc.edu, Reynold Xin r...@databricks.com, dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX Before we dive into the implementation details, what are the high-level thoughts on Gremlin/GraphX? Scala already provides the procedural way to query graphs in GraphX today. So, today I can run g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3 Gremlin, of course sans the useful operators that Gremlin offers such as outE, inE, loop, as, dedup, etc. In that case, is mapping Gremlin operators to GraphX APIs a better approach, or should we extend the existing set of transformations/actions that GraphX already offers with the useful operators from Gremlin? For example, we add as(), loop() and dedup() methods in VertexRDD and EdgeRDD. Either way we get a desperately needed graph query interface in GraphX. On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon brennon.y...@capitalone.com wrote: This was my thought exactly with the TinkerPop3 release. Looks like, to move this forward, we’d need to implement gremlin-core per http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core. The real question lies in whether GraphX can only support the OLTP functionality, or if we can bake into it the OLAP requirements as well. At a first glance I believe we could create an entire OLAP system. If so, I believe we could do this in a set of parallel subtasks, those being the implementation of each of the individual APIs (Structure, Process, and, if OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
From: Kyle Ellrott kellr...@soe.ucsc.edu Date: Thursday, November 6, 2014 at 12:10 PM To: Kushal Datta kushal.da...@gmail.com Cc: Reynold Xin r...@databricks.com, York, Brennon brennon.y...@capitalone.com, dev@spark.apache.org, Matthias Broecheler matth...@thinkaurelius.com Subject: Re: Implementing TinkerPop on top of GraphX I still have to dig into the Tinkerpop3 internals (I started my work long before it had been released), but I can say that getting the Tinkerpop2 Gremlin pipeline to work in GraphX was a bit of a hack. The whole Tinkerpop2 Gremlin design was based around streaming pipes of data, rather than large distributed map-reduce operations. I had to hack the pipes to aggregate all of the data and pass a single object wrapping the GraphX RDDs down the pipes in a single go, rather than streaming it element by element. Just based on their description, Tinkerpop3 may be more amenable to the Spark platform. Kyle On Thu, Nov 6, 2014 at 11:55 AM, Kushal Datta kushal.da...@gmail.com wrote: What do you guys think about the Tinkerpop3 Gremlin interface? It has MapReduce to run Gremlin operators in a distributed manner and Giraph to execute vertex programs. Tinkerpop3 is better suited for GraphX. On Thu, Nov 6, 2014 at 11:48 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:
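The streaming-pipes vs bulk-operations mismatch Kyle describes can be made concrete with a toy example: a Gremlin-style fluent pipeline evaluated as whole-collection transformations (the way an RDD-backed engine would run it), rather than element by element. The Pipeline class and graph data below are hypothetical illustrations, not TinkerPop or GraphX APIs:

```python
# Hypothetical sketch: Gremlin-like chained operators (filter, outE, dedup)
# evaluated as bulk transformations over a whole collection, mirroring how
# an RDD-backed backend would run them. Not a real TinkerPop/GraphX API.

class Pipeline:
    def __init__(self, items):
        self.items = list(items)          # the whole "RDD" travels as one batch

    def filter(self, pred):
        return Pipeline(x for x in self.items if pred(x))

    def out_e(self, edges):
        # expand each vertex into its outgoing edges (Gremlin's outE, in bulk)
        return Pipeline(e for v in self.items for e in edges.get(v, []))

    def dedup(self):
        seen, out = set(), []
        for x in self.items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return Pipeline(out)

    def to_list(self):
        return self.items

vertices = [1, 2, 3]
edges = {1: [(1, 2), (1, 3)], 2: [(2, 3)]}
result = Pipeline(vertices).filter(lambda v: v != 3).out_e(edges).dedup().to_list()
print(result)  # [(1, 2), (1, 3), (2, 3)]
```

Each step here consumes its entire input before the next step starts, which is why Kyle had to wrap the GraphX RDDs in a single object and push them down the Tinkerpop2 pipes "in a single go".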
Re: Implementing TinkerPop on top of GraphX
Some form of graph querying support would be great to have. This can be a great community project hosted outside of Spark initially, both due to the maturity of the component itself as well as the maturity of query language standards (there isn't really a dominant standard for graph ql). One thing is that GraphX API will need to evolve and probably need to provide more primitives in order to support the new ql implementation. There might also be inherent mismatches in the way the external API is defined vs what GraphX can support. We should discuss those on a case-by-case basis.
Re: [VOTE] Designating maintainers for some Spark components
My 2 cents: Spark since pre-Apache days has been the most friendly and welcoming open source project I've seen, and that's reflected in its success. It seems pretty obvious to me that, for example, Michael should be looking at major changes to the SQL codebase. I trust him to do that in a way that's technically and socially appropriate. What Matei is saying makes sense, regardless of whether it gets codified in a process. On Thu, Nov 6, 2014 at 7:46 PM, Arun C Murthy a...@hortonworks.com wrote: With my ASF Member hat on, I fully agree with Greg. As he points out, this is an anti-pattern in the ASF and is severely frowned upon. We, in Hadoop, had a similar trajectory where we were politely told to move away from having sub-project committers (HDFS, MapReduce etc.) to a common list of committers. There were some concerns initially, but we have successfully managed to work together and build a healthier community as a result of following the advice on the ASF Way. I do have sympathy for good oversight etc. as the project grows and attracts many contributors - it's essentially the need to have smaller, well-knit developer communities. One way to achieve that would be to have separate TLPs (e.g. Spark, MLLIB, GraphX) with separate committer lists, each representing the appropriate community. Hadoop went a similar route where we had Pig, Hive, HBase etc. as sub-projects initially and then split them into TLPs with more focussed communities, to the benefit of everyone. Maybe you guys want to try this too? Few more observations: # In general, *discussions* on project directions (such as the new concept of *maintainers*) should happen first on the public lists *before* voting, not on the private PMC list. # If you choose to go this route in spite of this advice, it seems to me Spark would be better off having more maintainers per component (at least 4-5), probably with a lot more diversity in terms of affiliations.
Not sure if that is a concern - do you have good diversity in the proposed list? This will ensure that there are no concerns about a dominant employer controlling a project. Hope this helps - we've gone through a similar journey, got through similar issues and fully embraced the Apache Way (™), as Greg points out, to our benefit. thanks, Arun On Nov 6, 2014, at 4:18 PM, Greg Stein gst...@gmail.com wrote: -1 (non-binding) This is an idea that runs COMPLETELY counter to the Apache Way, and is to be severely frowned upon. This creates *unequal* ownership of the codebase. Each Member of the PMC should have *equal* rights to all areas of the codebase under their purview. It should not be subjected to others' ownership except through the standard mechanisms of reviews and, if/when absolutely necessary, vetoes. Apache does not want leads, benevolent dictators or assigned maintainers, no matter how you may dress it up with multiple maintainers per component. The fact is that this creates an unequal level of ownership and responsibility. The Board has shut down projects that attempted or allowed for Leads. Just a few months ago, there was a problem with somebody calling themself a Lead. I don't know why you suggest that Apache Subversion does this. We absolutely do not. Never have. Never will. The Subversion codebase is owned by all of us, and we all care for every line of it. Some people know more than others, of course. But any one of us can change any part, without being subjected to a maintainer. Of course, we ask people with more knowledge of the component when we feel uncomfortable, but we also know when it is safe or not to make a specific change. And *always*, our fellow committers can review our work and let us know when we've done something wrong. Equal ownership reduces fiefdoms, enhances a feeling of community and project ownership, and creates a more open and inviting project. So again: -1 on this entire concept. Not good, to be polite.
Re: [VOTE] Designating maintainers for some Spark components
I'm actually going to change my non-binding to +0 for the proposal as-is. I overlooked some parts of the original proposal that, when reading over them again, do not sit well with me. "One of the maintainers needs to sign off on each patch to the component", as Greg has pointed out, does seem to imply that there are committers with more power than others with regards to specific components - which does imply ownership. My thinking would be to re-work it in some way so as to take out the accent on ownership. I would maybe focus on things such as: 1) Other committers and contributors being forced to consult with maintainers of modules before patches can get rolled in. 2) Maintainers being assigned specifically from the PMC. 3) Oversight having more accent on keeping the community happy in a specific area of interest vice being a consultant for the design of a specific piece. On Thu, Nov 6, 2014 at 8:46 PM, Arun C Murthy a...@hortonworks.com wrote: With my ASF Member hat on, I fully agree with Greg. As he points out, this is an anti-pattern in the ASF and is severely frowned upon. We, in Hadoop, had a similar trajectory where we were politely told to move away from having sub-project committers (HDFS, MapReduce etc.) to a common list of committers. There were some concerns initially, but we have successfully managed to work together and build a healthier community as a result of following the advice on the ASF Way. I do have sympathy for good oversight etc. as the project grows and attracts many contributors - it's essentially the need to have smaller, well-knit developer communities. One way to achieve that would be to have separate TLPs (e.g. Spark, MLLIB, GraphX) with separate committer lists, each representing the appropriate community. Hadoop went a similar route where we had Pig, Hive, HBase etc. as sub-projects initially and then split them into TLPs with more focussed communities, to the benefit of everyone. Maybe you guys want to try this too?
Re: [VOTE] Designating maintainers for some Spark components
[ I'm going to try and pull a couple of thread directions into this one, to avoid explosion :-) ] On Thu, Nov 6, 2014 at 6:44 PM, Corey Nolet cjno...@gmail.com wrote: Note: I'm going to use "you" generically; I understand you [Corey] are not a PMC member, at this time. +1 (non-binding) [for original process proposal] Greg, the first time I've seen the word ownership on this thread is in your message. The first time the word lead has appeared in this thread is in your message as well. I don't think that was the intent. The PMC and Committers have a The concept of ownership is there, just under a different term. If you are a PMC member, and *cannot* alter a line of code without another's consent, then you don't own that code. Your ownership is subservient to another's. You are not a *peer*, but a second-class citizen at this point. The term maintainer in this context is being used as a word for lead. The maintainers are a *gate* for any change. That is not consensus. The proposal attempts to soften that, and turn it into an oligarchy of several maintainers. But the simple fact is that you have some with the ability to set direction, and those who do not. They are called leaders in most contexts, but however you want to slice it... the dynamic creates people with unequal commit ability. But as a PMC member you *are* responsible for it. That is the very basic definition of being a PMC member. You are responsible for all things Spark. responsibility to the community to make sure that their patches are being reviewed and committed. I don't see anywhere in Apache's recommended bylaws anything that says establishing responsibility on paper for specific areas cannot be taken on by different members of the PMC. What's been proposed looks, to me, to be an empirical process, and it looks like it has pretty much a consensus from the side able to give binding votes. I don't think this model establishes any form of ownership over anything at all. 
I also don't see in the process proposal where it mentions that nobody other than the persons responsible for a module can review or commit code. "where each patch to that component needs to get sign-off from at least one of its maintainers" That establishes two types of PMC members: those who require sign-off, and those who don't. Apache is intended to be a group of peers, none more equal than others. That said, we *do* recognize various levels of merit. This is where you see differences between committers, their range of access, and PMC members. But when you hit the *PMC member* role, then you are talking about a legal construct established by the Foundation. You move outside of community norms, and into how the umbrella of the Foundation operates. PMC members are individually responsible for all of the code under their purview, which is then at the direction of the Foundation itself. I'll skip that conversation, and leave it with the simple phrase: as a PMC member, you're responsible for the whole codebase. So following from that, anything that *restricts* your ability to work on that code is a problem. In fact, I'll go as far as to say that since Apache is a meritocracy, the people who have been aligned to the responsibilities probably were aligned based on some sort of merit, correct? Perhaps we could dig in and find out for sure... I'm still getting familiar with the Spark community myself. Once you are a PMC member, then there is no difference in your merit. Merit ends. You're a PMC member, and that is all there is to it. Jane committing 1000 times per month makes her no better than John, who commits 10/month. They are peers on the PMC and have equal rights and responsibility to the codebase. Historically, some PMCs have attempted to create variant levels within the PMC, or create different groups and rights, or different partitions over the code, and ... again, historically: it has failed. This is why Apache stresses consensus. 
The failure modes are crazy and numerous when moving away from that, into silos. ... On Thu, Nov 6, 2014 at 6:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: So I don't understand, Greg, are the partial committers committers, or are they not? Spark also has a PMC, but our PMC currently consists of all committers (we decided not to have a differentiation when we left the incubator). I see the Subversion partial committers listed as committers on https://people.apache.org/committers-by-project.html#subversion, so I assume they are committers. As far as I can see, CloudStack is similar. PMC members are responsible for the code. They provide the oversight, direction, and management. (they're also responsible for the community, but that distinction isn't relevant in this contrasting example) Committers can make changes to the code, with the acknowledgement/agreement/direction of the PMC. When these groups are equal, like Spark, then things are pretty simple. But many communities in Apache define them as disparate.
Re: [VOTE] Designating maintainers for some Spark components
On Thu, Nov 6, 2014 at 7:28 PM, Sandy Ryza sandy.r...@cloudera.com wrote: It looks like the difference between the proposed Spark model and the CloudStack / SVN model is: * In the former, maintainers / partial committers are a way of centralizing oversight over particular components among committers * In the latter, maintainers / partial committers are a way of giving non-committers some power to make changes I can't speak for CloudStack, but for Subversion: yes, you're exactly right, Sandy. We use the partial committer role as a way to bring in new committers. "Great idea, go work there, and have fun." Any PMC member can give a single +1, and that new (partial) committer gets an account and access, and is off and running. We don't even ask for a PMC vote (though we almost always have a brief discussion). The svnrdump tool was written by a *Git* Google Summer of Code student. He wanted a quick way to get a Subversion dumpfile from a remote repository, in order to drop that into Git. We gave him commit access directly into trunk/svnrdump, and he wrote the tool. Technically, he could commit anywhere in our tree, but we just asked him not to without a +1 from a PMC member. Partial committers are a way to *include* people into the [coding] community. And hopefully, over time, they grow into something more. Maintainers are a way (IMO) to *exclude* people from certain commit activity. (or more precisely: limit/restrict, rather than exclude) You can see why it concerns me :-) Cheers, -g
Re: [VOTE] Designating maintainers for some Spark components
Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all full committers. However, I still think that the model proposed here is purely organizational (how the PMC and committers organize themselves), and in no way changes people's ownership or rights. Certainly the reason I proposed it was organizational, to make sure patches get seen by the right people. I believe that every PMC member still has the same responsibility, for two reasons: 1) The PMC is actually what selects the maintainers, so basically this mechanism is a way for the PMC to make sure certain people review each patch. 2) Code changes are all still made by consensus, where any individual has veto power over the code. The maintainer model mentioned here is only meant to make sure that the experts in an area get to see each patch *before* it is merged, and choose whether to exercise their veto power. Let me give a simple example, which is a patch to the Spark core public API. Say I'm a maintainer of this API. Without the maintainer model, the decision on the patch would be made as follows: - Any committer could review the patch and merge it - At any point during this process, I (as the main expert on this) could come in and -1 it, or give feedback - In addition, any other committer beyond me is allowed to -1 this patch With the maintainer model, the process is as follows: - Any committer could review the patch and merge it, but they would need to forward it to me (or another core API maintainer) to make sure we also approve - At any point during this process, I could come in and -1 it, or give feedback - In addition, any other committer beyond me is still allowed to -1 this patch The only change in this model is that committers are responsible for forwarding patches in these areas to certain other committers. 
If every committer had perfect oversight of the project, they could have also seen every patch to their component on their own, but this list ensures that they see it even if they somehow overlooked it. It's true that technically this model might gate development in the sense of adding some latency, but it doesn't gate it any more than consensus as a whole does, where any committer (not even PMC member) can -1 any code change. In fact I believe this will speed development by motivating the maintainers to be active in reviewing their areas and by reducing the chance that mistakes happen that require a revert. I apologize if this wasn't clear in any way, but I do think it's pretty clear in the original wording of the proposal. The sign-off by a maintainer is simply an extra step in the merge process, it does *not* mean that other committers can't -1 a patch, or that the maintainers get to review all patches, or that they somehow have more ownership of the component (since they already had the ability to -1). I also wanted to clarify another thing -- it seems there is a misunderstanding that only PMC members can be maintainers, but this was not the point; the PMC *assigns* maintainers but they can do it out of the whole committer pool (and if we move to separating the PMC from the committers, I fully expect some non-PMC committers to be made maintainers). I hope this clarifies where we're coming from, and why we believe that this still conforms fully with the spirit of Apache (collaborative, open development that anyone can participate in, and meritocracy for project governance). There were some comments made about the maintainers being only some kind of list of people without a requirement to review stuff, but as you can see it's the requirement to review that is the main reason I'm proposing this, to ensure we have an automated process for patches to certain components to be seen. 
If it helps, we may be able to change the wording to something like "it is every committer's responsibility to forward patches for a maintained component to that component's maintainer", instead of using "sign-off". If we don't do this, I'd actually be against any measure that lists some component maintainers without them having a specific responsibility. Apache is not a place for people to gain kudos by having fancier titles given on a website; it's a place for building great communities and software. Matei On Nov 6, 2014, at 9:27 PM, Greg Stein gst...@gmail.com wrote: [snip]
Re: [VOTE] Designating maintainers for some Spark components
[last reply for tonite; let others read; and after the next drink or three, I shouldn't be replying...] On Thu, Nov 6, 2014 at 11:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Alright, Greg, I think I understand how Subversion's model is different, which is that the PMC members are all full committers. However, I still think that the model proposed here is purely organizational (how the PMC and committers organize themselves), and in no way changes people's ownership or rights. That was not my impression, when your proposal said that maintainers need to provide sign-off. Okay. Now my next item of feedback starts here: Certainly the reason I proposed it was organizational, to make sure patches get seen by the right people. [snip] I also wanted to clarify another thing -- it seems there is a misunderstanding that only PMC members can be maintainers, but this was not the point; the PMC *assigns* maintainers but they can do it out of the whole committer pool (and if we move to separating the PMC from the committers, I fully expect some non-PMC committers to be made maintainers). ... and ends here. All of that text is about a process for applying Vetoes. ... That is just the wrong focus (IMO). Back around 2000, in httpd, we ran into vetoes. It was horrible. The community suffered. We actually had a face-to-face at one point, flying in people from around the US, gathering a bunch of the httpd committers to work through some basic problems. The vetoes were flying fast and furious, and it was just the wrong dynamic. Discussion and consensus had been thrown aside. Trust was absent. Peer relationships were ruined. (tho thankfully, our personal relationships never suffered, and that basis helped us pull it back together) Contrast that with Subversion. We've had some vetoes, yes. But invariably, MOST of them would really be considered "whoa. -1 on that. let's talk." Only a few were about somebody laying down the veto hammer. Outside those few, a -1 was always about opening a discussion to fix a particular commit. It looks like you are creating a process to apply vetoes. That seems backwards. It seems like you want a process to ensure that reviews are performed. IMO, all committers/PMC members should begin as *trusted*. Why not? You've already voted them in as committers/PMCers. So trust them. Trust. And that leads to "trust, but verify". The review process. So how about creating a workflow that is focused on what needs to be reviewed
Re: [VOTE] Designating maintainers for some Spark components
Greg, Thanks a lot for commenting on this, but I feel we are splitting hairs here. Matei did mention "-1", followed by "or give feedback". The original process outlined by Matei was exactly about review, rather than fighting. Nobody wants to spend their energy fighting. Everybody is doing it to improve the project. In particular, quoting you in your email: "Be careful here. Responsibility is pretty much a taboo word. All of Apache is a group of volunteers. People can disappear at any point, which is why you need multiple (as my fellow Director warned, on your private list). And multiple people can disappear." Take a look at this page: http://www.apache.org/dev/pmc.html This Project Management Committee Guide outlines the general ***responsibilities*** of PMC members in managing their projects. Are you suggesting the wording used by the PMC guideline itself is taboo? On Thu, Nov 6, 2014 at 11:27 PM, Greg Stein gst...@gmail.com wrote: [snip]
Re: [VOTE] Designating maintainers for some Spark components
With the maintainer model, the process is as follows: - Any committer could review the patch and merge it, but they would need to forward it to me (or another core API maintainer) to make sure we also approve [snip] but this list ensures that they see it even if they somehow overlooked it. Having done the job of playing an informal 'maintainer' of a project myself, this is what I think you really need. The so-called 'maintainers' do one of the below: - Actively poll the lists and watch over contributions, and follow what is repeated often around here: trust, but verify. - Set up automated mechanisms to send all bug-tracker updates for a specific component to a list that people can subscribe to. And/or - Individual contributors send review requests to unofficial 'maintainers' over the dev lists or through tools, like many projects do with review boards and other tools. Note that none of the above is a required step. It must not be; that's the point. But once set as a convention, they will all help you address your concerns about project scalability. Anything else that you add is bestowing privileges on a select few and forming dictatorships. And contrary to what the proposal claims, this is neither scalable nor conforming to Apache governance rules. +Vinod