[discuss] Removing individual commit messages from the squash commit message
I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges, etc:
  cb3f12d [xxx] add whitespace
  6d874a6 [xxx] support pyspark for yarn-client
  89b01f5 [yyy] Update the unit test to add more cases
  275d252 [yyy] Address the comments
  7cc146d [yyy] Address the comments
  2624723 [yyy] Fix rebase conflict
  45befaa [yyy] Update the unit test
  bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
Anybody against removing those from the merge script so the log looks cleaner? If nobody feels strongly about this, we can just create a JIRA to remove them, and only keep the author names.
Re: [discuss] Removing individual commit messages from the squash commit message
+1 to removing them. Sometimes there are 50+ commits because people have been merging from master into their branch rather than rebasing.
On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote: [...]
- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [discuss] Removing individual commit messages from the squash commit message
+1 to removing commit messages.
On Jul 18, 2015, at 1:35 AM, Sean Owen so...@cloudera.com wrote: [...]
Writing to multiple outputs in Spark
*tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple output files based on key, but the plain RDD interface doesn't seem to, and it should.*
I have been looking into ways to write to multiple outputs in Spark. It seems like a feature that is somewhat missing from Spark. The idea is to partition output and write the elements of an RDD to different locations depending on the key. For example, in a pair RDD your key may be (language, date, userId) and you would like to write separate files to $someBasePath/$language/$date. There would then be a version of saveAsHadoopDataset that could write to multiple locations based on key using the underlying OutputFormat. Perhaps it would take a pair RDD with keys ($partitionKey, $realKey), so for example ((language, date), userId).
The prior art I have found on this is the following.
Using SparkSQL: the 'partitionBy' method of DataFrameWriter: https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter This only works for Parquet at the moment.
Using Spark/Hadoop:
- This pull request (with the hadoop1 API): https://github.com/apache/spark/pull/4895/files. This uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat), which is part of the old hadoop1 API. It only works for text but could be generalised to any underlying OutputFormat by using MultipleOutputFormat (but only for hadoop1, which doesn't support ParquetAvroOutputFormat, for example).
- This gist (with the hadoop2 API): https://gist.github.com/mlehman/df9546f6be2e362bbad2 This uses MultipleOutputs (available for both the old and new hadoop APIs) and extends saveAsNewAPIHadoopDataset to support multiple outputs. It should work for any underlying OutputFormat. Probably better implemented by extending saveAs[NewAPI]HadoopDataset.
In Cascading: Cascading provides PartitionTap to do this: http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
So my questions are: is there a reason why Spark doesn't provide this? Does Spark provide similar functionality through some other mechanism? How would it be best implemented?
Since I started composing this message I've had a go at writing a wrapper OutputFormat that writes multiple outputs using hadoop MultipleOutputs but doesn't require modification of PairRDDFunctions. The principle is similar, however. Again it feels slightly hacky to use dummy fields for the ReduceContextImpl, but some of this may be part of the impedance mismatch between Spark and plain Hadoop...
Here is my attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
I'd like to see this functionality in Spark somehow, but invite suggestions on how best to achieve it.
Thanks, Silas
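To make the requested behaviour concrete, here is a minimal sketch in plain Python. It is not Spark or Hadoop code: it only illustrates the idea of routing records keyed by ((language, date), userId) to $someBasePath/$language/$date. The function name write_partitioned and the part-00000 file name are assumptions chosen for illustration, not part of any existing API.

```python
import os
import tempfile
from collections import defaultdict


def write_partitioned(records, base_path):
    """Write each value under base_path/<language>/<date>/, chosen by its partition key.

    `records` is an iterable of ((language, date), user_id) pairs, mirroring the
    ($partitionKey, $realKey) layout described in the message above.
    Returns a dict mapping each partition key to the file that was written.
    """
    # Group values by partition key, keeping their original order.
    grouped = defaultdict(list)
    for (language, date), user_id in records:
        grouped[(language, date)].append(user_id)

    written = {}
    for (language, date), user_ids in grouped.items():
        out_dir = os.path.join(base_path, language, date)
        os.makedirs(out_dir, exist_ok=True)
        # Hypothetical file name; a real OutputFormat would pick its own.
        out_file = os.path.join(out_dir, "part-00000")
        with open(out_file, "w") as f:
            for user_id in user_ids:
                f.write(f"{user_id}\n")
        written[(language, date)] = out_file
    return written


records = [
    (("en", "2015-07-18"), "user1"),
    (("fr", "2015-07-18"), "user2"),
    (("en", "2015-07-18"), "user3"),
]
base = tempfile.mkdtemp()
paths = write_partitioned(records, base)
```

In a distributed setting each task would write its own part file per partition key rather than grouping everything in memory, which is roughly the job that MultipleOutputs or a wrapper OutputFormat takes on.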
Re: [discuss] Removing individual commit messages from the squash commit message
+1
On Jul 18, 2015, at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote: [...]
Re: [discuss] Removing individual commit messages from the squash commit message
+1 from me too
On Sat, Jul 18, 2015 at 3:32 AM, Ted Yu yuzhih...@gmail.com wrote: [...]
Re: [discuss] Removing individual commit messages from the squash commit message
A single commit message consisting of:
1. Pull request title (which includes JIRA number and component, e.g. [SPARK-1234][MLlib])
2. Pull request description
3. List of authors contributing to the patch
The main thing that changes is 3: we used to also include the individual commits to the pull request branch that are squashed.
On Sat, Jul 18, 2015 at 3:45 PM, Mridul Muralidharan mri...@gmail.com wrote: [...]
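The three parts listed above can be sketched as a small Python helper. This is an illustration only, not the actual merge script: the function name build_squash_message, the "Author:" line format and the sample title and description are all assumptions.

```python
def build_squash_message(pr_title, pr_description, authors):
    """Assemble a squash commit message from title, description and author list.

    Deliberately omits the individual squashed commits, as proposed in the thread.
    """
    parts = [pr_title.strip()]

    description = pr_description.strip()
    if description:
        parts.append(description)

    # Keep each author once, in order of first appearance.
    unique_authors = list(dict.fromkeys(authors))
    # Hypothetical formatting of the author list.
    parts.append("\n".join(f"Author: {name}" for name in unique_authors))

    # Separate the three parts with blank lines, as git expects after the title.
    return "\n\n".join(parts)


message = build_squash_message(
    "[SPARK-1234][MLlib] Example title",
    "Example description of the change.",
    ["xxx", "yyy", "yyy"],
)
print(message)
```

With the per-commit lines gone, the length of the message no longer depends on how many times a contributor merged master into the branch.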
Re: [discuss] Removing individual commit messages from the squash commit message
Just to clarify, the proposal is to have a single commit msg giving the jira and pr id? That sounds like a good change to have. Regards Mridul
On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote: [...]
Re: Expression.resolved unmatched with the correct values in catalyst?
What if you move your addition to before line 64 (in the master branch there is a case for if e.checkInputDataTypes().isFailure): case c: Cast if !c.resolved =>
Cheers
On Wed, Jul 15, 2015 at 12:47 AM, Takeshi Yamamuro linguin@gmail.com wrote:
Hi devs, I found that the case 'Expression.resolved != (Expression.childrenResolved && checkInputDataTypes().isSuccess)' occurs in the output of the Analyzer. That is, some tests in o.a.s.sql.* fail if the code below is added to CheckAnalysis: https://github.com/maropu/spark/commit/a488eee8351f5ec49854eef0266e4445269d5867 Is this correct behaviour in Catalyst? If so, could anyone explain when this happens?
Thanks, takeshi
-- Takeshi Yamamuro (maropu)
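As a toy illustration of how such a mismatch can arise in general, the Python model below defines a default resolved as "children resolved and input types check out", then a subclass that overrides resolved with its own condition. It is not Catalyst code, and the thread does not confirm that this is what happens for Cast; the class names and the extra_condition flag are hypothetical.

```python
class Expression:
    """Toy stand-in for an expression tree node (not the Catalyst class)."""

    def __init__(self, *children):
        self.children = children

    def children_resolved(self):
        return all(child.resolved() for child in self.children)

    def check_input_data_types_ok(self):
        # The toy model accepts every input type.
        return True

    def resolved(self):
        # Default definition: the invariant being checked in the thread.
        return self.children_resolved() and self.check_input_data_types_ok()


class Leaf(Expression):
    """A node with no children; resolved by the default rule."""


class CustomResolved(Expression):
    """Hypothetical node that overrides resolved with an extra condition."""

    def __init__(self, child, extra_condition):
        super().__init__(child)
        self.extra_condition = extra_condition

    def resolved(self):
        return self.children_resolved() and self.extra_condition


def invariant_holds(expr):
    """True when resolved matches the default definition for this node."""
    expected = expr.children_resolved() and expr.check_input_data_types_ok()
    return expr.resolved() == expected


plain = Leaf()
overriding = CustomResolved(Leaf(), extra_condition=False)
print(invariant_holds(plain), invariant_holds(overriding))
```

Any node type that redefines resolved would need its own case in a check like the one added to CheckAnalysis, which is the shape of the `case c: Cast if !c.resolved` suggestion.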
If gmail, check spam
https://plus.google.com/+LinusTorvalds/posts/DiG9qANf5PA I have noticed a bunch of mails from dev@ and GitHub going to spam, including mail from the Spark mailing list. It might be a good idea for devs and committers on Gmail to check whether they are missing things in their spam folder. Regards, Mridul
Dynamic resource allocation in Standalone mode
Hi all, I am planning to dynamically increase or decrease the number of executors allocated to an application at runtime. This is similar to dynamic resource allocation, which is currently only available when running Spark on YARN. Any suggestions on how to implement this feature in Standalone mode? My current problem is: I want to send an ADD_EXECUTOR command from the scheduler module (in CoarseGrainedSchedulerBackend.scala) to the deploy module (in Master.scala), but I don't know how to communicate between the two modules. Many thanks for any suggestions!
Re: [discuss] Removing individual commit messages from the squash commit message
Thanks for detailing, definitely sounds better. +1 Regards Mridul
On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote: [...]
Re: If gmail, check spam
Interesting read. I did find a lot of Spark mails in my Spam folder. Thanks, Mridul
On Jul 18, 2015, at 10:25 AM, Mridul Muralidharan mri...@gmail.com wrote: [...]