[discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Reynold Xin
I took a look at the commit messages in git log -- it looks like the
individual commit messages are not that useful to include, but do make the
commit messages more verbose. They are usually just a bunch of extremely
concise descriptions of bug fixes, merges, etc:

cb3f12d [xxx] add whitespace
6d874a6 [xxx] support pyspark for yarn-client

89b01f5 [yyy] Update the unit test to add more cases
275d252 [yyy] Address the comments
7cc146d [yyy] Address the comments
2624723 [yyy] Fix rebase conflict
45befaa [yyy] Update the unit test
bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


Anybody against removing those from the merge script so the log looks
cleaner? If nobody feels strongly about this, we can just create a JIRA to
remove them, and only keep the author names.


Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Sean Owen
+1 to removing them. Sometimes there are 50+ commits because people
have been merging from master into their branch rather than rebasing.

On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote:
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:

 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client

 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Ted Yu
+1 to removing commit messages. 



 On Jul 18, 2015, at 1:35 AM, Sean Owen so...@cloudera.com wrote:
 
 +1 to removing them. Sometimes there are 50+ commits because people
 have been merging from master into their branch rather than rebasing.
 
 On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote:
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:
 
cb3f12d [xxx] add whitespace
6d874a6 [xxx] support pyspark for yarn-client
 
89b01f5 [yyy] Update the unit test to add more cases
275d252 [yyy] Address the comments
7cc146d [yyy] Address the comments
2624723 [yyy] Fix rebase conflict
45befaa [yyy] Update the unit test
bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
 
 
 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.
 



Writing to multiple outputs in Spark

2015-07-18 Thread Silas Davis
*tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple
output files based on key, but the plain RDD interface doesn't seem to, and
it should.*

I have been looking into ways to write to multiple outputs in Spark. It
seems like a feature that is somewhat missing from Spark.

The idea is to partition output and write the elements of an RDD to
different locations depending on the key. For example, in a pair RDD
your key may be (language, date, userId) and you would like to write
separate files to $someBasePath/$language/$date. Then there would be a
version of saveAsHadoopDataset that would be able to write to multiple locations
based on key using the underlying OutputFormat. Perhaps it would take a
pair RDD with keys ($partitionKey, $realKey), so for example ((language,
date), userId).
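
To make the key layout concrete, here is a minimal sketch in plain Python (no Spark involved) of splitting a ($partitionKey, $realKey) pair and deriving the output location from the partition key. The function names and the "/data/out" base path are made up for illustration; they are not part of any Spark or Hadoop API.

```python
import posixpath
from collections import defaultdict


def partition_path(base_path, partition_key):
    """Map a (language, date) partition key to base_path/language/date."""
    language, date = partition_key
    return posixpath.join(base_path, language, date)


def group_by_output_path(base_path, pairs):
    """Group the real keys of ((language, date), user_id) pairs by output path."""
    grouped = defaultdict(list)
    for partition_key, real_key in pairs:
        grouped[partition_path(base_path, partition_key)].append(real_key)
    return dict(grouped)


# Hypothetical sample data in the ((language, date), userId) shape.
pairs = [
    (("en", "2015-07-18"), "user1"),
    (("fr", "2015-07-18"), "user2"),
    (("en", "2015-07-18"), "user3"),
]
grouped = group_by_output_path("/data/out", pairs)
```

Only the partition key decides the location; the real key is what would actually be written under that location.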

The prior art I have found on this is the following.

Using SparkSQL:
The 'partitionBy' method of DataFrameWriter:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

This only works for parquet at the moment.

Using Spark/Hadoop:
This pull request (with the hadoop1 API):
https://github.com/apache/spark/pull/4895/files

This uses MultipleTextOutputFormat (which in turn uses
MultipleOutputFormat) which is part of the old hadoop1 API. It only works
for text but could be generalised for any underlying OutputFormat by using
MultipleOutputFormat (but only for hadoop1 - which doesn't support
ParquetAvroOutputFormat for example)

This gist (with the hadoop2 API):
https://gist.github.com/mlehman/df9546f6be2e362bbad2

This uses MultipleOutputs (available for both the old and new hadoop APIs)
and extends saveAsNewHadoopDataset to support multiple outputs. Should work
for any underlying OutputFormat. Probably better implemented by extending
saveAs[NewAPI]HadoopDataset.
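
The general pattern behind writing to multiple outputs can be sketched without Hadoop: keep one writer per output name and open it lazily on first use. The class below is a plain-Python illustration under that assumption. `MultiOutputWriter` and the "part-00000" file name are hypothetical and only loosely mirror what a MultipleOutputs-style writer does.

```python
import os
import tempfile


class MultiOutputWriter:
    """Keeps one open file per output name, created on first write."""

    def __init__(self, base_path, file_name="part-00000"):
        self.base_path = base_path
        self.file_name = file_name
        self._handles = {}

    def write(self, output_name, line):
        handle = self._handles.get(output_name)
        if handle is None:
            # First record for this output: create its directory and file.
            directory = os.path.join(self.base_path, output_name)
            os.makedirs(directory, exist_ok=True)
            handle = open(os.path.join(directory, self.file_name), "w")
            self._handles[output_name] = handle
        handle.write(line + "\n")

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
        return False


# Hypothetical sample data; the base directory is a throwaway temp dir.
base = tempfile.mkdtemp()
records = [
    (("en", "2015-07-18"), "user1"),
    (("fr", "2015-07-18"), "user2"),
    (("en", "2015-07-18"), "user3"),
]
with MultiOutputWriter(base) as writer:
    for (language, date), user_id in records:
        writer.write(os.path.join(language, date), user_id)
```

Records are streamed rather than grouped in memory first, at the cost of holding one open handle per distinct output until close.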

In Cascading:
Cascading provides PartitionTap to do this:
http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html

So my questions are: is there a reason why Spark doesn't provide this? Does
Spark provide similar functionality through some other mechanism? How would
it be best implemented?

Since I started composing this message I've had a go at writing a wrapper
OutputFormat that writes multiple outputs using hadoop MultipleOutputs but
doesn't require modification of the PairRDDFunctions. The principle is
similar, however. Again it feels slightly hacky to use dummy fields for the
ReduceContextImpl, but some of this may be a part of the impedance mismatch
between Spark and plain Hadoop... Here is my attempt:
https://gist.github.com/silasdavis/d1d1f1f7ab78249af462

I'd like to see this functionality in Spark somehow, and I invite suggestions
on how best to achieve it.

Thanks,
Silas


Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Ram Sriharsha
+1 


 On Jul 18, 2015, at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 +1 from me too
 
 On Sat, Jul 18, 2015 at 3:32 AM, Ted Yu yuzhih...@gmail.com wrote:
 +1 to removing commit messages.
 
 
 
 On Jul 18, 2015, at 1:35 AM, Sean Owen so...@cloudera.com wrote:
 
 +1 to removing them. Sometimes there are 50+ commits because people
 have been merging from master into their branch rather than rebasing.
 
 On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote:
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:
 
   cb3f12d [xxx] add whitespace
   6d874a6 [xxx] support pyspark for yarn-client
 
   89b01f5 [yyy] Update the unit test to add more cases
   275d252 [yyy] Address the comments
   7cc146d [yyy] Address the comments
   2624723 [yyy] Fix rebase conflict
   45befaa [yyy] Update the unit test
   bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue
 
 
 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.
 



Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Patrick Wendell
+1 from me too

On Sat, Jul 18, 2015 at 3:32 AM, Ted Yu yuzhih...@gmail.com wrote:
 +1 to removing commit messages.



 On Jul 18, 2015, at 1:35 AM, Sean Owen so...@cloudera.com wrote:

 +1 to removing them. Sometimes there are 50+ commits because people
 have been merging from master into their branch rather than rebasing.

 On Sat, Jul 18, 2015 at 8:48 AM, Reynold Xin r...@databricks.com wrote:
 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:

cb3f12d [xxx] add whitespace
6d874a6 [xxx] support pyspark for yarn-client

89b01f5 [yyy] Update the unit test to add more cases
275d252 [yyy] Address the comments
7cc146d [yyy] Address the comments
2624723 [yyy] Fix rebase conflict
45befaa [yyy] Update the unit test
bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.




Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Reynold Xin
A single commit message consisting of:

1. Pull request title (which includes JIRA number and component, e.g.
[SPARK-1234][MLlib])

2. Pull request description

3. List of authors contributing to the patch

The main thing that changes is 3: we used to also include the individual
commits to the pull request branch that are squashed.
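
As a rough illustration of the three-part message described above, here is a small Python sketch. It is not the merge script itself: `build_squash_message`, its parameters, and the "Author:" line format are assumptions made for the example.

```python
def build_squash_message(title, description, authors, commits=None,
                         include_commits=False):
    """Assemble a squash commit message from pull request metadata.

    Hypothetical helper: title, then description, then the author list.
    Individual commits are only appended when include_commits is True.
    """
    parts = [title.strip()]
    if description.strip():
        parts.append(description.strip())

    # De-duplicate authors while keeping first-seen order.
    unique_authors = []
    for author in authors:
        if author not in unique_authors:
            unique_authors.append(author)
    parts.append("\n".join("Author: %s" % a for a in unique_authors))

    if include_commits and commits:
        parts.append("\n".join("%s [%s] %s" % c for c in commits))
    return "\n\n".join(parts) + "\n"


message = build_squash_message(
    "[SPARK-1234][MLlib] Example title",
    "Example description.",
    ["xxx", "yyy", "xxx"],
    commits=[("cb3f12d", "xxx", "add whitespace")],
)
```

With `include_commits` left at its default, the per-commit lines such as "add whitespace" are dropped and only the author names remain.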


On Sat, Jul 18, 2015 at 3:45 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 Just to clarify, the proposal is to have a single commit msg giving the
 jira and pr id?
 That sounds like a good change to have.

 Regards
 Mridul


 On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote:

 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:

 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client

 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.




Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Mridul Muralidharan
Just to clarify, the proposal is to have a single commit msg giving the
jira and pr id?
That sounds like a good change to have.

Regards
Mridul

On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote:

 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:

 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client

 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.




Re: Expression.resolved unmatched with the correct values in catalyst?

2015-07-18 Thread Ted Yu
What if you move your addition to before line 64 (in the master branch there is
a case for if e.checkInputDataTypes().isFailure):

  case c: Cast if !c.resolved =>

Cheers

On Wed, Jul 15, 2015 at 12:47 AM, Takeshi Yamamuro linguin@gmail.com
wrote:

 Hi, devs

 I found that the case of 'Expression.resolved !=
 (Expression.childrenResolved && checkInputDataTypes().isSuccess)'
 occurs in the output of Analyzer.
 That is, some tests in o.a.s.sql.* fail if the code below is added in
 CheckAnalysis:


 https://github.com/maropu/spark/commit/a488eee8351f5ec49854eef0266e4445269d5867

 Is this a correct behaviour in catalyst?
 If so, could anyone explain when this happens?
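
To spell out the invariant being checked, here is a toy Python model of an expression tree, not Catalyst code. `Expr`, `resolved_override`, and `find_violations` are all hypothetical; the override stands in for a subclass that defines its own notion of resolved, which is only one assumed way such a mismatch could arise.

```python
class Expr:
    """Toy expression node; not the Catalyst Expression class."""

    def __init__(self, name, children=(), types_ok=True, resolved_override=None):
        self.name = name
        self.children = list(children)
        self.types_ok = types_ok
        self.resolved_override = resolved_override

    @property
    def children_resolved(self):
        return all(child.resolved for child in self.children)

    def check_input_data_types(self):
        return self.types_ok

    @property
    def resolved(self):
        # A node may define its own rule (modelled by the override);
        # otherwise the default rule applies.
        if self.resolved_override is not None:
            return self.resolved_override
        return self.children_resolved and self.check_input_data_types()


def find_violations(expr):
    """Names of nodes where resolved != (children_resolved and type check)."""
    found = []
    for child in expr.children:
        found.extend(find_violations(child))
    expected = expr.children_resolved and expr.check_input_data_types()
    if expr.resolved != expected:
        found.append(expr.name)
    return found


cast = Expr("cast", [Expr("attr")], resolved_override=False)
tree = Expr("add", [cast, Expr("lit")])
violations = find_violations(tree)
```

In this toy tree only the node with its own rule breaks the invariant; its parent stays consistent because it derives its state from its children.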

 Thanks,
 takeshi

 --
 ---
 Takeshi Yamamuro (maropu)



If gmail, check spam

2015-07-18 Thread Mridul Muralidharan
https://plus.google.com/+LinusTorvalds/posts/DiG9qANf5PA

I have noticed a bunch of mails from dev@ and github going to spam -
including the Spark mailing list.
It might be a good idea for devs and committers on Gmail to check whether
they are missing things in their spam folder.

Regards,
Mridul




Dynamic resource allocation in Standalone mode

2015-07-18 Thread Dogtail Ray
Hi all,

I am planning to dynamically increase or decrease the number of executors
allocated to an application at runtime. This is similar to dynamic
resource allocation, which is only feasible in Spark on YARN mode. Any
suggestions on how to implement this feature in Standalone mode?

My current problem is: I want to send an ADD_EXECUTOR command from the scheduler
module (in CoarseGrainedSchedulerBackend.scala) to the deploy module (in
Master.scala), but I don't know how to communicate between the two
modules. Many thanks for any suggestions!
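
One generic way to connect two such modules is message passing: the scheduler side puts a request message in the master's inbox instead of calling it directly. The Python sketch below only illustrates that pattern with a thread and a queue. `ToyMaster`, `ToySchedulerBackend`, and `AddExecutors` are hypothetical names, and Spark's own deploy messages and RPC layer are not modelled here.

```python
import queue
import threading


class AddExecutors:
    """Hypothetical request message: add `count` executors for an app."""

    def __init__(self, app_id, count):
        self.app_id = app_id
        self.count = count


class ToyMaster(threading.Thread):
    """Processes messages from its inbox one at a time."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self.executors = {}

    def run(self):
        while True:
            message = self.inbox.get()
            if message is None:  # shutdown signal
                break
            if isinstance(message, AddExecutors):
                current = self.executors.get(message.app_id, 0)
                self.executors[message.app_id] = current + message.count


class ToySchedulerBackend:
    """Holds only a reference to the master's inbox, not to the master."""

    def __init__(self, app_id, master_inbox):
        self.app_id = app_id
        self.master_inbox = master_inbox

    def request_executors(self, count):
        self.master_inbox.put(AddExecutors(self.app_id, count))


master = ToyMaster()
master.start()
backend = ToySchedulerBackend("app-1", master.inbox)
backend.request_executors(2)
backend.request_executors(1)
master.inbox.put(None)
master.join()
```

The backend never touches the master's state; all changes happen on the master's own thread, which is the main reason to prefer messages over direct calls between the two sides.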


Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Mridul Muralidharan
Thanks for detailing, definitely sounds better.
+1

Regards
Mridul

On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote:

 A single commit message consisting of:

 1. Pull request title (which includes JIRA number and component, e.g.
 [SPARK-1234][MLlib])

 2. Pull request description

 3. List of authors contributing to the patch

 The main thing that changes is 3: we used to also include the individual
 commits to the pull request branch that are squashed.


 On Sat, Jul 18, 2015 at 3:45 PM, Mridul Muralidharan mri...@gmail.com wrote:

 Just to clarify, the proposal is to have a single commit msg giving the
 jira and pr id?
 That sounds like a good change to have.

 Regards
 Mridul


 On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote:

 I took a look at the commit messages in git log -- it looks like the
 individual commit messages are not that useful to include, but do make the
 commit messages more verbose. They are usually just a bunch of extremely
 concise descriptions of bug fixes, merges, etc:

 cb3f12d [xxx] add whitespace
 6d874a6 [xxx] support pyspark for yarn-client

 89b01f5 [yyy] Update the unit test to add more cases
 275d252 [yyy] Address the comments
 7cc146d [yyy] Address the comments
 2624723 [yyy] Fix rebase conflict
 45befaa [yyy] Update the unit test
 bbc1c9c [yyy] Fix checkpointing doesn't retain driver port issue


 Anybody against removing those from the merge script so the log looks
 cleaner? If nobody feels strongly about this, we can just create a JIRA to
 remove them, and only keep the author names.





Re: If gmail, check spam

2015-07-18 Thread Ted Yu
Interesting read. 

I did find a lot of Spark mails in Spam folder. 

Thanks Mridul 



 On Jul 18, 2015, at 10:25 AM, Mridul Muralidharan mri...@gmail.com wrote:
 
 https://plus.google.com/+LinusTorvalds/posts/DiG9qANf5PA
 
 I have noticed a bunch of mails from dev@ and github going to spam -
 including spark mailing list.
 Might be a good idea for dev, committers to check if they are missing
 things in their spam folder if on gmail.
 
 Regards,
 Mridul
 