Additional fix for Avro IncompatibleClassChangeError (SPARK-3039)

2015-02-02 Thread M. Dale
SPARK-3039 ("Spark assembly for new hadoop API (hadoop 2) contains
avro-mapred for hadoop 1 API") was marked resolved with the Spark 1.2.0
release. However, when I download the pre-built Spark distro for Hadoop 2.4
and later (spark-1.2.0-bin-hadoop2.4.tgz) and run it against Avro code
compiled against Hadoop 2.4/the new Hadoop API, I still get:

java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)


TaskAttemptContext was a class in the Hadoop 1.x series but became an
interface in Hadoop 2.x. Therefore there is an avro-mapred-1.7.6.jar and an
avro-mapred-1.7.6-hadoop2.jar. For Hadoop 2.x, the
avro-mapred-1.7.6-hadoop2.jar is needed.
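
(For reference, a user project reading Avro on Hadoop 2 would typically declare
the dependency with the hadoop2 classifier; this is just an illustrative
snippet using the version mentioned above:)

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <classifier>hadoop2</classifier>
</dependency>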

So it seemed that spark-assembly-1.2.0-hadoop2.4.0.jar still did not contain
the org.apache.avro.mapreduce.AvroRecordReaderBase from 
avro-mapred-1.7.6-hadoop2.jar.


I then downloaded the source code and compiled with:
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package

The hadoop-2.4 profile sets <avro.mapred.classifier>hadoop2</avro.mapred.classifier>,
which through dependency management should pull in the right hadoop2 version:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>${avro.version}</version>
  <classifier>${avro.mapred.classifier}</classifier>
  <exclusions>
    ...

However, same IncompatibleClassChangeError after replacing the assembly jar.

I had cleaned my local ~/.m2/repository before the build and found that for
avro-mapred both 1.7.5 (no extension, i.e. hadoop1) and 1.7.6 (hadoop2) had
been downloaded. That seemed a likely culprit.

After installing the created jar files into my local repo (had to handcopy
poms/jars for repl/yarn subprojects) and then running:

mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests dependency:tree 
-Dincludes=org.apache.avro:avro-mapred


[INFO] Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---

[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]

This showed that hive-exec brought in avro-mapred-1.7.5.jar (hadoop1).

Fix for Spark 1.2.x:

spark-1.2.0/sql/hive/pom.xml

<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>

 Just add the last exclusion for avro-mapred (comparison at 
https://github.com/medale/spark/compare/apache:v1.2.1-rc2...medale:avro-hadoop2-v1.2.1-rc2).

 I was able to build and run against that fix with Avro code.
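
 (One rough way to check which flavor of AvroRecordReaderBase an assembly
 bundles, assuming the JDK's javap is available, is to disassemble it and look
 at how TaskAttemptContext methods are invoked:

 javap -c -cp spark-assembly-1.2.0-hadoop2.4.0.jar \
   org.apache.avro.mapreduce.AvroRecordReaderBase | grep -i TaskAttemptContext

 A hadoop1 build calls TaskAttemptContext methods via invokevirtual, since it
 was compiled against a class, while a hadoop2 build uses invokeinterface.)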

 Fix for current master: https://github.com/apache/spark/pull/4315

 Any feedback much appreciated,
 Markus

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update 
this before releasing 1.2.0 and caused you trouble...


Cheng

On 2/2/15 2:03 PM, ankits wrote:

Great, thank you very much. I was confused because this is in the docs:

https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
branch-1.2 branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md

"Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will not be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case."

If this is no longer accurate, i could make a PR to remove it.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10392.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually SchemaRDD.cache() behaves exactly the same as cacheTable
since Spark 1.2.0. The reason why your web UI didn't show you the cached
table is that both cacheTable and sql("SELECT ...") are lazy :-)
Simply add a .collect() after the sql(...) call.
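
(A minimal sketch of that pattern, assuming sc is the SparkContext and sqc the
SQLContext, with the same KV case class as in the snippet quoted below:)

import sqc.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

case class KV(key: Int, value: String)

val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
rdd.toSchemaRDD.registerTempTable("test")
sqc.cacheTable("test")                          // lazy: nothing cached yet
sqc.sql("SELECT COUNT(*) FROM test").collect()  // forces the scan, so the
                                                // cached table shows up in the UI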


Cheng

On 2/2/15 12:23 PM, ankits wrote:


Thanks for your response. So AFAICT

calling parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.cache().count() will allow me to see the size of
the SchemaRDD in memory,

and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will
show me the size of a regular RDD.

But this will not show us the size when using cacheTable(), right? Like if I
call

parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.registerTempTable("test")
sqc.cacheTable("test")
sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table.





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org





Re: Get size of rdd in memory

2015-02-02 Thread ankits
Great, thank you very much. I was confused because this is in the docs:

https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the
branch-1.2 branch,
https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md

"Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will not be cached using the in-memory
columnar format, and therefore sqlContext.cacheTable(...) is strongly
recommended for this use case."

If this is no longer accurate, i could make a PR to remove it.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10392.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Performance test for sort shuffle

2015-02-02 Thread Kannan Rajah
Is there a recommended performance test for sort based shuffle? Something
similar to terasort on Hadoop. I couldn't find one on the spark-perf code
base.

https://github.com/databricks/spark-perf

--
Kannan


Re: Spark Master Maven with YARN build is broken

2015-02-02 Thread Patrick Wendell
It's my fault, I'm sending a hot fix now.

On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

 Is this is a known issue? It seems to have been broken since last night.

 Here's a snippet from the build output of one of the builds
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1308/console
 :

 [error] bad symbolic reference. A signature in WebUI.class refers to
 term eclipse
 [error] in package org which is not available.
 [error] It may be completely missing from the current classpath, or
 the version on
 [error] the classpath might be incompatible with the version used when
 compiling WebUI.class.
 [error] bad symbolic reference. A signature in WebUI.class refers to term 
 jetty
 [error] in value org.eclipse which is not available.
 [error] It may be completely missing from the current classpath, or
 the version on
 [error] the classpath might be incompatible with the version used when
 compiling WebUI.class.
 [error]
 [error]  while compiling:
 /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
 [error] during phase: erasure
 [error]  library version: version 2.10.4
 [error] compiler version: version 2.10.4

 Nick


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[spark-sql] JsonRDD

2015-02-02 Thread Daniil Osipov
Hey Spark developers,

Is there a good reason for JsonRDD being a Scala object as opposed to
class? Seems most other RDDs are classes, and can be extended.

The reason I'm asking is that there is a problem with Hive interoperability
with JSON DataFrames: jsonFile generates a case-sensitive schema, while Hive
expects case-insensitive names and fails with an exception during
saveAsTable if there are two columns with the same name in different case.

I'm trying to resolve the problem, but that requires me to extend JsonRDD,
which I can't do. Other RDDs are subclass friendly, why is JsonRDD
different?

Dan


Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
I'm asking from an experimental standpoint; this is not happening anytime
soon.

Of course, if the experiment turns out very well, Pants would replace both
sbt and Maven (like it has at Twitter, for example). Pants also works with
IDEs http://pantsbuild.github.io/index.html#using-pants-with.

On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote:

 There is a significant investment in sbt and maven - and they are not at
 all likely to be going away. A third build tool?  Note that there is also
 the perspective of building within an IDE - which actually works presently
 for sbt and with a little bit of tweaking with maven as well.

 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components
 in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim
 it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter

 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
 
- Getting Started with the Pants Build System: Why Pants?

 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants
 



 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick




Re: Building Spark with Pants

2015-02-02 Thread Stephen Boesch
There is a significant investment in sbt and maven - and they are not at
all likely to be going away. A third build tool?  Note that there is also
the perspective of building within an IDE - which actually works presently
for sbt and with a little bit of tweaking with maven as well.

2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter

 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
 
- Getting Started with the Pants Build System: Why Pants?

 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants
 

 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick



Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
methods.

The case sensitivity issues seem orthogonal, and would be great to be able
to control that with a flag.


On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:

 Hey Spark developers,

 Is there a good reason for JsonRDD being a Scala object as opposed to
 class? Seems most other RDDs are classes, and can be extended.

 The reason I'm asking is that there is a problem with Hive interoperability
 with JSON DataFrames where jsonFile generates case sensitive schema, while
 Hive expects case insensitive and fails with an exception during
 saveAsTable if there are two columns with the same name in different case.

 I'm trying to resolve the problem, but that requires me to extend JsonRDD,
 which I can't do. Other RDDs are subclass friendly, why is JsonRDD
 different?

 Dan



Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
To reiterate, I'm asking from an experimental perspective. I'm not
proposing we change Spark to build with Pants or anything like that.

I'm interested in trying Pants out and I'm wondering if anyone else shares
my interest or already has experience with Pants that they can share.

On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'm asking from an experimental standpoint; this is not happening anytime
 soon.

 Of course, if the experiment turns out very well, Pants would replace both
 sbt and Maven (like it has at Twitter, for example). Pants also works
 with IDEs http://pantsbuild.github.io/index.html#using-pants-with.

 On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote:

 There is a significant investment in sbt and maven - and they are not at
 all likely to be going away. A third build tool?  Note that there is also
 the perspective of building within an IDE - which actually works presently
 for sbt and with a little bit of tweaking with maven as well.

 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components
 in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim
 it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter
 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
- Getting Started with the Pants Build System: Why Pants?
 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants



 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick




Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
Does anyone here have experience with Pants
http://pantsbuild.github.io/index.html or interest in trying to build
Spark with it?

Pants has an interesting story. It was born at Twitter to help them build
their Scala, Java, and Python projects as several independent components in
one monolithic repo. (It was inspired by a similar build tool at Google
called blaze.) The mix of languages and sub-projects at Twitter seems
similar to the breakdown we have in Spark.

Pants has an interesting take on how a build system should work, and
Twitter and Foursquare (who use Pants as their primary build tool) claim it
helps enforce better build hygiene and maintainability.

Some relevant talks:

   - Building Scala Hygienically with Pants
   https://www.youtube.com/watch?v=ukqke8iTuH0
   - The Pants Build Tool at Twitter
   
https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
   - Getting Started with the Pants Build System: Why Pants?
   
https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants

At some point I may take a shot at converting Spark to use Pants as an
experiment and just see what it’s like.

Nick


Temporary jenkins issue

2015-02-02 Thread Patrick Wendell
Hey All,

I made a change to the Jenkins configuration that caused most builds
to fail (attempting to enable a new plugin), I've reverted the change
effective about 10 minutes ago.

If you've seen recent build failures like below, this was caused by
that change. Sorry about that.


ERROR: Publisher
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver
aborted due to exception
java.lang.NoSuchMethodError:
hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
at 
com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.<init>(FlakyTestResultAction.java:78)
at 
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
at hudson.model.Build$BuildExecution.post2(Build.java:183)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
at hudson.model.Run.execute(Run.java:1784)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:89)
at hudson.model.Executor.run(Executor.java:240)


- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 11:13 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, mllib - running as well as comparing results with 1.1.x and
1.2.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed : org.apache.spark.SparkException in zip !
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lmbda) with itertools
OK
3. Scala - MLLib
3.1. statistics (min,max,mean,Pearson,Spearman) OK
3.2. LinearRegressionWIthSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK

Cheers
k/


On Mon, Feb 2, 2015 at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.2.1!

 The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1065/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

 Changes from rc2:
 A single patch fixing a windows issue.

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, February 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
This is cancelled in favor of RC3.

On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell pwend...@gmail.com wrote:
 The windows issue reported only affects actually running Spark on
 Windows (not job submission). However, I agree it's worth cutting a
 new RC. I'm going to cancel this vote and propose RC3 with a single
 additional patch. Let's try to vote that through so we can ship Spark
 1.2.1.

 - Patrick

 On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 This looks like a pretty serious problem, thanks! Glad people are testing on 
 Windows.

 Matei

 On Jan 31, 2015, at 11:57 AM, MartinWeindel martin.wein...@gmail.com 
 wrote:

 FYI: Spark 1.2.1rc2 does not work on Windows!

 On creating a Spark context you get following log output on my Windows
 machine:
 INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
 ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
 C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
 ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
 local dir.

 I have already located the cause. A newly added function chmod700() in
 org.apache.spark.util.Utils uses functionality which only works on a Unix file
 system.

 See also pull request [https://github.com/apache/spark/pull/4299] for my
 suggestion how to resolve the issue.

 Best regards,

 Martin Weindel



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
 Sent from the Apache Spark Developers List mailing list archive at 
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
The windows issue reported only affects actually running Spark on
Windows (not job submission). However, I agree it's worth cutting a
new RC. I'm going to cancel this vote and propose RC3 with a single
additional patch. Let's try to vote that through so we can ship Spark
1.2.1.

- Patrick

On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 This looks like a pretty serious problem, thanks! Glad people are testing on 
 Windows.

 Matei

 On Jan 31, 2015, at 11:57 AM, MartinWeindel martin.wein...@gmail.com wrote:

 FYI: Spark 1.2.1rc2 does not work on Windows!

 On creating a Spark context you get following log output on my Windows
 machine:
 INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
 ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in
 C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
 ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any
 local dir.

 I have already located the cause. A newly added function chmod700() in
 org.apache.spark.util.Utils uses functionality which only works on a Unix file
 system.

 See also pull request [https://github.com/apache/spark/pull/4299] for my
 suggestion how to resolve the issue.
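
 (For illustration only: owner-only permissions can be set in a cross-platform
 way with plain java.io.File setters. This is a sketch, not necessarily what
 the pull request does.)

 import java.io.File

 def chmod700(file: File): Boolean = {
   // Revoke each permission for everyone, then re-grant it to the owner only.
   file.setReadable(false, false) && file.setReadable(true, true) &&
   file.setWritable(false, false) && file.setWritable(true, true) &&
   file.setExecutable(false, false) && file.setExecutable(true, true)
 }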

 Best regards,

 Martin Weindel



 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10370.html
 Sent from the Apache Spark Developers List mailing list archive at 
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2:
A single patch fixing a windows issue.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, February 06, at 05:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



IDF for ml pipeline

2015-02-02 Thread masaki rikitoku
Hi all

I am trying the ml pipeline for text classification now.

Recently, I succeeded in executing the pipeline processing in the ml package,
which consists of our original Japanese tokenizer, HashingTF, and
LogisticRegression.
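
(Roughly what such a pipeline looks like with the spark.ml alpha API in 1.2;
JapaneseTokenizer below stands in for the custom tokenizer, the rest is the
stock API:)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.classification.LogisticRegression

// JapaneseTokenizer is the custom Transformer mentioned above (not shown here).
val tokenizer = new JapaneseTokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)  // training: SchemaRDD with "label" and "text" columns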

Then, I failed to execute the pipeline with the IDF in the mllib package directly.

To use the IDF feature in the ml package, do I have to implement a wrapper
for IDF in the ml package, like the HashingTF one?

best

Masaki Rikitoku

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Performance test for sort shuffle

2015-02-02 Thread Ewan Higgs

Hi Kannan,
I have a branch here:

https://github.com/ehiggs/spark/tree/terasort

The code is in the examples. I don't do any fancy partitioning so it 
could be made quicker, I'm sure. But it should be a good baseline.


I have a WIP PR for spark-perf but I'm having trouble building it 
there[1]. I put it on the back burner until someone can get back to me 
on it.


Yours,
Ewan Higgs

[1] 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSpark-perf-terasort-WIP-branch-tt10105.html


On 02/02/15 23:26, Kannan Rajah wrote:

Is there a recommended performance test for sort based shuffle? Something
similar to terasort on Hadoop. I couldn't find one on the spark-perf code
base.

https://github.com/databricks/spark-perf

--
Kannan




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Get size of rdd in memory

2015-02-02 Thread ankits
Thanks for your response. So AFAICT 

calling parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.cache().count() will allow me to see the size of
the SchemaRDD in memory,

and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will
show me the size of a regular RDD.

But this will not show us the size when using cacheTable(), right? Like if I
call

parallelize(1 to 1024).map(i => KV(i,
i.toString)).toSchemaRDD.registerTempTable("test")
sqc.cacheTable("test")
sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table. 





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Get-size-of-rdd-in-memory-tp10366p10388.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Can spark provide an option to start reduce stage early?

2015-02-02 Thread Xuelin Cao
In Hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps*
which can be used to start the reduce stage when X% of the mappers have
completed. By doing this, the data shuffling process is able to run in
parallel with the map process.
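
(For reference, in Hadoop MR this is just a job or cluster configuration
property; an illustrative setting that lets reducers launch once 80% of the
maps have finished might look like:)

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>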

In a large multi-tenancy cluster, this option is usually turned off. But in
some cases, turning on the option could accelerate some high-priority jobs.

Will spark provide similar option?


Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all,

I have some questions about the future development of Spark's standalone
resource scheduler. We've heard that some users require multi-tenant support in
standalone mode, such as multi-user management, resource management and
isolation, and whitelisting of users. Current Spark standalone does not seem to
support this kind of functionality, while resource schedulers like YARN offer
such advanced management. I'm not sure what the future target of the standalone
resource scheduler is: will it only target a simple implementation and shift
advanced usage to YARN, or is there a plan to add some simple multi-tenant
functionality?

Thanks a lot for your comments.

BR
Jerry


Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry,

I think standalone mode will still add more features over time, but
the goal isn't really for it to become equivalent to what Mesos/YARN
are today. Or at least, I doubt Spark Standalone will ever attempt to
manage _other_ frameworks outside of Spark and become a general
purpose resource manager.

In terms of having better support for multi tenancy, meaning multiple
*Spark* instances, this is something I think could be in scope in the
future. For instance, we added H/A to the standalone scheduler a while
back, because it let us support H/A streaming apps in a totally native
way. It's a trade off of adding new features and keeping the scheduler
very simple and easy to use. We've tended to bias towards simplicity
as the main goal, since this is something we want to be really easy
out of the box.

One thing to point out, a lot of people use the standalone mode with
some coarser grained scheduler, such as running in a cloud service. In
this case they really just want a simple inner cluster manager. This
may even be the majority of all Spark installations. This is slightly
different than Hadoop environments, where they might just want nice
integration into the existing Hadoop stack via something like YARN.

- Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi all,



 I have some questions about the future development of Spark's standalone
 resource scheduler. We've heard some users have the requirements to have
 multi-tenant support in standalone mode, like multi-user management,
 resource management and isolation, whitelist of users. Seems current Spark
 standalone do not support such kind of functionalities, while resource
 schedulers like Yarn offers such kind of advanced managements, I'm not sure
 what's the future target of standalone resource scheduler, will it only
 target on simple implementation, and for advanced usage shift to YARN? Or
 will it plan to add some simple multi-tenant related functionalities?



 Thanks a lot for your comments.



 BR

 Jerry

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi Patrick,

Thanks a lot for your detailed explanation. For now we have the following
requirements: whitelisting of application submitters, per-user resource (CPU,
memory) quotas, and resource allocation in Spark standalone mode. These are
quite specific production-use requirements; generally the question becomes
whether we need to offer a more advanced resource scheduler compared to the
current simple FIFO one. Our aim is not to provide a general resource scheduler
like Mesos/YARN (we only support Spark), but we hope to add some Mesos/YARN
functionality to make better use of Spark standalone mode.

I admit that a resource scheduler may have some overlap with a cloud manager;
whether to offer a powerful scheduler or rely on a cloud manager is really a
dilemma.

I think we can break this down into some small features to improve standalone
mode. What's your opinion?

Thanks
Jerry

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Monday, February 2, 2015 4:49 PM
To: Shao, Saisai
Cc: dev@spark.apache.org; u...@spark.apache.org
Subject: Re: Questions about Spark standalone resource scheduler

Hey Jerry,

I think standalone mode will still add more features over time, but the goal 
isn't really for it to become equivalent to what Mesos/YARN are today. Or at 
least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks 
outside of Spark and become a general purpose resource manager.

In terms of having better support for multi tenancy, meaning multiple
*Spark* instances, this is something I think could be in scope in the future. 
For instance, we added H/A to the standalone scheduler a while back, because it 
let us support H/A streaming apps in a totally native way. It's a trade off of 
adding new features and keeping the scheduler very simple and easy to use. 
We've tended to bias towards simplicity as the main goal, since this is 
something we want to be really easy out of the box.

One thing to point out, a lot of people use the standalone mode with some 
coarser grained scheduler, such as running in a cloud service. In this case 
they really just want a simple inner cluster manager. This may even be the 
majority of all Spark installations. This is slightly different than Hadoop 
environments, where they might just want nice integration into the existing 
Hadoop stack via something like YARN.

- Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote:
 Hi all,



 I have some questions about the future development of Spark's 
 standalone resource scheduler. We've heard some users have the 
 requirements to have multi-tenant support in standalone mode, like 
 multi-user management, resource management and isolation, whitelist of 
 users. Seems current Spark standalone do not support such kind of 
 functionalities, while resource schedulers like Yarn offers such kind 
 of advanced managements, I'm not sure what's the future target of 
 standalone resource scheduler, will it only target on simple 
 implementation, and for advanced usage shift to YARN? Or will it plan to add 
 some simple multi-tenant related functionalities?



 Thanks a lot for your comments.



 BR

 Jerry

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org