Re: Performance test for sort shuffle

2015-02-02 Thread Ewan Higgs
Hi Kannan, I have a branch here: https://github.com/ehiggs/spark/tree/terasort The code is in the examples. I don't do any fancy partitioning so it could be made quicker, I'm sure. But it should be a good baseline. I have a WIP PR for spark-perf but I'm having trouble building it there[1]. I
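
For a quick number while the terasort example is still outside spark-perf, a crude micro-benchmark along these lines exercises the sort-based shuffle path (a sketch only, not the code from the branch above; record and partition counts are illustrative, paste into spark-shell):

    import scala.util.Random

    // Random keys force a full shuffle; sortByKey goes through the sort-based
    // shuffle path (spark.shuffle.manager=sort, the default since 1.2).
    val numRecords = 100000000
    val data = sc.parallelize(1 to numRecords, 200)
      .map(i => (Random.nextLong(), i))

    val start = System.nanoTime()
    data.sortByKey().count()
    println(s"Sorted $numRecords records in ${(System.nanoTime() - start) / 1e9} s")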

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 11:13 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x & 1.2.0 2.1. statisti

Can spark provide an option to start reduce stage early?

2015-02-02 Thread Xuelin Cao
In hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start the reduce stage when X% of the mappers are completed. By doing this, the data shuffling process can run in parallel with the map process. In a large multi-tenancy cluster, this option is usually turned off.
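
For readers less familiar with the Hadoop side, the knob being referenced is set like this (a sketch via the Hadoop API; the value is illustrative, and the property can equally be placed in mapred-site.xml):

    import org.apache.hadoop.conf.Configuration

    // Start reducers once 5% of map tasks have completed (illustrative value).
    // Spark currently has no direct equivalent: reduce-stage tasks are scheduled
    // only after all of the parent map stages have finished.
    val hadoopConf = new Configuration()
    hadoopConf.set("mapred.reduce.slowstart.completed.maps", "0.05")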

IDF for ml pipeline

2015-02-02 Thread masaki rikitoku
Hi all I am trying the ml pipeline for text classification now. Recently, I succeeded in executing the pipeline processing in the ml package, which consists of an original Japanese tokenizer, hashingTF, and logisticRegression. Then I failed to execute the pipeline with the IDF in the mllib package directly. To
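
For context, the pipeline shape being described looks roughly like the standard spark.ml text-classification example below (a sketch assuming 1.2-era spark.ml APIs; the built-in Tokenizer stands in for the custom Japanese tokenizer, and mllib's IDF cannot be dropped in directly because it is not an ml Transformer):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Tokenize -> hash to term-frequency vectors -> logistic regression.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // org.apache.spark.mllib.feature.IDF works on an RDD[Vector], not on a pipeline
    // column, so inserting it between hashingTF and lr needs a wrapper Transformer.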

[VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97 The release files, including signatures, digests, etc. can

[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
This is cancelled in favor of RC3. On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell wrote: > The windows issue reported only affects actually running Spark on > Windows (not job submission). However, I agree it's worth cutting a > new RC. I'm going to cancel this vote and propose RC3 with a single

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
The windows issue reported only affects actually running Spark on Windows (not job submission). However, I agree it's worth cutting a new RC. I'm going to cancel this vote and propose RC3 with a single additional patch. Let's try to vote that through so we can ship Spark 1.2.1. - Patrick On Sat,

Temporary jenkins issue

2015-02-02 Thread Patrick Wendell
Hey All, I made a change to the Jenkins configuration that caused most builds to fail (attempting to enable a new plugin); I've reverted the change, effective about 10 minutes ago. If you've seen recent build failures like the one below, this was caused by that change. Sorry about that. ERROR: Publi

Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
To reiterate, I'm asking from an experimental perspective. I'm not proposing we change Spark to build with Pants or anything like that. I'm interested in trying Pants out and I'm wondering if anyone else shares my interest or already has experience with Pants that they can share. On Mon Feb 02 20

Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
I'm asking from an experimental standpoint; this is not happening anytime soon. Of course, if the experiment turns out very well, Pants would replace both sbt and Maven (like it has at Twitter, for example). Pants also works with IDEs. On

Re: Building Spark with Pants

2015-02-02 Thread Stephen Boesch
There is a significant investment in sbt and maven - and they are not at all likely to be going away. A third build tool? Note that there is also the perspective of building within an IDE - which actually works presently for sbt and with a little bit of tweaking with maven as well. 2015-02-02 16:

Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util methods. The case sensitivity issues seem orthogonal, and would be great to be able to control that with a flag. On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov wrote: > Hey Spark developers, > > Is there a good reason fo

Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
Does anyone here have experience with Pants or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic re

[spark-sql] JsonRDD

2015-02-02 Thread Daniil Osipov
Hey Spark developers, Is there a good reason for JsonRDD being a Scala object as opposed to class? Seems most other RDDs are classes, and can be extended. The reason I'm asking is that there is a problem with Hive interoperability with JSON DataFrames where jsonFile generates case sensitive schem
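
A minimal sketch of the interop problem being described (1.2-era API; the path and field names are made up): jsonFile keeps the original casing of JSON fields, while Hive column names are case-insensitive and stored lower-cased in the metastore.

    // Schema inference preserves "firstName" exactly as it appears in the JSON.
    val people = sqlContext.jsonFile("hdfs:///tmp/people.json")
    people.printSchema()
    people.registerTempTable("people")
    sqlContext.sql("SELECT firstName FROM people").collect()   // fine against the temp table
    // Persisting that schema through Hive and reading it back is where the
    // case-sensitive field names clash with Hive's lower-cased column names.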

Performance test for sort shuffle

2015-02-02 Thread Kannan Rajah
Is there a recommended performance test for sort based shuffle? Something similar to terasort on Hadoop. I couldn't find one on the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan

Additional fix for Avro IncompatibleClassChangeError (SPARK-3039)

2015-02-02 Thread M. Dale
SPARK-3039 "Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API" was marked resolved with Spark 1.2.0 release. However, when I download the pre-built Spark distro for Hadoop 2.4 and later (spark-1.2.0-bin-hadoop2.4.tgz) and run it against Avro code compiled agains

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble... Cheng On 2/2/15 2:03 PM, ankits wrote: Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guid

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the "branch-1.2" branch, https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md "Note that if you call schemaRDD.cache() rather tha

Re: Spark Master Maven with YARN build is broken

2015-02-02 Thread Patrick Wendell
It's my fault, I'm sending a hot fix now. On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas wrote: > https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/ > > Is this a known issue? It seems to have been broken since last nigh

Spark Master Maven with YARN build is broken

2015-02-02 Thread Nicholas Chammas
https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/ Is this a known issue? It seems to have been broken since last night. Here’s a snippet from the build output of one of the builds

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually SchemaRDD.cache() behaves exactly the same as cacheTable since Spark 1.2.0. The reason why your web UI didn’t show you the cached table is that both cacheTable and sql("SELECT ...") are lazy :-) Simply add a .collect() after the sql(...) call. Cheng On 2/2/15 12:23 PM, an
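
Putting that together, a minimal sketch in the 1.2-era spark-shell (assuming the small KV case class from the snippet quoted below):

    // Both cacheTable and sql(...) are lazy; an action such as collect() is needed
    // before the cached table shows up in the web UI's Storage tab.
    case class KV(i: Int, s: String)
    import sqlContext.createSchemaRDD        // implicit RDD -> SchemaRDD conversion in 1.2

    val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
    rdd.registerTempTable("kv")
    sqlContext.cacheTable("kv")              // lazy: nothing materialized yet
    sqlContext.sql("SELECT * FROM kv").collect()   // forces the scan and populates the cache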

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Thanks for your response. So AFAICT calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will allow me to see the size of the SchemaRDD in memory, and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD. But

RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi Patrick, Thanks a lot for your detailed explanation. For now we have the following requirements: whitelisting of application submitters, per-user resource (CPU, memory) quotas, and resource allocation in Spark standalone mode. These are quite specific requirements for production use; generally these problem

Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose resource
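
For reference, the per-application controls that standalone mode exposes today are declared by the submitting application itself rather than enforced per user by the master (a sketch; the master URL and values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      .setMaster("spark://master:7077")
      .set("spark.cores.max", "8")           // cap on total cores this application may take
      .set("spark.executor.memory", "4g")    // memory per executor
    val sc = new SparkContext(conf)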

Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users require multi-tenant support in standalone mode, such as multi-user management, resource management and isolation, and whitelisting of users. Seems current Spa