Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-01 Thread Prashant Sharma
An easier and quicker way to build Spark is:

sbt/sbt assembly/assembly
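
To skip modules such as examples or external, sbt's per-project tasks should
also work; a rough sketch (the module names here are from memory, so treat
them as assumptions):

# build only the assembly jar, without the examples assembly
sbt/sbt assembly/assembly
# compile or package a single module, e.g. core
sbt/sbt core/compile
sbt/sbt core/package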

Prashant Sharma




On Mon, Sep 1, 2014 at 8:40 PM, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 If this is not a confirmed regression from 1.0.2, I think it's better to
 report it in a separate thread or JIRA.

 I believe serious regressions are generally the only reason to block a new
 release. Otherwise, if this is an old issue, it should be handled
 separately.

 On Monday, September 1, 2014, chutiumteng@gmail.com wrote:

  I didn't try it with 1.0.2
 
  building the Spark assembly jars always takes too long... more than 20 minutes
 
  [info] Packaging
 
 
 /mnt/some-nfs/common/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
  ...
  [info] Packaging
 
 
 /mnt/some-nfs/common/spark/examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
  ...
  [info] Done packaging.
  [info] Done packaging.
  [success] Total time: 1582 s, completed Sep 1, 2014 1:39:21 PM
 
  is there an easy way to exclude some modules, such as spark/examples or
  spark/external?
 
 
 
 
 
 



RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-09-01 Thread chutium
Thanks a lot, Hao, that finally solved the problem. The changes to CSVSerDe are here:
https://github.com/chutium/csv-serde/commit/22c667c003e705613c202355a8791978d790591e

By the way, ADD JAR in Spark's Hive support or the hive-thriftserver never
works for us, so we build Spark with libraryDependencies += csv-serde ...

Or maybe we should try adding it to SPARK_CLASSPATH?
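
For the record, a rough sketch of both approaches (the jar path is
illustrative, not our actual one; --jars is the standard spark-shell /
spark-submit flag, and SPARK_CLASSPATH still works in 1.x although it is
deprecated):

# option 1: pass the SerDe jar explicitly when starting the shell
./bin/spark-shell --jars /path/to/csv-serde.jar

# option 2: put it on the driver/executor classpath via the env var
export SPARK_CLASSPATH=/path/to/csv-serde.jar
./bin/spark-shell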







Re: Jira tickets for starter tasks

2014-09-01 Thread Josh Rosen
A number of folks have emailed me to add them, but I’ve been unable to find 
their usernames in the Apache JIRA. Note that you need to have an account at
issues.apache.org, which may or may not have the same email / username as your 
accounts on any other Apache systems, including CWiki.  Even if you are an 
Apache committer, you might not have an account on the JIRA unless you’ve 
created one.

Therefore, if you want to be added to the “Contributors” group, I’ll need your 
actual JIRA username, which you can find at 
https://issues.apache.org/jira/secure/ViewProfile.jspa when signed in to JIRA.

Note that you do not need to be a member of the contributors group in order to
open issues.  If you want to be assigned an issue, you can also just comment in 
the issue itself and a JIRA administrator should be able to assign it to you.

On August 29, 2014 at 10:05:54 AM, Josh Rosen (rosenvi...@gmail.com) wrote:
Added you; you should be set!

If anyone else wants me to add them, please email me off-list so that we don’t 
end up flooding the dev list with replies. Thanks!


On August 29, 2014 at 10:03:41 AM, Ron's Yahoo! (zlgonza...@yahoo.com) wrote:

Hi Josh,
Can you add me as well?

Thanks,
Ron

On Aug 28, 2014, at 3:56 PM, Josh Rosen rosenvi...@gmail.com wrote:

 A JIRA admin needs to add you to the “Contributors” role group in order to
 allow you to assign issues to yourself. I’ve added this email address to that 
 group, so you should be set!

 - Josh


 On August 28, 2014 at 3:52:57 PM, Bill Bejeck (bbej...@gmail.com) wrote:

 Hi,

 How do I get a starter-task JIRA ticket assigned to myself? Or do I just do
 the work and issue a pull request with the associated JIRA number?

 Thanks,
 Bill



Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-09-01 Thread Andrew Or
+1. Tested all the basic applications under both deploy modes (where
applicable) in the following environments:

- locally on OS X 10.9
- locally on Windows 8.1
- standalone cluster
- yarn cluster built with Hadoop 2.4

On this front I have observed no regressions, and verified that
standalone-cluster mode is now fixed.



2014-09-01 9:27 GMT-07:00 Prashant Sharma scrapco...@gmail.com:

 An easier and quicker way to build Spark is:

 sbt/sbt assembly/assembly

 Prashant Sharma




 On Mon, Sep 1, 2014 at 8:40 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  If this is not a confirmed regression from 1.0.2, I think it's better to
  report it in a separate thread or JIRA.
 
  I believe serious regressions are generally the only reason to block a
 new
  release. Otherwise, if this is an old issue, it should be handled
  separately.
 
  On Monday, September 1, 2014, chutiumteng@gmail.com wrote:
 
   I didn't try it with 1.0.2
  
   building the Spark assembly jars always takes too long... more than 20
   minutes
  
   [info] Packaging
  
  
 
 /mnt/some-nfs/common/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
   ...
   [info] Packaging
  
  
 
 /mnt/some-nfs/common/spark/examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar
   ...
   [info] Done packaging.
   [info] Done packaging.
   [success] Total time: 1582 s, completed Sep 1, 2014 1:39:21 PM
  
   is there an easy way to exclude some modules, such as spark/examples or
   spark/external?
  
  
  
  
  
  
 



Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
What do people think of running the Big Data Benchmark
https://amplab.cs.berkeley.edu/benchmark/ (repo
https://github.com/amplab/benchmark) as part of preparing every new
release of Spark?

We'd run it just for Spark and effectively use it as another type of test
to track any performance progress or regressions from release to release.

Would doing such a thing be valuable? Do we already have a way of
benchmarking Spark performance that we use regularly?

Nick


Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Matei Zaharia
Hi Nicholas,

At Databricks we already run https://github.com/databricks/spark-perf for each 
release, which is a more comprehensive performance test suite.
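
For anyone who wants to try it, a rough sketch of running it (the entry points
are as I recall them from that repo's README, so double-check there):

git clone https://github.com/databricks/spark-perf
cd spark-perf
cp config/config.py.template config/config.py  # point this at your cluster
bin/run  # runs the test suites enabled in config.py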

Matei

On September 1, 2014 at 8:22:05 PM, Nicholas Chammas 
(nicholas.cham...@gmail.com) wrote:

What do people think of running the Big Data Benchmark  
https://amplab.cs.berkeley.edu/benchmark/ (repo  
https://github.com/amplab/benchmark) as part of preparing every new  
release of Spark?  

We'd run it just for Spark and effectively use it as another type of test  
to track any performance progress or regressions from release to release.  

Would doing such a thing be valuable? Do we already have a way of  
benchmarking Spark performance that we use regularly?  

Nick  


Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
Oh, that's sweet. So, a related question then.

Did those tests pick up the performance issue reported in SPARK-
https://issues.apache.org/jira/browse/SPARK-? Does it make sense to
add a new test to cover that case?


On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Nicholas,

 At Databricks we already run https://github.com/databricks/spark-perf for
 each release, which is a more comprehensive performance test suite.

 Matei

 On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:

 What do people think of running the Big Data Benchmark
 https://amplab.cs.berkeley.edu/benchmark/ (repo
 https://github.com/amplab/benchmark) as part of preparing every new
 release of Spark?

 We'd run it just for Spark and effectively use it as another type of test
 to track any performance progress or regressions from release to release.

 Would doing such a thing be valuable? Do we already have a way of
 benchmarking Spark performance that we use regularly?

 Nick




Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Patrick Wendell
Yeah, this wasn't detected in our performance tests. We even have a
test in PySpark that I would have thought might catch this (it just
schedules a bunch of really small tasks, similar to the regression
case).

https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

Anyways, Josh is trying to repro the regression to see if we can
figure out what is going on. If we find something for sure we should
add a test.

On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Nope, actually, they didn't find that (they found some other things that were 
 fixed, as well as some improvements). Feel free to send a PR, but it would be 
 good to profile the issue first to understand what slowed down. (For example,
 is the map phase taking longer or the reduce phase? Is there some difference
 in the lengths of specific tasks, etc.?)

 Matei

 On September 1, 2014 at 10:03:20 PM, Nicholas Chammas 
 (nicholas.cham...@gmail.com) wrote:

 Oh, that's sweet. So, a related question then.

 Did those tests pick up the performance issue reported in SPARK-? Does it 
 make sense to add a new test to cover that case?


 On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 Hi Nicholas,

 At Databricks we already run https://github.com/databricks/spark-perf for 
 each release, which is a more comprehensive performance test suite.

 Matei

 On September 1, 2014 at 8:22:05 PM, Nicholas Chammas 
 (nicholas.cham...@gmail.com) wrote:

 What do people think of running the Big Data Benchmark
 https://amplab.cs.berkeley.edu/benchmark/ (repo
 https://github.com/amplab/benchmark) as part of preparing every new
 release of Spark?

 We'd run it just for Spark and effectively use it as another type of test
 to track any performance progress or regressions from release to release.

 Would doing such a thing be valuable? Do we already have a way of
 benchmarking Spark performance that we use regularly?

 Nick





Re: Run the Big Data Benchmark for new releases

2014-09-01 Thread Nicholas Chammas
Alright, sounds good! I've created databricks/spark-perf/issues/9
https://github.com/databricks/spark-perf/issues/9 as a reminder for us to
add a new test once we've root-caused SPARK-.


On Tue, Sep 2, 2014 at 1:07 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah, this wasn't detected in our performance tests. We even have a
 test in PySpark that I would have thought might catch this (it just
 schedules a bunch of really small tasks, similar to the regression
 case).


 https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

 Anyways, Josh is trying to repro the regression to see if we can
 figure out what is going on. If we find something for sure we should
 add a test.

 On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Nope, actually, they didn't find that (they found some other things that
 were fixed, as well as some improvements). Feel free to send a PR, but it
 would be good to profile the issue first to understand what slowed down.
 (For example, is the map phase taking longer or the reduce phase? Is there
 some difference in the lengths of specific tasks, etc.?)
 
  Matei
 
  On September 1, 2014 at 10:03:20 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:
 
  Oh, that's sweet. So, a related question then.
 
  Did those tests pick up the performance issue reported in SPARK-?
 Does it make sense to add a new test to cover that case?
 
 
  On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  Hi Nicholas,
 
  At Databricks we already run https://github.com/databricks/spark-perf
 for each release, which is a more comprehensive performance test suite.
 
  Matei
 
  On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (
 nicholas.cham...@gmail.com) wrote:
 
  What do people think of running the Big Data Benchmark
  https://amplab.cs.berkeley.edu/benchmark/ (repo
  https://github.com/amplab/benchmark) as part of preparing every new
  release of Spark?
 
  We'd run it just for Spark and effectively use it as another type of test
  to track any performance progress or regressions from release to release.
 
  Would doing such a thing be valuable? Do we already have a way of
  benchmarking Spark performance that we use regularly?
 
  Nick