Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
Easy or quicker way to build Spark is sbt/sbt assembly/assembly.

Prashant Sharma

On Mon, Sep 1, 2014 at 8:40 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

If this is not a confirmed regression from 1.0.2, I think it's better to report it in a separate thread or JIRA. I believe serious regressions are generally the only reason to block a new release. Otherwise, if this is an old issue, it should be handled separately.

On Monday, September 1, 2014, chutiumteng@gmail.com wrote:

I didn't try it with 1.0.2; it always takes too long to build the Spark assembly jars... more than 20 min.

[info] Packaging /mnt/some-nfs/common/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar ...
[info] Packaging /mnt/some-nfs/common/spark/examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar ...
[info] Done packaging.
[info] Done packaging.
[success] Total time: 1582 s, completed Sep 1, 2014 1:39:21 PM

Is there an easy way to exclude some modules, such as spark/examples or spark/external?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8163.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
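[Editor's note: the tip above scopes the assembly task to a single sub-project. A minimal command sketch of what that looks like in practice (run from the Spark source root; this only restates the command given in the thread, it is not a documented way to fully exclude modules from a distribution build):]

```shell
# Build only the assembly sub-project's assembly jar,
# rather than running `assembly` across every module:
sbt/sbt assembly/assembly

# Keeping sbt resident and re-running the same scoped task
# makes subsequent builds incremental and much faster:
sbt/sbt
# > assembly/assembly
```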
RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...
Thanks a lot, Hao, that finally solved this problem. The changes to CSVSerDe are here: https://github.com/chutium/csv-serde/commit/22c667c003e705613c202355a8791978d790591e

By the way, add jar in spark hive or hive-thriftserver never works for us; we build Spark with libraryDependencies += csv-serde ... Or maybe we should try adding it to SPARK_CLASSPATH?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/HiveContext-schemaRDD-printSchema-get-different-dataTypes-feature-or-a-bug-really-strange-and-surpri-tp8035p8166.html
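[Editor's note: for reference, two common ways to put an extra SerDe jar on the driver/executor classpath in this era of Spark. The jar path below is hypothetical; adjust it to your deployment. This is a sketch, not the method the poster confirmed working:]

```shell
# Option 1: prepend the jar via SPARK_CLASSPATH before launching
# (deprecated in later Spark versions, but common around 1.x):
export SPARK_CLASSPATH=/mnt/some-nfs/common/jars/csv-serde.jar

# Option 2: ship the jar at submit time with --jars:
./bin/spark-submit --jars /mnt/some-nfs/common/jars/csv-serde.jar your-app.jar
```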
Re: Jira tickets for starter tasks
A number of folks have emailed me to add them, but I've been unable to find their usernames in the Apache JIRA. Note that you need to have an account at issues.apache.org, which may or may not have the same email / username as your accounts on any other Apache systems, including CWiki. Even if you are an Apache committer, you might not have an account on the JIRA unless you've created one. Therefore, if you want to be added to the "Contributors" group, I'll need your actual JIRA username, which you can find at https://issues.apache.org/jira/secure/ViewProfile.jspa when signed in to JIRA.

Note that you do not need to be a member of the contributors group in order to open issues. If you want to be assigned an issue, you can also just comment on the issue itself and a JIRA administrator should be able to assign it to you.

On August 29, 2014 at 10:05:54 AM, Josh Rosen (rosenvi...@gmail.com) wrote:

Added you; you should be set! If anyone else wants me to add them, please email me off-list so that we don't end up flooding the dev list with replies. Thanks!

On August 29, 2014 at 10:03:41 AM, Ron's Yahoo! (zlgonza...@yahoo.com) wrote:

Hi Josh, Can you add me as well? Thanks, Ron

On Aug 28, 2014, at 3:56 PM, Josh Rosen rosenvi...@gmail.com wrote:

A JIRA admin needs to add you to the "Contributors" role group in order to allow you to assign issues to yourself. I've added this email address to that group, so you should be set! - Josh

On August 28, 2014 at 3:52:57 PM, Bill Bejeck (bbej...@gmail.com) wrote:

Hi, how do I get a starter-task JIRA ticket assigned to myself? Or do I just do the work and issue a pull request with the associated JIRA number? Thanks, Bill
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1. Tested all the basic applications under both deploy modes (where applicable) in the following environments:
- locally on OS X 10.9
- locally on Windows 8.1
- standalone cluster
- yarn cluster built with Hadoop 2.4

On this front I have observed no regressions, and verified that standalone-cluster mode is now fixed.

2014-09-01 9:27 GMT-07:00 Prashant Sharma scrapco...@gmail.com:

Easy or quicker way to build Spark is sbt/sbt assembly/assembly.

Prashant Sharma

On Mon, Sep 1, 2014 at 8:40 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

If this is not a confirmed regression from 1.0.2, I think it's better to report it in a separate thread or JIRA. I believe serious regressions are generally the only reason to block a new release. Otherwise, if this is an old issue, it should be handled separately.

On Monday, September 1, 2014, chutiumteng@gmail.com wrote:

I didn't try it with 1.0.2; it always takes too long to build the Spark assembly jars... more than 20 min.

[info] Packaging /mnt/some-nfs/common/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar ...
[info] Packaging /mnt/some-nfs/common/spark/examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.3-mapr-3.0.3.jar ...
[info] Done packaging.
[info] Done packaging.
[success] Total time: 1582 s, completed Sep 1, 2014 1:39:21 PM

Is there an easy way to exclude some modules, such as spark/examples or spark/external?

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC3-tp8147p8163.html
Run the Big Data Benchmark for new releases
What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly? Nick
Re: Run the Big Data Benchmark for new releases
Hi Nicholas,

At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite.

Matei

On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly?

Nick
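[Editor's note: whichever suite is used, the release-to-release comparison reduces to flagging benchmarks that got slower than some tolerance. A minimal sketch of that comparison in plain Python; the benchmark names, numbers, and the 5% threshold are all made up for illustration, not taken from spark-perf:]

```python
def flag_regressions(baseline, candidate, threshold=1.05):
    """Return names of benchmarks whose candidate run time exceeds the
    baseline run time by more than the given tolerance factor.
    Benchmarks missing from the candidate results are flagged too."""
    return sorted(
        name for name, base_secs in baseline.items()
        if candidate.get(name, float("inf")) > base_secs * threshold
    )

# Hypothetical per-benchmark wall times in seconds for two releases:
baseline = {"scheduling-throughput": 12.0, "sort-by-key": 30.0}
candidate = {"scheduling-throughput": 18.0, "sort-by-key": 29.5}
print(flag_regressions(baseline, candidate))  # ['scheduling-throughput']
```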
Re: Run the Big Data Benchmark for new releases
Oh, that's sweet. So, a related question then: did those tests pick up the performance issue reported in SPARK- (https://issues.apache.org/jira/browse/SPARK-)? Does it make sense to add a new test to cover that case?

On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei

On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly?

Nick
Re: Run the Big Data Benchmark for new releases
Yeah, this wasn't detected in our performance tests. We even have a test in PySpark that I would have thought might catch this (it just schedules a bunch of really small tasks, similar to the regression case): https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

Anyway, Josh is trying to repro the regression to see if we can figure out what is going on. If we find something, we should definitely add a test.

On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Nope, actually, they didn't find that (they found some other things that were fixed, as well as some improvements). Feel free to send a PR, but it would be good to profile the issue first to understand what slowed down. (For example, is the map phase taking longer or is it the reduce phase? Is there some difference in the lengths of specific tasks? Etc.) Matei

On September 1, 2014 at 10:03:20 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

Oh, that's sweet. So, a related question then: did those tests pick up the performance issue reported in SPARK-? Does it make sense to add a new test to cover that case?

On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei

On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly?

Nick
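[Editor's note: the PySpark test linked above essentially times the scheduling of many trivial tasks, so a scheduler regression shows up as a jump in the mean per-task latency. A standalone sketch of that measurement style, in plain Python with no Spark dependency; the function name and defaults are mine, not spark-perf's:]

```python
import time

def time_small_tasks(run_task, num_tasks=10000):
    """Run num_tasks invocations of a trivial task and return
    (total_seconds, mean_seconds_per_task). Tracking the per-task
    mean across releases is what surfaces scheduling overhead."""
    start = time.perf_counter()
    for i in range(num_tasks):
        run_task(i)
    total = time.perf_counter() - start
    return total, total / num_tasks

total, per_task = time_small_tasks(lambda i: i * i)
```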
Re: Run the Big Data Benchmark for new releases
Alright, sounds good! I've created https://github.com/databricks/spark-perf/issues/9 as a reminder for us to add a new test once we've root-caused SPARK-.

On Tue, Sep 2, 2014 at 1:07 AM, Patrick Wendell pwend...@gmail.com wrote:

Yeah, this wasn't detected in our performance tests. We even have a test in PySpark that I would have thought might catch this (it just schedules a bunch of really small tasks, similar to the regression case): https://github.com/databricks/spark-perf/blob/master/pyspark-tests/tests.py#L51

Anyway, Josh is trying to repro the regression to see if we can figure out what is going on. If we find something, we should definitely add a test.

On Mon, Sep 1, 2014 at 10:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Nope, actually, they didn't find that (they found some other things that were fixed, as well as some improvements). Feel free to send a PR, but it would be good to profile the issue first to understand what slowed down. (For example, is the map phase taking longer or is it the reduce phase? Is there some difference in the lengths of specific tasks? Etc.) Matei

On September 1, 2014 at 10:03:20 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

Oh, that's sweet. So, a related question then: did those tests pick up the performance issue reported in SPARK-? Does it make sense to add a new test to cover that case?

On Tue, Sep 2, 2014 at 12:29 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi Nicholas, At Databricks we already run https://github.com/databricks/spark-perf for each release, which is a more comprehensive performance test suite. Matei

On September 1, 2014 at 8:22:05 PM, Nicholas Chammas (nicholas.cham...@gmail.com) wrote:

What do people think of running the Big Data Benchmark https://amplab.cs.berkeley.edu/benchmark/ (repo https://github.com/amplab/benchmark) as part of preparing every new release of Spark? We'd run it just for Spark and effectively use it as another type of test to track any performance progress or regressions from release to release. Would doing such a thing be valuable? Do we already have a way of benchmarking Spark performance that we use regularly?

Nick