Getting the execution times of a Spark job
Hi, I have been playing around with Spark for a couple of days. I am using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the implementation is to run Hive queries on Spark. I used JavaHiveContext to achieve this (as per the examples). I have two questions.

1. How can I get the execution times of a Spark job? Does Spark provide monitoring facilities in the form of an API?

2. I used a layman's way to get the execution times, by wrapping a JavaHiveContext.hql call with System.nanoTime() as follows:

    long start, end;
    JavaHiveContext hiveCtx;
    JavaSchemaRDD hiveResult;

    start = System.nanoTime();
    hiveResult = hiveCtx.hql(query);
    end = System.nanoTime();
    System.out.println(end - start);

But the result I got is drastically different from the execution times recorded in the Spark UI. Can you please explain this disparity?

Look forward to hearing from you.

rgds

--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
Re: Getting the execution times of a Spark job
For your second question: hql() (as well as sql()) does not launch a Spark job immediately; instead, it fires off the Spark SQL parser/optimizer/planner pipeline first, and a Spark job is only started after a physical execution plan has been selected. Therefore, your hand-rolled end-to-end measurement includes the time spent going through the Spark SQL code path, while the times reported in the UI are the execution times of the Spark job(s) only.

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera nira...@wso2.com wrote: [...]
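To make the distinction concrete, here is a minimal Scala sketch (not taken from the thread; it assumes a HiveContext named hiveCtx and a query string are already in scope, on a Spark 1.0.x-era API). Timing the hql() call on its own captures only the parse/optimize/plan work, while timing an action such as count() captures the job that the UI reports:

    // hiveCtx and query are assumed to exist already (e.g. in spark-shell)
    val planStart = System.nanoTime()
    val result = hiveCtx.hql(query)     // builds the logical/physical plan, no job launched yet
    val planEnd = System.nanoTime()

    val jobStart = System.nanoTime()
    val n = result.count()              // the action that actually launches the Spark job(s)
    val jobEnd = System.nanoTime()

    println("planning took " + (planEnd - planStart) / 1e6 + " ms")
    println("execution took " + (jobEnd - jobStart) / 1e6 + " ms")

The second measurement should line up much more closely with what the Spark UI records for the job.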
Re: Spark SQL Query and join different data sources.
Actually, with HiveContext you can join Hive tables with registered temporary tables.

On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote:

oops, thanks Yan, you are right, i got

scala> sqlContext.sql("select * from a join b").take(10)
java.lang.RuntimeException: Table Not Found: b
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

and with hql

scala> hiveContext.hql("select * from a join b").take(10)
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a join b
14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
14/08/22 14:48:45 ERROR metadata.Hive: NoSuchObjectException(message:default.a table not found)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy17.getTable(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)

so sqlContext looks up tables in org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while hiveContext looks them up in org.apache.spark.sql.hive.HiveMetastoreCatalog (HiveMetastoreCatalog.scala).

maybe we can do something in sqlContext to register a hive table as a Spark-SQL table; we would need to read column info, partition info, location, SerDe, Input/OutputFormat and maybe StorageHandler as well from the hive metastore...
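For reference, here is a minimal Scala sketch of the Hive-table-plus-temporary-table join suggested at the top of this reply. The table and column names are made up for illustration, and registerTempTable is the Spark 1.1 name (registerAsTable on older releases):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    import hiveCtx._                    // brings createSchemaRDD into scope for the implicit conversion

    case class Event(id: Int, value: String)
    val events = sc.parallelize(Seq(Event(1, "a"), Event(2, "b")))
    events.registerTempTable("events_tmp")

    // "some_hive_table" is assumed to already exist in the Hive metastore
    val joined = hiveCtx.sql(
      "SELECT t.*, e.value FROM some_hive_table t JOIN events_tmp e ON t.id = e.id")
    joined.take(10).foreach(println)

Because a single HiveContext sees both its own temporary-table catalog and the Hive metastore, the join resolves both sides without any extra registration step.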
about spark assembly jar
hi, all

I suggest that Spark not use the assembly jar as the default run-time dependency (spark-submit/spark-class depend on the assembly jar); using a library directory of all 3rd-party dependency jars, like hadoop/hive/hbase do, seems more reasonable:

1. The assembly jar packages all 3rd-party jars into one big jar, so we need to rebuild it whenever we want to update the version of some component (such as hadoop).
2. In our practice with Spark we sometimes hit jar compatibility issues, and it is hard to diagnose a compatibility issue inside an assembly jar.
Re: about spark assembly jar
Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run against exactly what will be used in a real Spark distro.

On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com wrote: [...]
Re: about spark assembly jar
yes, I am not sure what happens when building the assembly jar; in my understanding it just packages all the dependency jars into a big one.

On 2014/9/2 16:45, Sean Owen wrote: [...]
Re: about spark assembly jar
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change. In that case, I have to reassemble the whole project. Is there another way around this, without using the big jar in development?

For the original question, I have no comments.

--
Ye Xianjin
Sent with Sparrow

On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

No, usually you unit-test your changes during development. That doesn't require the assembly. Eventually you may wish to test some change against the complete assembly. But that's a different question; I thought you were suggesting that the assembly JAR should never be created.

On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin advance...@gmail.com wrote:

Hi, Sean: In development, do I really need to reassemble the whole project even if I only change a line or two of code in one component? I used to do that but found it time-consuming.

--
Ye Xianjin
Sent with Sparrow

On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote: [...]
Re: about spark assembly jar
Hi Sean Owen,

here are some problems I hit when using the assembly jar:

1. I put spark-assembly-*.jar in the lib directory of my application, and it throws a compile error:

Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.
    at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
    at scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)
    at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.init(GenJVM.scala:127)
    at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
    at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
    at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
    at org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
    at org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)
    at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)
    at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)
    at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)

2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1. It runs successfully when using a bag of 3rd-party jars as the dependency, but throws an error when using the assembly jar; it seems the assembly jar leads to a conflict:

ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:66)
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:283)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
    at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4194)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)

On 2014/9/2 16:45, Sean Owen wrote: [...]
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
Zongheng pointed out in my SPARK-3329 PR (https://github.com/apache/spark/pull/2220) that Aaron had already fixed this issue but that it had gotten inadvertently clobbered by another patch. I don't know how the project handles this kind of problem, but I've rewritten my SPARK-3329 branch to cherry-pick Aaron's fix (also fixing a merge conflict and handling a test case that it didn't).

The other weird spurious testsuite failures related to orderings I've seen were in DESCRIBE FUNCTION EXTENDED for functions with lists of synonyms (e.g. STDDEV). I can't reproduce those now but will take another look later this week.

best,
wb

----- Original Message -----
From: Sean Owen so...@cloudera.com
To: Will Benton wi...@redhat.com
Cc: Patrick Wendell pwend...@gmail.com, dev@spark.apache.org
Sent: Sunday, August 31, 2014 12:18:42 PM
Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

Fantastic. As it happens, I just fixed up Mahout's tests for Java 8 and observed a lot of the same type of failure. I'm about to submit PRs for the two issues I identified. AFAICT these 3 then cover the failures I mentioned:

https://issues.apache.org/jira/browse/SPARK-3329
https://issues.apache.org/jira/browse/SPARK-3330
https://issues.apache.org/jira/browse/SPARK-3331

I'd argue that none necessarily block a release, since they just represent a problem with test-only code in Java 8, with the test-only context of Jenkins and multiple profiles, and with a trivial configuration in a style check for Python. Should be fixed but none indicate a bug in the release.

On Sun, Aug 31, 2014 at 6:11 PM, Will Benton wi...@redhat.com wrote:

----- Original Message -----

dev/run-tests fails two tests (1 Hive, 1 Kafka Streaming) for me locally on 1.1.0-rc3. Does anyone else see that? It may be my env. Although I still see the Hive failure on Debian too:

[info] - SET commands semantics for a HiveContext *** FAILED ***
[info] Expected Array(spark.sql.key.usedfortestonly=test.val.0, spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0), but got Array(spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0, spark.sql.key.usedfortestonly=test.val.0) (HiveQuerySuite.scala:541)

I've seen this error before. (In particular, I've seen it on my OS X machine using Oracle JDK 8 but not on Fedora using OpenJDK.) I've also seen similar errors in topic branches (but not on master) that seem to indicate that tests depend on sets of pairs arriving from Hive in a particular order; it seems that this isn't a safe assumption. I just submitted a (trivial) PR to fix this spurious failure: https://github.com/apache/spark/pull/2220

best,
wb
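The spurious failures quoted above come down to tests assuming a fixed ordering of the key=value pairs Hive returns. A fix along those lines usually just compares order-insensitively; here is a minimal Scala sketch (the array values are copied from the failure message purely for illustration, this is not the actual HiveQuerySuite change):

    val expected = Array(
      "spark.sql.key.usedfortestonly=test.val.0",
      "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0")
    val actual = Array(
      "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0",
      "spark.sql.key.usedfortestonly=test.val.0")

    // Compare as sets (or sort both sides) instead of relying on the order pairs arrive in.
    assert(expected.toSet == actual.toSet)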
Re: about spark assembly jar
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging.

-Sandy

On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote: [...]
hive client.getAllPartitions in lookupRelation can take a very long time
In our hive warehouse there are many tables with a lot of partitions, such as:

scala> hiveContext.sql("use db_external")
scala> val result = hiveContext.sql("show partitions et_fullorders").count
result: Long = 5879

I noticed that this part of the code:

https://github.com/apache/spark/blob/9d006c97371ddf357e0b821d5c6d1535d9b6fe41/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L55-L56

reads the whole partition info at the beginning of the planning phase. I added a logInfo around this "val partitions = ..." and it shows:

scala> val result = hiveContext.sql("select * from db_external.et_fullorders limit 5")
14/09/02 16:15:56 INFO ParseDriver: Parsing command: select * from db_external.et_fullorders limit 5
14/09/02 16:15:56 INFO ParseDriver: Parse Completed
14/09/02 16:15:56 INFO HiveContext$$anon$1: getAllPartitionsForPruner started
14/09/02 16:17:35 INFO HiveContext$$anon$1: getAllPartitionsForPruner finished

It took about 2 minutes to get all partitions... Is there any possible way to avoid this operation, such as fetching only the requested partitions somehow?

Thanks
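For what it's worth, the timing described above ("I added a logInfo around this val partitions") can be sketched roughly as follows in Scala. This is not the actual HiveMetastoreCatalog code; the client and table names are hypothetical, and it only illustrates wrapping the metastore call to measure it:

    // assumes this sits inside a class that mixes in org.apache.spark.Logging
    val start = System.currentTimeMillis()
    val partitions = client.getAllPartitionsForPruner(table)   // fetches metadata for *all* partitions
    logInfo("getAllPartitionsForPruner took " +
      (System.currentTimeMillis() - start) + " ms for " + partitions.size + " partitions")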
hey spark developers! intro from shane knapp, devops engineer @ AMPLab
so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure and some plugin support. i've been working in the bay area now since 1997 at many different companies, and my last 10 years has been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing! if anyone has any requests/questions/comments, feel free to drop me a line! shane
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome, Shane!

On Tuesday, September 2, 2014, shane knapp skn...@berkeley.edu wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Hi Shane!

Thank you for doing the Jenkins upgrade last week. It's nice to know that infrastructure is gonna get some dedicated TLC going forward.

Welcome aboard!

Nick

On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote: [...]
Re: about spark assembly jar
Having an SSD helps tremendously with assembly time. Without that, you can do the following for Spark to pick up the compiled classes before assembly at runtime:

export SPARK_PREPEND_CLASSES=true

On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Hey Shane,

Thanks for your work so far and I'm really happy to see investment in this infrastructure. This is a key productivity tool for us and something we'd love to expand over time to improve the development process of Spark.

- Patrick

On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting this to be a first-class engineering college subject. I just didn't expect it to come through this route :-) So congrats, and I hope you represent the beginning of a great new trend at universities.

Sent while mobile. Please excuse typos etc.

On Sep 2, 2014 11:00 AM, Patrick Wendell pwend...@gmail.com wrote: [...]
Resource allocation
Hi,

I want to incorporate some intelligence into choosing the resources for RDD replication. I thought that if we replicate an RDD on specially chosen nodes, based on their capabilities, the next application that requires this RDD could be executed more efficiently. But I found that an RDD created by an application is owned only by that application, and nobody else can access it.

Can someone tell me what kind of operations can be done on a replicated RDD? Or, to put it another way, what are the benefits of a replicated RDD, and what operations can be performed on it? I just want to know how effective my work is going to be. I'll be happy if other ideas along a similar line of thought are suggested.

Thank you!!
Karthik
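In case it helps frame the question: in Spark, a "replicated RDD" normally just means an RDD persisted with a replicated storage level, and the replica blocks live inside the owning application's executors. A minimal Scala sketch (the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///some/input/path")
    // Keep two copies of each block. This speeds up recovery within THIS application
    // if an executor holding one copy is lost, but other applications still cannot read it.
    val replicated = data.persist(StorageLevel.MEMORY_AND_DISK_2)
    replicated.count()   // materializes the blocks and their replicas

So the main benefit of replication is fault tolerance and faster recovery for the application that owns the RDD, not cross-application sharing.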
Re: about spark assembly jar
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all this useful black magic.

On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote: [...]
Re: about spark assembly jar
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote: [...]
Re: about spark assembly jar
Cool, didn't notice that, thanks Josh!

On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome Shane =)

- Henry

On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome Shane! Glad to see a hero finally jumping out to tame Jenkins :)

On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com wrote: [...]
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8).

best,
wb

----- Original Message -----
From: Patrick Wendell pwend...@gmail.com
To: dev@spark.apache.org
Sent: Saturday, August 30, 2014 5:07:52 PM
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf
2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

- Tested Thrift server and SQL CLI locally on OS X 10.9.
- Checked datanucleus dependencies in the distribution tarball built by make-distribution.sh without SPARK_HIVE defined.

On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote: [...]
Checkpointing Pregel
Hey guys,

I'm trying to run connected components on graphs that take a fairly large number of iterations (25-30) and run for 5-6 hours. More than half the time I end up getting fetch failures and losing an executor after a number of iterations. Spark then has to go back and recompute the pieces it lost, which don't seem to be persisted at the same level as the graph, so those iterations take exponentially longer and I have to kill the job because it's not worth waiting for it to finish.

The approach I'm currently trying is checkpointing the vertices and edges (and maybe the messages?) in Pregel. What I've been testing so far is the patch below, which seems to be working (I haven't had any failures since adding this change, so I don't actually know whether a failure would still recompute from the start). However, I'm also seeing things like five instances of VertexRDDs being persisted at the same time, and "reduce at VertexRDD.scala:111" running twice each iteration. Is this the proper / most efficient way of doing this checkpointing, and if not, what would work better?

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 5e55620..5be40c3 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -134,6 +134,11 @@ object Pregel extends Logging {
       g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
       g.cache()
+      g.vertices.checkpoint()
+      g.vertices.count()
+      g.edges.checkpoint()
+      g.edges.count()
+
       val oldMessages = messages
       // Send new messages. Vertices that didn't get any messages don't appear in newVerts, so don't
       // get to send messages. We must cache messages so it can be materialized on the next line,
@@ -142,6 +147,7 @@ object Pregel extends Logging {
       // The call to count() materializes `messages`, `newVerts`, and the vertices of `g`. This
       // hides oldMessages (depended on by newVerts), newVerts (depended on by messages), and the
       // vertices of prevG (depended on by newVerts, oldMessages, and the vertices of g).
+      messages.checkpoint()
       activeMessages = messages.count()
       logInfo("Pregel finished iteration " + i)

Best Regards,
Jeffrey Picard
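Two practical notes for anyone trying this (these are additions for illustration, not part of Jeffrey's patch): RDD.checkpoint() only takes effect once a checkpoint directory has been set on the SparkContext, ideally on a fault-tolerant filesystem such as HDFS, and checkpointing every single iteration adds nontrivial I/O, so checkpointing every few iterations is a common compromise. A minimal sketch, where the HDFS path and the interval knob are assumptions, not Spark settings:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// checkpoint() requires a checkpoint directory to have been configured first.
def setupCheckpointing(sc: SparkContext): Unit =
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // illustrative path

// Checkpoint only every `interval` iterations to limit the extra write cost.
def maybeCheckpoint[VD, ED](g: Graph[VD, ED], iteration: Int, interval: Int = 5): Unit = {
  if (iteration % interval == 0) {
    g.vertices.checkpoint(); g.vertices.count()  // count() forces materialization
    g.edges.checkpoint();    g.edges.count()
  }
}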
Ask something about spark
Hi, I am phoenixlee, a Spark programmer in Korea. I have been given a good opportunity to teach Spark to college students and office workers, and the course will be run with government support. May I use the material (pictures, samples, etc.) from the Spark homepage for this course? Of course, I will include an acknowledgment and the webpage URL. It would be a great opportunity, since there are still no Spark teaching materials or training (or community) in Korea. Thanks.
Re: Ask something about spark
I think in general that is fine. It would be great if your slides come with proper attribution.

On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote:
Hi, I am phoenixlee, a Spark programmer in Korea. May I use the material (pictures, samples, etc.) from the Spark homepage for a Spark course for college students and office workers?
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote:
+1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Verified PySpark InputFormat/OutputFormat examples.

On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin r...@databricks.com wrote:
+1
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
+1 Tested on Mac OS X. Matei

On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:
+1 Verified PySpark InputFormat/OutputFormat examples.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested on Mac OSX: Thrift Server, SparkSQL.

On September 2, 2014 at 17:29:29, Michael Armbrust (mich...@databricks.com) wrote:
+1
RE: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

From: Patrick Wendell [pwend...@gmail.com]
Sent: Saturday, August 30, 2014 4:08 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
RE: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1
Re: quick jenkins restart
and we're back and building! On Tue, Sep 2, 2014 at 5:07 PM, shane knapp skn...@berkeley.edu wrote: since our queue is really short, i'm waiting for a couple of builds to finish and will be restarting jenkins to install/update some plugins. the github pull request builder looks like it has some fixes to reduce spammy github calls, and reduce any potential rate limiting. i'll let everyone know when it's back up... this should be super quick (~15 mins for tests to finish, ~2 mins for jenkins to restart). thanks in advance! shane
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested on HDP 2.1 Sandbox: Thrift Server with Simba Shark ODBC.

Paolo

From: Jeremy Freeman freeman.jer...@gmail.com
Sent: Wednesday, September 3, 2014 02:34
To: d...@spark.incubator.apache.org
+1
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious.

On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
-1: I believe I've found a regression from 1.0.2. The report is captured in SPARK- https://issues.apache.org/jira/browse/SPARK-.
Re: about spark assembly jar
Yea, SSD + SPARK_PREPEND_CLASSES is great for iterative development! But then why does it work fine with a bag of third-party jars yet throw an error with the assembly jar? Does anyone have an idea?

On 2014/9/3 2:57, Cheng Lian wrote:
Cool, didn't notice that, thanks Josh!

On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote:
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote:
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all this useful black magic.

On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote:
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before the assembly at runtime:

export SPARK_PREPEND_CLASSES=true

On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging.
-Sandy

On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote:
Hi Sean Owen, here are some problems I hit when using the assembly jar.

1. I put spark-assembly-*.jar into the lib directory of my application and it throws a compile error:

Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.
    at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
    at scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)
    at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<init>(GenJVM.scala:127)
    at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
    at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
    at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
    at org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
    at org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)
    at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)
    at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)
    at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)

2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1. It runs successfully when a bag of third-party jars is used as the dependency, but throws an error when using the assembly jar; it seems the assembly jar leads to a conflict:

ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
    at org.apache.hadoop.hive.ql.io.parquet.serde.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
Thanks everyone for voting on this. Two minor issues (one a blocker) were found that warrant cutting a new RC. For those who voted +1 on this release, I'd encourage you to +1 rc4 when it comes out unless you have been testing issues specific to the EC2 scripts; this will move the release process along.

SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

- Patrick

On Tue, Sep 2, 2014 at 6:55 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious.
Re: [Spark SQL] off-heap columnar store
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell i...@ianoconnell.com wrote:
> I'm not sure what you mean here? Parquet is at its core just a format; you could store that data anywhere. Though it sounds like you are saying (correct me if I'm wrong): you basically want a columnar abstraction layer where you can provide a different backing implementation to keep the columns, rather than parquet-mr? I.e. you want to be able to produce a SchemaRDD from something like Vertica, where updates act as a write-through cache back to Vertica itself?

Something like that. I'd like:

1) An API to produce a SchemaRDD from an RDD of columns, not rows. However, an RDD[Column] would not make sense, since it would be spread out across partitions. Perhaps what is needed is a Seq[RDD[ColumnSegment]]: each RDD would hold the segments for one column, and each segment would represent a range of rows. This would then read from something like Vertica or Cassandra.

2) A variant of 1) where you could read this data from Tachyon. Tachyon is supposed to support a columnar representation of data; it did for Shark 0.9.x. The goal is basically to load columnar data from something like Cassandra into Tachyon, with the compression ratio of columnar storage and the speed of InMemoryColumnarTableScan. If data is appended to the Tachyon representation, it should be possible to cache it back, though write-back is not as high a priority.

A workaround would be to read the data from Cassandra/Vertica/etc. and write it back into Parquet, but this would take a long time and incur huge I/O overhead.

> I'm sorry, it just sounds like it's worth clearly defining what your key requirement/goal is.
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan velvia.git...@gmail.com wrote:
>> The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.
>
> Can you elaborate?

Sure.
- The organization or company has no Hadoop, but a significant investment in some other NoSQL store.
- The need to efficiently add a new column to existing data.
- The need to mark some existing rows as deleted, or to replace small bits of existing data.

For these use cases it would be much more efficient and practical if we didn't have to take the data out of its original datastore and convert it to Parquet first. Doing so adds significant latency and causes Ops headaches in having to maintain HDFS. It would be great to be able to load data directly into the columnar format, i.e. into the InMemoryColumnarCache.
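To make the ask above concrete, here is a rough signature-level sketch of what such a column-oriented construction API could look like. ColumnSegment, rowRange, and createFromColumnSegments are hypothetical names invented for illustration, not existing Spark SQL API; today a SchemaRDD is built from an RDD of rows (e.g. via SQLContext.applySchema), which is exactly what this proposal would sidestep.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD, StructType}

// Hypothetical: one compressed segment of a single column, covering a row range.
case class ColumnSegment(
    columnName: String,
    rowRange: (Long, Long),        // [start, end) row ids covered by this segment
    compressedBytes: Array[Byte])  // column values, compressed columnar encoding

// Hypothetical entry point: one RDD per column, aligned on row ranges,
// assembled into a SchemaRDD without ever materializing row objects.
// Signature only; the body is deliberately left unimplemented.
def createFromColumnSegments(
    sqlContext: SQLContext,
    columns: Seq[RDD[ColumnSegment]],
    schema: StructType): SchemaRDD = ???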