Getting the execution times of a Spark job
Hi, I have been playing around with Spark for a couple of days. I am using spark-1.0.1-bin-hadoop1 and the Java API. The main idea of the implementation is to run Hive queries on Spark. I used JavaHiveContext to achieve this (as per the examples). I have two questions.

1. How can I get the execution times of a Spark job? Does Spark provide monitoring facilities in the form of an API?

2. I used a layman's way to get the execution times, by wrapping a JavaHiveContext.hql call with System.nanoTime() as follows:

    long start, end;
    JavaHiveContext hiveCtx;
    JavaSchemaRDD hiveResult;

    start = System.nanoTime();
    hiveResult = hiveCtx.hql(query);
    end = System.nanoTime();
    System.out.println(end - start);

But the result I got is drastically different from the execution times recorded in the Spark UI. Can you please explain this disparity?

Look forward to hearing from you.

rgds

--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 https://twitter.com/N1R44
Re: Getting the execution times of a Spark job
For your second question: hql() (as well as sql()) does not launch a Spark job immediately; instead, it fires off the Spark SQL parser/optimizer/planner pipeline first, and a Spark job is only started after a physical execution plan has been selected. Therefore, your hand-rolled end-to-end measurement includes the time spent going through the Spark SQL code path, while the times reported in the UI are the execution times of the Spark job(s) only.

On Mon, Sep 1, 2014 at 11:45 PM, Niranda Perera nira...@wso2.com wrote: [...]
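To make the distinction concrete, here is a minimal Scala sketch (not taken from the thread; it assumes a HiveContext named hiveCtx and a query string are already in scope, on a Spark 1.0.x-era API). Timing the hql() call on its own captures only the parse/optimize/plan work, while timing an action such as count() captures the job that the UI reports:

    // hiveCtx and query are assumed to exist already (e.g. in spark-shell)
    val planStart = System.nanoTime()
    val result = hiveCtx.hql(query)     // builds the logical/physical plan, no job launched yet
    val planEnd = System.nanoTime()

    val jobStart = System.nanoTime()
    val n = result.count()              // the action that actually launches the Spark job(s)
    val jobEnd = System.nanoTime()

    println("planning took " + (planEnd - planStart) / 1e6 + " ms")
    println("execution took " + (jobEnd - jobStart) / 1e6 + " ms")

The second measurement should line up much more closely with what the Spark UI records for the job.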
Re: Spark SQL Query and join different data sources.
Actually, with HiveContext you can join Hive tables with registered temporary tables.

On Fri, Aug 22, 2014 at 9:07 PM, chutium teng@gmail.com wrote:

oops, thanks Yan, you are right, i got

scala> sqlContext.sql("select * from a join b").take(10)
java.lang.RuntimeException: Table Not Found: b
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:90)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:90)

and with hql

scala> hiveContext.hql("select * from a join b").take(10)
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/08/22 14:48:45 INFO parse.ParseDriver: Parsing command: select * from a join b
14/08/22 14:48:45 INFO parse.ParseDriver: Parse Completed
14/08/22 14:48:45 ERROR metadata.Hive: NoSuchObjectException(message:default.a table not found)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27129)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:27097)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:27028)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
    at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:922)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:854)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
    at com.sun.proxy.$Proxy17.getTable(Unknown Source)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
    at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:59)

so sqlContext looks up tables in org.apache.spark.sql.catalyst.analysis.SimpleCatalog (Catalog.scala), while hiveContext looks them up in org.apache.spark.sql.hive.HiveMetastoreCatalog (HiveMetastoreCatalog.scala).

maybe we can do something in sqlContext to register a hive table as a Spark-SQL table; we would need to read column info, partition info, location, SerDe, Input/OutputFormat and maybe StorageHandler as well from the hive metastore...
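For reference, here is a minimal Scala sketch of the Hive-table-plus-temporary-table join suggested at the top of this reply. The table and column names are made up for illustration, and registerTempTable is the Spark 1.1 name (registerAsTable on older releases):

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc)
    import hiveCtx._                    // brings createSchemaRDD into scope for the implicit conversion

    case class Event(id: Int, value: String)
    val events = sc.parallelize(Seq(Event(1, "a"), Event(2, "b")))
    events.registerTempTable("events_tmp")

    // "some_hive_table" is assumed to already exist in the Hive metastore
    val joined = hiveCtx.sql(
      "SELECT t.*, e.value FROM some_hive_table t JOIN events_tmp e ON t.id = e.id")
    joined.take(10).foreach(println)

Because a single HiveContext sees both its own temporary-table catalog and the Hive metastore, the join resolves both sides without any extra registration step.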
about spark assembly jar
hi, all

I suggest that Spark not use the assembly jar as the default run-time dependency (spark-submit/spark-class depend on the assembly jar); using a library directory of all 3rd-party dependency jars, like hadoop/hive/hbase do, seems more reasonable:

1. The assembly jar packages all 3rd-party jars into one big jar, so we need to rebuild it whenever we want to update the version of some component (such as hadoop).
2. In our practice with Spark we sometimes hit jar compatibility issues, and it is hard to diagnose a compatibility issue inside an assembly jar.
Re: about spark assembly jar
Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run against exactly what will be used in a real Spark distro.

On Tue, Sep 2, 2014 at 9:39 AM, scwf wangf...@huawei.com wrote: [...]
Re: about spark assembly jar
yes, I am not sure what happens when building the assembly jar; in my understanding it just packages all the dependency jars into a big one.

On 2014/9/2 16:45, Sean Owen wrote: [...]
Re: about spark assembly jar
Sorry, the quick reply didn't cc the dev list.

Sean, sometimes I have to use the spark-shell to confirm some behavior change. In that case, I have to reassemble the whole project. Is there another way around this, without using the big jar in development?

For the original question, I have no comments.

--
Ye Xianjin
Sent with Sparrow

On Tuesday, September 2, 2014 at 4:58 PM, Sean Owen wrote:

No, usually you unit-test your changes during development. That doesn't require the assembly. Eventually you may wish to test some change against the complete assembly. But that's a different question; I thought you were suggesting that the assembly JAR should never be created.

On Tue, Sep 2, 2014 at 9:53 AM, Ye Xianjin advance...@gmail.com wrote:

Hi, Sean: In development, do I really need to reassemble the whole project even if I only change a line or two of code in one component? I used to do that but found it time-consuming.

--
Ye Xianjin
Sent with Sparrow

On Tuesday, September 2, 2014 at 4:45 PM, Sean Owen wrote: [...]
Re: about spark assembly jar
Hi Sean Owen,

here are some problems I hit when using the assembly jar:

1. I put spark-assembly-*.jar in the lib directory of my application, and it throws a compile error:

Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.
    at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
    at scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)
    at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.init(GenJVM.scala:127)
    at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
    at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
    at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
    at org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
    at org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)
    at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)
    at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)
    at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)

2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1. It runs successfully when using a bag of 3rd-party jars as the dependency, but throws an error when using the assembly jar; it seems the assembly jar leads to a conflict:

ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getObjectInspector(ArrayWritableObjectInspector.java:66)
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.init(ArrayWritableObjectInspector.java:59)
    at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:113)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:283)
    at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:597)
    at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4194)
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)

On 2014/9/2 16:45, Sean Owen wrote: [...]
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
Zongheng pointed out in my SPARK-3329 PR (https://github.com/apache/spark/pull/2220) that Aaron had already fixed this issue but that it had gotten inadvertently clobbered by another patch. I don't know how the project handles this kind of problem, but I've rewritten my SPARK-3329 branch to cherry-pick Aaron's fix (also fixing a merge conflict and handling a test case that it didn't).

The other weird spurious testsuite failures related to orderings I've seen were in DESCRIBE FUNCTION EXTENDED for functions with lists of synonyms (e.g. STDDEV). I can't reproduce those now but will take another look later this week.

best,
wb

----- Original Message -----
From: Sean Owen so...@cloudera.com
To: Will Benton wi...@redhat.com
Cc: Patrick Wendell pwend...@gmail.com, dev@spark.apache.org
Sent: Sunday, August 31, 2014 12:18:42 PM
Subject: Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

Fantastic. As it happens, I just fixed up Mahout's tests for Java 8 and observed a lot of the same type of failure. I'm about to submit PRs for the two issues I identified. AFAICT these 3 then cover the failures I mentioned:

https://issues.apache.org/jira/browse/SPARK-3329
https://issues.apache.org/jira/browse/SPARK-3330
https://issues.apache.org/jira/browse/SPARK-3331

I'd argue that none necessarily block a release, since they just represent a problem with test-only code in Java 8, with the test-only context of Jenkins and multiple profiles, and with a trivial configuration in a style check for Python. Should be fixed but none indicate a bug in the release.

On Sun, Aug 31, 2014 at 6:11 PM, Will Benton wi...@redhat.com wrote:

----- Original Message -----

dev/run-tests fails two tests (1 Hive, 1 Kafka Streaming) for me locally on 1.1.0-rc3. Does anyone else see that? It may be my env. Although I still see the Hive failure on Debian too:

[info] - SET commands semantics for a HiveContext *** FAILED ***
[info] Expected Array(spark.sql.key.usedfortestonly=test.val.0, spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0), but got Array(spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0, spark.sql.key.usedfortestonly=test.val.0) (HiveQuerySuite.scala:541)

I've seen this error before. (In particular, I've seen it on my OS X machine using Oracle JDK 8 but not on Fedora using OpenJDK.) I've also seen similar errors in topic branches (but not on master) that seem to indicate that tests depend on sets of pairs arriving from Hive in a particular order; it seems that this isn't a safe assumption. I just submitted a (trivial) PR to fix this spurious failure: https://github.com/apache/spark/pull/2220

best,
wb
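The spurious failures quoted above come down to tests assuming a fixed ordering of the key=value pairs Hive returns. A fix along those lines usually just compares order-insensitively; here is a minimal Scala sketch (the array values are copied from the failure message purely for illustration, this is not the actual HiveQuerySuite change):

    val expected = Array(
      "spark.sql.key.usedfortestonly=test.val.0",
      "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0")
    val actual = Array(
      "spark.sql.key.usedfortestonlyspark.sql.key.usedfortestonly=test.val.0test.val.0",
      "spark.sql.key.usedfortestonly=test.val.0")

    // Compare as sets (or sort both sides) instead of relying on the order pairs arrive in.
    assert(expected.toSet == actual.toSet)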
Re: about spark assembly jar
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging.

-Sandy

On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote: [...]
hive client.getAllPartitions in lookupRelation can take a very long time
In our hive warehouse there are many tables with a lot of partitions, such as:

scala> hiveContext.sql("use db_external")
scala> val result = hiveContext.sql("show partitions et_fullorders").count
result: Long = 5879

I noticed that this part of the code:

https://github.com/apache/spark/blob/9d006c97371ddf357e0b821d5c6d1535d9b6fe41/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L55-L56

reads the whole partition info at the beginning of the planning phase. I added a logInfo around this "val partitions = ..." and it shows:

scala> val result = hiveContext.sql("select * from db_external.et_fullorders limit 5")
14/09/02 16:15:56 INFO ParseDriver: Parsing command: select * from db_external.et_fullorders limit 5
14/09/02 16:15:56 INFO ParseDriver: Parse Completed
14/09/02 16:15:56 INFO HiveContext$$anon$1: getAllPartitionsForPruner started
14/09/02 16:17:35 INFO HiveContext$$anon$1: getAllPartitionsForPruner finished

It took about 2 minutes to get all partitions... Is there any possible way to avoid this operation, such as fetching only the requested partitions somehow?

Thanks
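For what it's worth, the timing described above ("I added a logInfo around this val partitions") can be sketched roughly as follows in Scala. This is not the actual HiveMetastoreCatalog code; the client and table names are hypothetical, and it only illustrates wrapping the metastore call to measure it:

    // assumes this sits inside a class that mixes in org.apache.spark.Logging
    val start = System.currentTimeMillis()
    val partitions = client.getAllPartitionsForPruner(table)   // fetches metadata for *all* partitions
    logInfo("getAllPartitionsForPruner took " +
      (System.currentTimeMillis() - start) + " ms for " + partitions.size + " partitions")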
hey spark developers! intro from shane knapp, devops engineer @ AMPLab
so, i had a meeting w/the databricks guys on friday and they recommended i send an email out to the list to say 'hi' and give you guys a quick intro. :) hi! i'm shane knapp, the new AMPLab devops engineer, and will be spending time getting the jenkins build infrastructure up to production quality. much of this will be 'under the covers' work, like better system level auth, backups, etc, but some will definitely be user facing: timely jenkins updates, debugging broken build infrastructure and some plugin support. i've been working in the bay area now since 1997 at many different companies, and my last 10 years has been split between google and palantir. i'm a huge proponent of OSS, and am really happy to be able to help with the work you guys are doing! if anyone has any requests/questions/comments, feel free to drop me a line! shane
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome, Shane!

On Tuesday, September 2, 2014, shane knapp skn...@berkeley.edu wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Hi Shane!

Thank you for doing the Jenkins upgrade last week. It's nice to know that infrastructure is gonna get some dedicated TLC going forward.

Welcome aboard!

Nick

On Tue, Sep 2, 2014 at 1:35 PM, shane knapp skn...@berkeley.edu wrote: [...]
Re: about spark assembly jar
Having an SSD helps tremendously with assembly time. Without that, you can do the following for Spark to pick up the compiled classes before assembly at runtime:

export SPARK_PREPEND_CLASSES=true

On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Hey Shane,

Thanks for your work so far and I'm really happy to see investment in this infrastructure. This is a key productivity tool for us and something we'd love to expand over time to improve the development process of Spark.

- Patrick

On Tue, Sep 2, 2014 at 10:47 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome, Shane. As a former prof and eng dir at Google, I've been expecting this to be a first-class engineering college subject. I just didn't expect it to come through this route :-) So congrats, and I hope you represent the beginning of a great new trend at universities.

Sent while mobile. Please excuse typos etc.

On Sep 2, 2014 11:00 AM, Patrick Wendell pwend...@gmail.com wrote: [...]
Resource allocation
Hi,

I want to incorporate some intelligence into choosing the resources for RDD replication. I thought that if we replicate an RDD on specially chosen nodes, based on their capabilities, the next application that requires this RDD could be executed more efficiently. But I found that an RDD created by an application is owned only by that application, and nobody else can access it.

Can someone tell me what kind of operations can be done on a replicated RDD? Or, to put it another way, what are the benefits of a replicated RDD, and what operations can be performed on it? I just want to know how effective my work is going to be. I'll be happy if other ideas along a similar line of thought are suggested.

Thank you!!
Karthik
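In case it helps frame the question: in Spark, a "replicated RDD" normally just means an RDD persisted with a replicated storage level, and the replica blocks live inside the owning application's executors. A minimal Scala sketch (the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///some/input/path")
    // Keep two copies of each block. This speeds up recovery within THIS application
    // if an executor holding one copy is lost, but other applications still cannot read it.
    val replicated = data.persist(StorageLevel.MEMORY_AND_DISK_2)
    replicated.count()   // materializes the blocks and their replicas

So the main benefit of replication is fault tolerance and faster recovery for the application that owns the RDD, not cross-application sharing.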
Re: about spark assembly jar
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all this useful black magic.

On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote: [...]
Re: about spark assembly jar
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote: [...]
Re: about spark assembly jar
Cool, didn't notice that, thanks Josh!

On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome Shane =)

- Henry

On Tue, Sep 2, 2014 at 10:35 AM, shane knapp skn...@berkeley.edu wrote: [...]
Re: hey spark developers! intro from shane knapp, devops engineer @ AMPLab
Welcome Shane! Glad to see a hero finally jumping out to tame Jenkins :)

On Tue, Sep 2, 2014 at 12:44 PM, Henry Saputra henry.sapu...@gmail.com wrote: [...]
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested Scala/MLlib apps on Fedora 20 (OpenJDK 7) and OS X 10.9 (Oracle JDK 8).

best,
wb

----- Original Message -----
From: Patrick Wendell pwend...@gmail.com
To: dev@spark.apache.org
Sent: Saturday, August 30, 2014 5:07:52 PM
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)

Please vote on releasing the following candidate as Apache Spark version 1.1.0!

The tag to be voted on is v1.1.0-rc3 (commit b2d0493b):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b2d0493b223c5f98a593bb6d7372706cc02bebad

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1030/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-rc3-docs/

Please vote on releasing this package as Apache Spark 1.1.0!

The vote is open until Tuesday, September 02, at 23:07 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== Regressions fixed since RC1 ==
- Build issue for SQL support: https://issues.apache.org/jira/browse/SPARK-3234
- EC2 script version bump to 1.1.0.

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with previous votes, so -1 votes should only occur for significant regressions from 1.0.2. Bugs already present in 1.0.X will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.io.compression.codec is now snappy
-- Old behavior can be restored by switching to lzf
2. PySpark now performs external spilling during aggregations.
-- Old behavior can be restored by setting spark.shuffle.spill to false.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

- Tested Thrift server and SQL CLI locally on OS X 10.9.
- Checked datanucleus dependencies in the distribution tarball built by make-distribution.sh without SPARK_HIVE defined.

On Tue, Sep 2, 2014 at 2:30 PM, Will Benton wi...@redhat.com wrote: [...]
Checkpointing Pregel
Hey guys,

I'm trying to run connected components on graphs that take a fairly large number of iterations (25-30) and run for 5-6 hours. More than half the time I end up getting fetch failures and losing an executor after a number of iterations. Spark then has to go back and recompute the pieces it lost, which don't seem to be persisted at the same level as the graph, so those iterations take exponentially longer and I have to kill the job because it's not worth waiting for it to finish.

The approach I'm currently trying is checkpointing the vertices and edges (and maybe the messages?) in Pregel. What I've been testing so far is the patch below, which seems to be working (I haven't had any failures since adding this change, so I don't actually know whether a failure would still recompute from the start). However, I'm also seeing things like five instances of VertexRDDs being persisted at the same time, and "reduce at VertexRDD.scala:111" running twice each iteration. Is this the proper / most efficient way of doing this checkpointing, and if not, what would work better?

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 5e55620..5be40c3 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -134,6 +134,11 @@ object Pregel extends Logging {
       g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
       g.cache()
+      g.vertices.checkpoint()
+      g.vertices.count()
+      g.edges.checkpoint()
+      g.edges.count()
+
       val oldMessages = messages
       // Send new messages. Vertices that didn't get any messages don't appear in newVerts, so don't
       // get to send messages. We must cache messages so it can be materialized on the next line,
@@ -142,6 +147,7 @@ object Pregel extends Logging {
       // The call to count() materializes `messages`, `newVerts`, and the vertices of `g`. This
       // hides oldMessages (depended on by newVerts), newVerts (depended on by messages), and the
       // vertices of prevG (depended on by newVerts, oldMessages, and the vertices of g).
+      messages.checkpoint()
       activeMessages = messages.count()
       logInfo("Pregel finished iteration " + i)

Best Regards,
Jeffrey Picard
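Two practical notes for anyone trying this (these are additions for illustration, not part of Jeffrey's patch): RDD.checkpoint() only takes effect once a checkpoint directory has been set on the SparkContext, ideally on a fault-tolerant filesystem such as HDFS, and checkpointing every single iteration adds nontrivial I/O, so checkpointing every few iterations is a common compromise. A minimal sketch, where the HDFS path and the interval knob are assumptions, not Spark settings:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// checkpoint() requires a checkpoint directory to have been configured first.
def setupCheckpointing(sc: SparkContext): Unit =
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // illustrative path

// Checkpoint only every `interval` iterations to limit the extra write cost.
def maybeCheckpoint[VD, ED](g: Graph[VD, ED], iteration: Int, interval: Int = 5): Unit = {
  if (iteration % interval == 0) {
    g.vertices.checkpoint(); g.vertices.count()  // count() forces materialization
    g.edges.checkpoint();    g.edges.count()
  }
}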
Ask something about spark
Hi, I am phoenixlee, a Spark programmer in Korea. I have been given a good opportunity to teach Spark to college students and office workers, and the course will be run with government support. May I use the material (pictures, samples, etc.) from the Spark homepage for this course? Of course, I will include an acknowledgment and the webpage URL. It would be a great opportunity, since there are still no Spark teaching materials or training (or community) in Korea. Thanks.
Re: Ask something about spark
I think in general that is fine. It would be great if your slides come with proper attribution.

On Tue, Sep 2, 2014 at 3:31 PM, Sanghoon Lee phoenixl...@gmail.com wrote:
Hi, I am phoenixlee, a Spark programmer in Korea. May I use the material (pictures, samples, etc.) from the Spark homepage for a Spark course for college students and office workers?
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

On Tue, Sep 2, 2014 at 3:08 PM, Cheng Lian lian.cs@gmail.com wrote:
+1 - Tested Thrift server and SQL CLI locally on OSX 10.9. - Checked datanucleus dependencies in distribution tarball built by make-distribution.sh without SPARK_HIVE defined.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Verified PySpark InputFormat/OutputFormat examples.

On Tue, Sep 2, 2014 at 4:10 PM, Reynold Xin r...@databricks.com wrote:
+1
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

On Tue, Sep 2, 2014 at 5:18 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
+1 Tested on Mac OS X. Matei

On September 2, 2014 at 5:03:19 PM, Kan Zhang (kzh...@apache.org) wrote:
+1 Verified PySpark InputFormat/OutputFormat examples.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested on Mac OSX: Thrift Server, SparkSQL.

On September 2, 2014 at 17:29:29, Michael Armbrust (mich...@databricks.com) wrote:
+1
RE: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

From: Patrick Wendell [pwend...@gmail.com]
Sent: Saturday, August 30, 2014 4:08 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 1.1.0 (RC3)
RE: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1
Re: quick jenkins restart
and we're back and building! On Tue, Sep 2, 2014 at 5:07 PM, shane knapp skn...@berkeley.edu wrote: since our queue is really short, i'm waiting for a couple of builds to finish and will be restarting jenkins to install/update some plugins. the github pull request builder looks like it has some fixes to reduce spammy github calls, and reduce any potential rate limiting. i'll let everyone know when it's back up... this should be super quick (~15 mins for tests to finish, ~2 mins for jenkins to restart). thanks in advance! shane
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
+1

Tested on HDP 2.1 Sandbox: Thrift Server with Simba Shark ODBC.

Paolo

From: Jeremy Freeman freeman.jer...@gmail.com
Sent: Wednesday, September 3, 2014 02:34
To: d...@spark.incubator.apache.org
+1
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious.

On Sun, Aug 31, 2014 at 5:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
-1: I believe I've found a regression from 1.0.2. The report is captured in SPARK- https://issues.apache.org/jira/browse/SPARK-.
Re: about spark assembly jar
Yea, SSD + SPARK_PREPEND_CLASSES is great for iterative development! But then why does it work fine with a bag of third-party jars yet throw an error with the assembly jar? Does anyone have an idea?

On 2014/9/3 2:57, Cheng Lian wrote:
Cool, didn't notice that, thanks Josh!

On Tue, Sep 2, 2014 at 11:55 AM, Josh Rosen rosenvi...@gmail.com wrote:
SPARK_PREPEND_CLASSES is documented on the Spark Wiki (which could probably be easier to find): https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On September 2, 2014 at 11:53:49 AM, Cheng Lian (lian.cs@gmail.com) wrote:
Yea, SSD + SPARK_PREPEND_CLASSES totally changed my life :) Maybe we should add a developer notes page to document all this useful black magic.

On Tue, Sep 2, 2014 at 10:54 AM, Reynold Xin r...@databricks.com wrote:
Having an SSD helps tremendously with assembly time. Without that, you can do the following in order for Spark to pick up the compiled classes before the assembly at runtime:

export SPARK_PREPEND_CLASSES=true

On Tue, Sep 2, 2014 at 9:10 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
This doesn't help for every dependency, but Spark provides an option to build the assembly jar without Hadoop and its dependencies. We make use of this in CDH packaging.
-Sandy

On Tue, Sep 2, 2014 at 2:12 AM, scwf wangf...@huawei.com wrote:
Hi Sean Owen, here are some problems I hit when using the assembly jar.

1. I put spark-assembly-*.jar into the lib directory of my application and it throws a compile error:

Error:scalac: Error: class scala.reflect.BeanInfo not found.
scala.tools.nsc.MissingRequirementError: class scala.reflect.BeanInfo not found.
    at scala.tools.nsc.symtab.Definitions$definitions$.getModuleOrClass(Definitions.scala:655)
    at scala.tools.nsc.symtab.Definitions$definitions$.getClass(Definitions.scala:608)
    at scala.tools.nsc.backend.jvm.GenJVM$BytecodeGenerator.<init>(GenJVM.scala:127)
    at scala.tools.nsc.backend.jvm.GenJVM$JvmPhase.run(GenJVM.scala:85)
    at scala.tools.nsc.Global$Run.compileSources(Global.scala:953)
    at scala.tools.nsc.Global$Run.compile(Global.scala:1041)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:126)
    at xsbt.CachedCompiler0.liftedTree1$1(CompilerInterface.scala:102)
    at xsbt.CachedCompiler0.run(CompilerInterface.scala:102)
    at xsbt.CompilerInterface.run(CompilerInterface.scala:27)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at sbt.compiler.AnalyzingCompiler.call(AnalyzingCompiler.scala:102)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:48)
    at sbt.compiler.AnalyzingCompiler.compile(AnalyzingCompiler.scala:41)
    at org.jetbrains.jps.incremental.scala.local.IdeaIncrementalCompiler.compile(IdeaIncrementalCompiler.scala:28)
    at org.jetbrains.jps.incremental.scala.local.LocalServer.compile(LocalServer.scala:25)
    at org.jetbrains.jps.incremental.scala.remote.Main$.make(Main.scala:58)
    at org.jetbrains.jps.incremental.scala.remote.Main$.nailMain(Main.scala:21)
    at org.jetbrains.jps.incremental.scala.remote.Main.nailMain(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.martiansoftware.nailgun.NGSession.run(NGSession.java:319)

2. I tested my branch, which updates the Hive version to org.apache.hive 0.13.1. It runs successfully when a bag of third-party jars is used as the dependency, but throws an error when using the assembly jar; it seems the assembly jar leads to a conflict:

ERROR DDLTask: java.lang.NoSuchFieldError: doubleTypeInfo
    at org.apache.hadoop.hive.ql.io.parquet.serde.
Re: [VOTE] Release Apache Spark 1.1.0 (RC3)
Thanks everyone for voting on this. Two minor issues (one a blocker) were found that warrant cutting a new RC. For those who voted +1 on this release, I'd encourage you to +1 rc4 when it comes out unless you have been testing issues specific to the EC2 scripts; this will move the release process along.

SPARK-3332 - Issue with tagging in EC2 scripts
SPARK-3358 - Issue with regression for m3.XX instances

- Patrick

On Tue, Sep 2, 2014 at 6:55 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
In light of the discussion on SPARK-, I'll revoke my -1 vote. The issue does not appear to be serious.
Re: [Spark SQL] off-heap columnar store
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell i...@ianoconnell.com wrote:
> I'm not sure what you mean here? Parquet is at its core just a format; you could store that data anywhere. Though it sounds like you are saying (correct me if I'm wrong): you basically want a columnar abstraction layer where you can provide a different backing implementation to keep the columns, rather than parquet-mr? I.e. you want to be able to produce a SchemaRDD from something like Vertica, where updates act as a write-through cache back to Vertica itself?

Something like that. I'd like:

1) An API to produce a SchemaRDD from an RDD of columns, not rows. However, an RDD[Column] would not make sense, since it would be spread out across partitions. Perhaps what is needed is a Seq[RDD[ColumnSegment]]: each RDD would hold the segments for one column, and each segment would represent a range of rows. This would then read from something like Vertica or Cassandra.

2) A variant of 1) where you could read this data from Tachyon. Tachyon is supposed to support a columnar representation of data; it did for Shark 0.9.x. The goal is basically to load columnar data from something like Cassandra into Tachyon, with the compression ratio of columnar storage and the speed of InMemoryColumnarTableScan. If data is appended to the Tachyon representation, it should be possible to cache it back, though write-back is not as high a priority.

A workaround would be to read the data from Cassandra/Vertica/etc. and write it back into Parquet, but this would take a long time and incur huge I/O overhead.

> I'm sorry, it just sounds like it's worth clearly defining what your key requirement/goal is.
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan velvia.git...@gmail.com wrote:
>> The reason I'm asking about the columnar compressed format is that there are some problems for which Parquet is not practical.
>
> Can you elaborate?

Sure.
- The organization or company has no Hadoop, but a significant investment in some other NoSQL store.
- The need to efficiently add a new column to existing data.
- The need to mark some existing rows as deleted, or to replace small bits of existing data.

For these use cases it would be much more efficient and practical if we didn't have to take the data out of its original datastore and convert it to Parquet first. Doing so adds significant latency and causes Ops headaches in having to maintain HDFS. It would be great to be able to load data directly into the columnar format, i.e. into the InMemoryColumnarCache.
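To make the ask above concrete, here is a rough signature-level sketch of what such a column-oriented construction API could look like. ColumnSegment, rowRange, and createFromColumnSegments are hypothetical names invented for illustration, not existing Spark SQL API; today a SchemaRDD is built from an RDD of rows (e.g. via SQLContext.applySchema), which is exactly what this proposal would sidestep.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, SchemaRDD, StructType}

// Hypothetical: one compressed segment of a single column, covering a row range.
case class ColumnSegment(
    columnName: String,
    rowRange: (Long, Long),        // [start, end) row ids covered by this segment
    compressedBytes: Array[Byte])  // column values, compressed columnar encoding

// Hypothetical entry point: one RDD per column, aligned on row ranges,
// assembled into a SchemaRDD without ever materializing row objects.
// Signature only; the body is deliberately left unimplemented.
def createFromColumnSegments(
    sqlContext: SQLContext,
    columns: Seq[RDD[ColumnSegment]],
    schema: StructType): SchemaRDD = ???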