Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
Could you provide all the pieces of code that reproduce the bug? Here is
my test code:

import org.apache.spark._
import org.apache.spark.SparkContext._

object SimpleApp {

  def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(conf)

val accum = sc.accumulator(0)
for (i <- 1 to 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
}
sc.stop()
  }
}

It works fine in both client and cluster mode. Since this is a serialization
bug, the outer class does matter. Could you provide it? Is there
a SparkContext field in the outer class?
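
For illustration, this is the kind of pattern that triggers such a failure (a hypothetical outer class, not necessarily the reporter's code):

import org.apache.spark.{SparkConf, SparkContext}

// The foreach closure refers to an instance member, so `this` (including its
// SparkContext field, which is not serializable) is pulled into the task
// closure and the job fails with "Task not serializable".
class Outer(sc: SparkContext) {
  val accum = sc.accumulator(0)
  def run(): Unit = {
    // `accum` here is really `this.accum`
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  }
}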

Best Regards,
Shixiong Zhu

2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:

 I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
 if I
 run it in local mode! )

 If I put the accumulator inside the for loop, everything will work fine. I
 guess the bug is that an accumulator can be applied to JUST one RDD.

 Still another undocumented 'feature' of Spark that no one from the people
 who maintain Spark is willing to solve or at least to tell us about ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect if you change it to a while
loop like

var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}

then the problem may go away. I am not super familiar with the closure
cleaner, but I believe that we cannot prune beyond 1 layer of references,
so the extra class of nesting may be screwing something up. If this is the
case, then I would also expect replacing the accumulator with any other
reference to the enclosing scope (such as a broadcast variable) would have
the same result.
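
Another way to sidestep the extra nesting, if that is indeed the cause, is to copy the accumulator into a local val so the task closure captures only that local reference rather than the enclosing class (a sketch, not verified against the original reporter's code):

val accum = sc.accumulator(0)
for (i <- 1 to 10) {
  // the foreach lambda now captures only `localAccum`, not the anonymous
  // class generated for the for loop body
  val localAccum = accum
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => localAccum += x)
}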

On Fri, Nov 7, 2014 at 12:03 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 Could you provide all pieces of codes which can reproduce the bug? Here is
 my test code:

 import org.apache.spark._
 import org.apache.spark.SparkContext._

 object SimpleApp {

   def main(args: Array[String]) {
 val conf = new SparkConf().setAppName("SimpleApp")
 val sc = new SparkContext(conf)

 val accum = sc.accumulator(0)
 for (i <- 1 to 10) {
   sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
 }
 sc.stop()
   }
 }

 It works fine both in client and cluster. Since this is a serialization
 bug, the outer class does matter. Could you provide it? Is there
 a SparkContext field in the outer class?

 Best Regards,
 Shixiong Zhu

 2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:

 I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
 if I
 run it in local mode! )

 If I put the accumulator inside the for loop, everything will work fine. I
 guess the bug is that an accumulator can be applied to JUST one RDD.

 Still another undocumented 'feature' of Spark that no one from the people
 who maintain Spark is willing to solve or at least to tell us about ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this a
bug?


Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

=

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}


object SparkStreamingToParquet extends Logging {


  /**
   *
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
if (args.length < 4) {
  logInfo("Please provide valid parameters: hdfsFilesLocation: " +
    "hdfs://ip:8020/user/hdfs/--/ IMPALAtableloc hdfs://ip:8020/user/hive/--/ " +
    "tablename checkpointDir")
  logInfo("make sure you give the full folder path with '/' at the end, i.e. /user/hdfs/abc/")
  System.exit(1)
}
val HDFS_FILE_LOC = args(0)
val IMPALA_TABLE_LOC  = args(1)
val TEMP_TABLE_NAME = args(2)
val CHECKPOINT_DIR = args(3)

val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, 
()={
  createContext(args)
})

jssc.start
jssc.awaitTermination
  }


  def createContext(args:Array[String]): StreamingContext = {

val HDFS_FILE_LOC = args(0)
val IMPALA_TABLE_LOC  = args(1)
val TEMP_TABLE_NAME = args(2)
val CHECKPOINT_DIR = args(3)

val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")

val jssc: StreamingContext = new StreamingContext(sparkConf, new 
Duration(3))

val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)

hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC,true,org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME);

val schemaString = "name age"
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)

textFileStream.foreachRDD(rdd => {
  if (rdd != null && rdd.count() > 0) {
    val schRdd = hivecontext.jsonRDD(rdd, schema)
    logInfo("inserting into table: " + TEMP_TABLE_NAME)
    schRdd.insertInto(TEMP_TABLE_NAME)
  }
})
jssc.checkpoint(CHECKPOINT_DIR)
jssc
  }
}



case class Person(name:String, age:String) extends Serializable
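
A commonly suggested pattern for combining Spark SQL with a checkpointed StreamingContext, sketched here on the assumption that the NPE comes from the HiveContext reference captured in the checkpointed foreachRDD closure, is to create the context lazily from the RDD's own SparkContext instead:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily-created singleton, so no HiveContext has to live inside the
// checkpointed DStream closure.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

// inside createContext:
textFileStream.foreachRDD(rdd => {
  if (rdd != null && rdd.count() > 0) {
    val hc = HiveContextSingleton.getInstance(rdd.sparkContext)
    val schRdd = hc.jsonRDD(rdd, schema)
    schRdd.insertInto(TEMP_TABLE_NAME)
  }
})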

Regards,
Madhu jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
I am trying to group by on a calculated field. Is it supported in Spark SQL? I 
am running it on a nested JSON structure.
 
Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c 
group by YEAR(c.Patient.DOB)
 
Spark Version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4.
Error: 
 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB 
AS DOB#191) AS c_0#185, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)], 
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS DOB#191) 
AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192, LongType)) AS 
c_1#186L]
 Subquery c
  Subquery claim
   LogicalRDD 
[AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
 MappedRDD[5] at map at JsonRDD.scala:43
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at .<init>(<console>:32)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at 

about write mongodb in mapPartitions

2014-11-07 Thread qinwei






Hi, everyone
    I have come across a problem with writing data to MongoDB in mapPartitions,
my code is as below:

        val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
        // some transformations
        val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
        val newRDD = rdd.mapPartitions(args => {
            val mongoClient = new MongoClient("host", port)
            val db = mongoClient.getDB("db")
            val coll = db.getCollection("collectionA")

            args.map(arg => {
                coll.insert(new BasicDBObject("pkg", arg))
                arg
            })

            mongoClient.close()
            args
        })

        newRDD.saveAsTextFile("hdfs://host:port/path")

    The application saved the data to HDFS correctly, but not to MongoDB. Is
there something wrong? I know that collecting newRDD to the driver and then
saving it to MongoDB will succeed, but will the following saveAsTextFile read
the filesystem once again?

    Thanks

qinwei
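
One likely reason the inserts above never reach MongoDB: mapPartitions hands the function a lazy iterator, so the args.map(...) that performs the inserts is never consumed (its result is discarded) before mongoClient.close() runs, while the untouched args iterator is what saveAsTextFile later consumes. A sketch that forces the writes before closing the client, with names mirroring the snippet above:

import com.mongodb.{BasicDBObject, MongoClient}

val newRDD = rdd.mapPartitions(args => {
  val mongoClient = new MongoClient("host", port)
  val coll = mongoClient.getDB("db").getCollection("collectionA")

  // materialize the mapped iterator so the inserts actually execute
  val written = args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
  }).toList

  mongoClient.close()
  written.iterator
})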



Re: about write mongodb in mapPartitions

2014-11-07 Thread Akhil Das
Why not saveAsNewAPIHadoopFile?


//Define your mongoDB confs

val config = new Configuration()

config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

//Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any],
classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
config)


Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei wei@dewmobile.net wrote:

 Hi, everyone

 I come across with a prolem about writing data to mongodb in
 mapPartitions, my code is as below:

  val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
   // some transformations
 val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
 val newRDD = rdd.mapPartitions(args => {
 val mongoClient = new MongoClient("host", port)
 val db = mongoClient.getDB("db")
 val coll = db.getCollection("collectionA")

 args.map(arg => {
 coll.insert(new BasicDBObject("pkg", arg))
 arg
 })

 mongoClient.close()
 args
 })

 newRDD.saveAsTextFile("hdfs://host:port/path")

 The application saved data to HDFS correctly, but not mongodb, is
 there someting wrong?
 I know that collecting the newRDD to driver and then saving it to
 mongodb will success, but will the following saveAsTextFile read the
 filesystem once again?

 Thanks


 --
 qinwei



Re: multiple spark context in same driver program

2014-11-07 Thread Akhil Das
My bad, I just fired up a spark-shell and created a new sparkContext and it
was working fine. I basically did a parallelize and collect with both
sparkContexts.

Thanks
Best Regards

On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That doc was created during the initial days (Spark 0.8.0), you can of
 course create multiple sparkContexts in the same driver program now.


 You sure about that? According to
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-context-in-local-mode-thread-safe-td7275.html
 (June 2014), you currently can’t have multiple SparkContext objects in the
 same JVM.

 Tobias





Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?  

I have done some work on this topic using JNA.  I wrote a FlatMapFunction
that processes all partition entries using a C++ library.  This approach
works well, but there are some tradeoffs:
 * Shipping the native dylib with the app jar and loading it at runtime
requires a bit of work (on top of normal JNA usage)
 * Native code doesn't respect the executor heap limits.  Under heavy memory
pressure, the native code can sporadically fail with ENOMEM.
 * While JNA can map Strings, structs, and Java primitive types, the user
still needs to deal with more complex objects.  E.g. re-serialize
protobuf/thrift objects, or provide some other encoding for moving data
between Java and C/C++.
 * C++ static initialization is not thread-safe before C++11, so the user sometimes needs
to take care when running inside multi-threaded executors
 * Avoiding memory copies can be a little tricky
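
For reference, a minimal JNA sketch of the approach described above, using a hypothetical native library "demo" that exposes int add(int, int); the shared library still has to be shipped with the job and be discoverable on each executor:

import com.sun.jna.{Library, Native}

trait DemoLib extends Library {
  def add(a: Int, b: Int): Int
}

object DemoLib {
  // loaded lazily, once per executor JVM, via jna.library.path / java.library.path
  lazy val INSTANCE: DemoLib =
    Native.loadLibrary("demo", classOf[DemoLib]).asInstanceOf[DemoLib]
}

// assuming an RDD[Int]; calling per partition lets the library handle be reused
val shifted = rdd.mapPartitions(iter => iter.map(x => DemoLib.INSTANCE.add(x, 1)))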

One other alternative approach comes to mind is pipe().  However, PipedRDD
requires copying data over pipes, does not support binary data (?), and
native code errors that crash the subprocess don't bubble up to the Spark
job as nicely as with JNA.

Is there a way to expose raw, in-memory partition/block data to native code?

Has anybody else attacked this problem a different way?

All the best,
-Paul 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Now it doesn't support such query. I can easily reproduce it. Created a
JIRA here: https://issues.apache.org/jira/browse/SPARK-4296
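
Until that is fixed, one pattern that may sidestep the check is to compute the UDF in an inner query and group by the resulting alias; a sketch (column aliases are illustrative and untested against this schema):

val result = hiveContext.sql("""
  SELECT dobYear, SUM(totalPay)
  FROM (
    SELECT YEAR(c.Patient.DOB) AS dobYear, c.ClaimPay.TotalPayAmnt AS totalPay
    FROM claim c
  ) t
  GROUP BY dobYear
""")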

Best Regards,
Shixiong Zhu

2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com:

 I am trying to group by on a calculated field. Is it supported on spark
 sql? I am running it on a nested json structure.

 Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim
 c group by YEAR(c.Patient.DOB)

 Spark Version: spark-1.2.0-SNAPSHOT wit Hive and hadoop 2.4.
 Error:

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression
 not in GROUP BY:
 HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
 DOB#191) AS c_0#185, tree:
 Aggregate
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)],
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
 DOB#191) AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192,
 LongType)) AS c_1#186L]
  Subquery c
   Subquery claim
LogicalRDD
 [AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
 MappedRDD[5] at map at JsonRDD.scala:43
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
 at
 scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
 at $iwC$$iwC$$iwC$$iwC.init(console:17)
 at $iwC$$iwC$$iwC.init(console:22)
 at $iwC$$iwC.init(console:24)
 at $iwC.init(console:26)
 at init(console:28)
 at .init(console:32)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-11-07 Thread Sree Harsha
@rogthefrog

Were you able to figure out how to fix this issue? 
Even I tried all combinations that possible but no luck yet.

Thanks,
Harsha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MESOS slaves shut down due to 'health check timed out'

2014-11-07 Thread Yangcheng Huang
Hi guys

Do you know how to handle the following case -

= From MESOS log file =
Slave asked to shut down by master@:5050 because 'health
check timed out'
I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework 
===

Any configurations to increase this timeout interval?

Thanks
YC

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Store DStreams into Hive using Hive Streaming

2014-11-07 Thread Luiz Geovani Vier
Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming (
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
that takes care of compaction and doesn't require any downtime for the
table. The data is immediately available and Hive will combine files in
background transparently. I was hoping to use this API from within Spark to
mitigate the issue with lots of small files...

Here's my equivalent code for Trident (work in progress):
https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident will coordinate the transaction and send all the tuples from each
server/partition to your component at once (Stream.partitionPersist). That
is very helpful since Hive expects batches of records instead of one call
for each record.
I had a look at foreachRDD but it seems to be invoked for each record. I'd
like to get all the Stream's records on each server/partition at once.
For example, if the stream was processed by 3 servers and resulted in 100
records on each server, I'd like to receive 3 calls (one on each server),
each with 100 records. Please let me know if I'm making any sense. I'm
fairly new to Spark.
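
For what it's worth, foreachRDD fires once per micro-batch (per RDD), not once per record; combining it with foreachPartition gives one call per partition with an iterator over all of that partition's records, which is closer to the batching Hive Streaming expects. A sketch, where writeBatchToHive is a hypothetical helper (not part of Spark) that would push one batch through the Hive Streaming API:

// assuming dstream: DStream[String]
def writeBatchToHive(records: Iterator[String]): Unit = ???  // hypothetical helper

dstream.foreachRDD { rdd =>
  // one call per partition, with all of that partition's records
  rdd.foreachPartition { (records: Iterator[String]) =>
    writeBatchToHive(records)
  }
}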

Thank you,
-Geovani



On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito 
silvio.fior...@granturing.com wrote:

  Geovani,

  You can use HiveContext to do inserts into a Hive table in a Streaming
 app just as you would a batch app. A DStream is really a collection of RDDs
 so you can run the insert from within the foreachRDD. You just have to be
 careful that you’re not creating large amounts of small files. So you may
 want to either increase the duration of your Streaming batches or
 repartition right before you insert. You’ll just need to do some testing
 based on your ingest volume. You may also want to consider streaming into
 another data store though.

  Thanks,
 Silvio

   From: Luiz Geovani Vier lgv...@gmail.com
 Date: Thursday, November 6, 2014 at 7:46 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Store DStreams into Hive using Hive Streaming

   Hello,

 Is there a built-in way or connector to store DStream results into an
 existing Hive ORC table using the Hive/HCatalog Streaming API?
 Otherwise, do you have any suggestions regarding the implementation of
 such component?

 Thank you,
  -Geovani



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works.

what is your suggestion on saving a distributed model?  so part of the
model is in one cluster, and some other parts of the model are in other
clusters.  during runtime, these sub-models run independently in their own
clusters (load, train, save).  and at some point during run time these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.

much appreciated.
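
for reference, a minimal sketch of the plain java serialization that works here (assuming the model object, e.g. mllib's Word2VecModel, is java.io.Serializable):

import java.io._

// write any serializable model to a local path
def saveModel(model: AnyRef, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

// read it back and cast to the expected type
def loadModel[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}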



On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to running
 save(), same with matlab. In python either pickling things or dumping to
 json seems pretty common. (even the scikit-learn docs recommend pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These all
 seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing hive context

>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
    from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?


Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza.  i'm not familiar with the block matrix multiplication, but
is it a good fit for very large dimension, but extremely sparse matrix?

if not, what is your recommendation on implementing matrix multiplication
in spark on very large dimension, but extremely sparse matrix?




On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:

 See this thread for examples of sparse matrix x sparse matrix:
 https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

 We thought about providing matrix multiplies on CoordinateMatrix, however,
 the matrices have to be very dense for the overhead of having many little
 (i, j, value) objects to be worth it. For this reason, we are focused on
 doing block matrix multiplication first. The goal is version 1.3.

 Best,
 Reza

 On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implement certain aspect of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.

 -Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model 
itself is not distributed. 


I guess the question is why do you need  a distributed model? Is the vocab size 
so large that it's necessary? For model serving in general, unless the model is 
truly massive (ie cannot fit into memory on a modern high end box with 64, or 
128GB ram) then single instance is way faster and simpler (using a cluster of 
machines is more for load balancing / fault tolerance).




What is your use case for model serving?


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.
 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.
 much appreciated.
 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:
 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to running
 save(), same with matlab. In python either pickling things or dumping to
 json seems pretty common. (even the scikit-learn docs recommend pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These all
 seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.

You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...) and
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3.

Then, when you want to reload you'd have to instantiate them into a class
of MatrixFactorizationModel. That class is package private to MLlib right
now, so you'd need to copy the logic over to a new class, but that's the
basic idea.
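
A minimal sketch of that save/reload flow (paths are illustrative):

// save the two factor RDDs
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// reload; the factors come back as RDD[(Int, Array[Double])]
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
// ...then wrap them in your own copy of the MatrixFactorizationModel logic,
// since that class's constructor is package private.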

That said - using spark to serve these recommendations on a point-by-point
basis might not be optimal. There's some work going on in the AMPLab to
address this issue.

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








where is the org.apache.spark.util package?

2014-11-07 Thread ll
i'm trying to compile some of the spark code directly from the source
(https://github.com/apache/spark).  it complains about the missing package
org.apache.spark.util.  it doesn't look like this package is part of the
source code on github. 

where can i find this package?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s 
ms response), then Spark is not the way to go - a serving layer like Oryx, or 
prediction.io is what you want.


(At graphflow we've built our own).




You hold the factor matrices in memory and do the dot product in real time 
(with optional caching). Again, even for huge models (10s of millions 
users/items) this can be handled on a single, powerful instance. The issue at 
this scale is winnowing down the search space using LSH or similar approach to 
get to real time speeds.
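
A minimal sketch of that kind of in-memory serving, assuming the collected factor matrices fit on the serving instance:

val userFactors: Map[Int, Array[Double]] = model.userFeatures.collect().toMap
val itemFactors: Map[Int, Array[Double]] = model.productFeatures.collect().toMap

// plain dot product per (user, item) request
def score(user: Int, item: Int): Double = {
  val u = userFactors(user)
  val v = itemFactors(item)
  var s = 0.0
  var i = 0
  while (i < u.length) { s += u(i) * v(i); i += 1 }
  s
}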




For word2vec it's pretty much the same thing, since what you have is very similar 
to one of the ALS factor matrices.




One problem is that you can't access the word2vec vectors, as they are a private val. I 
think this should be changed, so that just the word vectors could be 
saved and used in a serving layer.


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There are a few examples where this is the case. Let's take ALS, where the
 result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.
 You can save these RDDs out to disk by doing something like
 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)
 to save out to HDFS or Tachyon or S3.
 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.
 That said - using spark to serve these recommendations on a point-by-point
 basis might not be optimal. There's some work going on in the AMPLab to
 address this issue.
 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:
 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: where is the org.apache.spark.util package?

2014-11-07 Thread ll
i found the util package under the spark core package, but now i get this error:
"Symbol Utils is inaccessible from this place".

what does this error mean?

the org.apache.spark.util package and org.apache.spark.util.Utils are there now.

thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360p18361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: deploying a model built in mllib

2014-11-07 Thread chirag lakhani
Thanks for letting me know about this, it looks pretty interesting.  From
reading the documentation it seems that the server must be built on a Spark
cluster, is that correct?  Is it possible to deploy it on a Java
server?  That is how we are currently running our web app.



On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote:

 The latest version of PredictionIO, which is now under Apache 2 license,
 supports the deployment of MLlib models on production.

 The engine you build will including a few components, such as:
 - Data - includes Data Source and Data Preparator
 - Algorithm(s)
 - Serving
 I believe that you can do the feature vector creation inside the Data
 Preparator component.

 Currently, the package comes with two templates: 1)  Collaborative
 Filtering Engine Template - with MLlib ALS; 2) Classification Engine
 Template - with MLlib Naive Bayes. The latter one may be useful to you. And
 you can customize the Algorithm component, too.

 I have just created a doc: http://docs.prediction.io/0.8.1/templates/
 Love to hear your feedback!

 Regards,
 Simon



 On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani chirag.lakh...@gmail.com
  wrote:

 Would pipelining include model export?  I didn't see that in the
 documentation.

 Are there ways that this is being done currently?



 On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com wrote:

 We are working on the pipeline features, which would make this
 procedure much easier in MLlib. This is still a WIP and the main JIRA
 is at:

 https://issues.apache.org/jira/browse/SPARK-1856

 Best,
 Xiangrui

 On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
 chirag.lakh...@gmail.com wrote:
  Hello,
 
  I have been prototyping a text classification model that my company
 would
  like to eventually put into production.  Our technology stack is
 currently
  Java based but we would like to be able to build our models in
 Spark/MLlib
  and then export something like a PMML file which can be used for model
  scoring in real-time.
 
  I have been using scikit learn where I am able to take the training
 data
  convert the text data into a sparse data format and then take the other
  features and use the dictionary vectorizer to do one-hot encoding for
 the
  other categorical variables.  All of those things seem to be possible
 in
  mllib but I am still puzzled about how that can be packaged in such a
 way
  that the incoming data can be first made into feature vectors and then
  evaluated as well.
 
  Are there any best practices for this type of thing in Spark?  I hope
 this
  is clear but if there are any confusions then please let me know.
 
  Thanks,
 
  Chirag






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion.  originally i had a question
specifically about word2vec, but my follow up question on distributed model
is a more general question about saving different types of models.

on distributed model, i was hoping to implement a model parallelism, so
that different workers can work on different parts of the models, and then
merge the results at the end at the single master model.



On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Currently I see the word2vec model is collected onto the master, so the
 model itself is not distributed.

 I guess the question is why do you need  a distributed model? Is the vocab
 size so large that it's necessary? For model serving in general, unless the
 model is truly massive (ie cannot fit into memory on a modern high end box
 with 64, or 128GB ram) then single instance is way faster and simpler
 (using a cluster of machines is more for load balancing / fault tolerance).

 What is your use case for model serving?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org









Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs, which is much
more convenient for saving and loading.

my question is for the use case that they are not represented as RDDs yet.

then, do you think it makes sense to convert them into RDDs, just for the
convenience of saving and loading them in a distributed way?
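
if you do go that route, the conversion itself is cheap; a sketch with a toy local map and illustrative paths:

// hypothetical local map of word vectors pulled out of a trained model
val wordVectors: Map[String, Array[Float]] =
  Map("spark" -> Array(0.1f, 0.2f), "mllib" -> Array(0.3f, 0.4f))

// parallelize purely so the vectors can be saved/loaded through the RDD path
sc.parallelize(wordVectors.toSeq).saveAsObjectFile("hdfs:///models/word2vec")

val reloaded: Map[String, Array[Float]] =
  sc.objectFile[(String, Array[Float])]("hdfs:///models/word2vec").collect().toMap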

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There are a few examples where this is the case. Let's take ALS, where the
 result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.

 You can save these RDDs out to disk by doing something like

 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)

 to save out to HDFS or Tachyon or S3.

 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.

 That said - using spark to serve these recommendations on a point-by-point
 basis might not be optimal. There's some work going on in the AMPLab to
 address this issue.

 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org









Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thansk nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into some compilation
issues now.  hopefully i can fix them soon, to get this thing going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 For ALS if you want real time recs (and usually this is order 10s to a few
 100s ms response), then Spark is not the way to go - a serving layer like
 Oryx, or prediction.io is what you want.

 (At graphflow we've built our own).

 You hold the factor matrices in memory and do the dot product in real time
 (with optional caching). Again, even for huge models (10s of millions
 users/items) this can be handled on a single, powerful instance. The issue
 at this scale is winnowing down the search space using LSH or similar
 approach to get to real time speeds.

 For word2vec it's pretty much the same thing as what you have is very
 similar to one of the ALS factor matrices.

 One problem is you can't access the wors2vec vectors as they are private
 val. I think this should be changed actually, so that just the word vectors
 could be saved and used in a serving layer.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There are a few examples where this is the case. Let's take ALS, where
 the result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.

 You can save these RDDs out to disk by doing something like

 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)

 to save out to HDFS or Tachyon or S3.

 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.

 That said - using spark to serve these recommendations on a
 point-by-point basis might not be optimal. There's some work going on in
 the AMPLab to address this issue.

 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com
 wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These
 all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if
 you're in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained
 and reload
 it in the future?  specifically, i'm using the mllib word2vec
 model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org










Re: AVRO specific records

2014-11-07 Thread Simone Franzini
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that
I have not fully solved yet. I am able to run with Hadoop1 and AVRO in
standalone mode but not with Hadoop2 (even after trying to fix the
dependencies).

Anyway, I am now trying to write to AVRO, using a very similar snippet to
the one to read from AVRO:

val withValues : RDD[(AvroKey[Subscriber], NullWritable)] =
  records.map { s => (new AvroKey(s), NullWritable.get) }
val outPath = myOutputPath
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
classOf[AvroKey[Subscriber]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[Subscriber]],
writeJob.getConfiguration)

Now, my problem is that this writes to a plain text file. I need to write
to binary AVRO. What am I missing?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini captainfr...@gmail.com
wrote:

 Benjamin,

 Thanks for the snippet. I have tried using it, but unfortunately I get the
 following exception. I am clueless at what might be wrong. Any ideas?

 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin 
 benjamin.la...@capitalone.com wrote:

 Something like this works and is how I create an RDD of specific records.

 val avroRdd = sc.newAPIHadoopFile("twitter.avro",
   classOf[AvroKeyInputFormat[twitter_schema]],
   classOf[AvroKey[twitter_schema]], classOf[NullWritable], conf)
 (From
 https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
 Keep in mind you'll need to use the kryo serializer as well.
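
 As a hedged reminder of what that Kryo setting looks like (only the serializer switch is
 shown; registering the generated Avro classes with a custom registrator may also be needed):

 import org.apache.spark.SparkConf

 // Switch Spark to the Kryo serializer before creating the SparkContext.
 val conf = new SparkConf()
   .setAppName("avro-specific-records")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")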

 From: Frank Austin Nothaft fnoth...@berkeley.edu
 Date: Wednesday, November 5, 2014 at 5:06 PM
 To: Simone Franzini captainfr...@gmail.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: AVRO specific records

 Hi Simone,

 Matt Massie put together a good tutorial on his blog
 http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. If you’re
 looking for more code using Avro, we use it pretty extensively in our
 genomics project. Our Avro schemas are here
 https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl,
 and we have serialization code here
 https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization.
 We use Parquet for storing the Avro records, but there is also an Avro
 HadoopInputFormat.

 Regards,

 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466

 On Nov 5, 2014, at 1:25 PM, Simone Franzini captainfr...@gmail.com
 wrote:

 How can I read/write AVRO specific records?
 I found several snippets using generic records, but nothing with specific
 records so far.

 Thanks,
 Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini



 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.





Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j,
value) entries, then you can try the algorithms mentioned in the post
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought
up earlier.

Reza

On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com wrote:

 thanks reza.  i'm not familiar with the block matrix multiplication, but
 is it a good fit for very large dimension, but extremely sparse matrix?

 if not, what is your recommendation on implementing matrix multiplication
 in spark on very large dimension, but extremely sparse matrix?




 On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:

 See this thread for examples of sparse matrix x sparse matrix:
 https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

 We thought about providing matrix multiplies on CoordinateMatrix,
 however, the matrices have to be very dense for the overhead of having many
 little (i, j, value) objects to be worth it. For this reason, we are
 focused on doing block matrix multiplication first. The goal is version 1.3.

 Best,
 Reza

 On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implement certain aspect of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.
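
A hedged sketch of that join-based block multiply, assuming block ids and dimensions line
up and that the Breeze version in use supports local multiply and add on CSCMatrix;
reduceByKey is used here to sum partial products (aggregateByKey works as well):

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Multiply two block-partitioned sparse matrices given as (blockRowId, blockColId, sub-matrix).
def blockMultiply(
    a: RDD[(Int, Int, CSCMatrix[Double])],
    b: RDD[(Int, Int, CSCMatrix[Double])]): RDD[((Int, Int), CSCMatrix[Double])] = {
  // Key A's blocks by column id and B's blocks by row id so that blocks
  // sharing the inner index k meet in the join.
  val aByInner = a.map { case (i, k, block) => (k, (i, block)) }
  val bByInner = b.map { case (k, j, block) => (k, (j, block)) }
  aByInner.join(bByInner)
    .map { case (_, ((i, aBlock), (j, bBlock))) => ((i, j), aBlock * bBlock) } // local Breeze multiply
    .reduceByKey(_ + _)                                                        // sum partial products
}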

 -Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Simon Chan
Just want to elaborate more on Duy's suggestion on using PredictionIO.

PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:

 def train(data: PreparedData): PersistentMatrixFactorizationModel = {
   val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
   new PersistentMatrixFactorizationModel(
     rank = m.rank,
     userFeatures = m.userFeatures,
     productFeatures = m.productFeatures)
 }


And the persisted model will be passed to the predict function when you
query for prediction:

def predict(
    model: PersistentMatrixFactorizationModel,
    query: Query): PredictedResult = {
  val productScores = model.recommendProducts(query.user, query.num)
    .map(r => ProductScore(r.product, r.rating))
  new PredictedResult(productScores)
}



Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon


On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Sure - in theory this sounds great. But in practice it's much faster and a
 whole lot simpler to just serve the model from a single instance in memory.
 Optionally you can multithread within that (as Oryx 1 does).

 There are very few real world use cases where the model is so large that
 it HAS to be distributed.

 Having said this, it's certainly possible to distribute model serving for
 factor-like models (like ALS). One idea I'm working on now is using
 Elasticsearch for exactly this purpose - but that's more because I'm using it
 for filtering of recommendation results and combining with search, so
 overall it's faster to do it this way.

 For the pure matrix algebra part, single instance in memory is way faster.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh duy.huynh@gmail.com wrote:

 hi nick.. sorry about the confusion.  originally i had a question
 specifically about word2vec, but my follow up question on distributed model
 is a more general question about saving different types of models.

 on distributed model, i was hoping to implement a model parallelism, so
 that different workers can work on different parts of the models, and then
 merge the results at the end at the single master model.



 On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 Currently I see the word2vec model is collected onto the master, so the
 model itself is not distributed.

 I guess the question is why do you need  a distributed model? Is the
 vocab size so large that it's necessary? For model serving in general,
 unless the model is truly massive (ie cannot fit into memory on a modern
 high end box with 64, or 128GB ram) then single instance is way faster and
 simpler (using a cluster of machines is more for load balancing / fault
 tolerance).

 What is your use case for model serving?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of
 the model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These
 all seem basically equivalent to Java serialization to me.

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the
 most common feature for any machine learning work - to be able to save 
 your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
  wrote:

 Plain old java serialization is one straightforward approach if
 you're in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained
 and reload
 it in the future?  specifically, i'm using the mllib word2vec
 model...
 thanks.



 --
 View this message in context:
 

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
Perhaps if you can describe what you are trying to accomplish at high level
it'll be easier to help.

On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu 
madhu.jahagir...@philips.com wrote:

 Any idea on this?
 
 From: Jahagirdar, Madhu
 Sent: Thursday, November 06, 2014 12:28 PM
 To: Michael Armbrust
 Cc: u...@spark.incubator.apache.org
 Subject: RE: Dynamically InferSchema From Hive and Create parquet file

 When I create a Hive table with Parquet format, it does not create any
 metadata until data is inserted. So the data needs to be there before I can
 infer the schema; otherwise it throws an error. Any workaround for this?
 
 From: Michael Armbrust [mich...@databricks.com]
 Sent: Thursday, November 06, 2014 12:27 AM
 To: Jahagirdar, Madhu
 Cc: u...@spark.incubator.apache.org
 Subject: Re: Dynamically InferSchema From Hive and Create parquet file

 That method is for creating a new directory to hold parquet data when
 there is no hive metastore available, thus you have to specify the schema.

 If you've already created the table in the metastore you can just query it
 using the sql method:

 javahiveContext.sql("SELECT * FROM parquetTable");

 You can also load the data as a SchemaRDD without using the metastore
 since parquet is self describing:


 javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

 On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu 
 madhu.jahagir...@philips.com wrote:
 Currently the createParquetFile method needs a bean class as one of the parameters:

 javahiveContext.createParquetFile(XBean.class,
     IMPALA_TABLE_LOC, true, new Configuration())
   .registerTempTable(TEMP_TABLE_NAME);

 Is it possible to dynamically infer the schema from Hive using the hive
 context and the table name, and then pass that schema in?


 Regards.

 Madhu Jahagirdar






 
 The information contained in this message may be confidential and legally
 protected under applicable law. The message is intended solely for the
 addressee(s). If you are not the intended recipient, you are hereby
 notified that any use, forwarding, dissemination, or reproduction of this
 message is strictly prohibited and may be unlawful. If you are not the
 intended recipient, please contact the sender by return e-mail and destroy
 all copies of the original message.




partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All,

I'm using Spark/Shark as the foundation for some reporting that I'm doing
and have a customers table with approximately 3 million rows that I've
cached in memory.

I've also created a partitioned table that I've also cached in memory on a
per day basis

FROM
customers_cached
INSERT OVERWRITE TABLE
part_customers_cached
PARTITION(createday)
SELECT id,email,dt_cr, to_date(dt_cr) as createday where
dt_cr > unix_timestamp('2013-01-01 00:00:00') and
dt_cr < unix_timestamp('2013-12-31 23:59:59');
set exec.dynamic.partition=true;

set exec.dynamic.partition.mode=nonstrict;

however when I run the following basic tests I get this type of performance

[localhost:1] shark> select count(*) from part_customers_cached where
 createday >= '2014-08-01' and createday <= '2014-12-06';
37204
Time taken (including network latency): 3.131 seconds

[localhost:1] shark> SELECT count(*) from customers_cached where
dt_cr > unix_timestamp('2013-08-01 00:00:00') and
dt_cr < unix_timestamp('2013-12-06 23:59:59');
37204
Time taken (including network latency): 1.538 seconds

I'm running this on a cluster with one master and two slaves and was hoping
that the partitioned table would be noticeably faster but it looks as
though the partitioning has slowed things down... Is this the case, or is
there some additional configuration that I need to do to speed things up?

Best Wishes,

Gordon


Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error

2014-11-07 Thread ryaminal
We are unable to run more than one application at a time using Spark 1.0.0 on
CDH5. We submit two applications using two different SparkContexts on the
same Spark Master. The Spark Master was started using the following command
and parameters and is running in standalone mode:

 /usr/java/jdk1.7.0_55-cloudera/bin/java   -XX:MaxPermSize=128m  
 -Djava.net.preferIPv4Stack=true   -Dspark.akka.logLifecycleEvents=true  
 -Xms8589934592   -Xmx8589934592   org.apache.spark.deploy.master.Master
 --ip ip-10-186-155-45.ec2.internal

When submitting this application by itself it finishes and all of the data
comes out happy. The problem occurs when trying to run another application
while an existing application is still processing, and we get an error
stating that the spark contexts were shut down prematurely. The errors can be
viewed in the following pastebins. All IP addresses have been changed to
1.1.1.1 for security reasons. Notice that at the top of the logs we have
printed out the spark config stuff for reference.

The working logs: http://pastebin.com/CnitnMhy
The broken logs: http://pastebin.com/VGs87bBZ

We have also included the worker logs. For the second app, we see in the
work/app/ directory 7 additional directories: `0/ 1/ 2/ 3/ 4/ 5/ 6/`. There
are then two different groups of errors. The first three are one group and
the other 4 are the other group of errors.

Worker log for broken app group 1: http://pastebin.com/7VwZ1Gwu
Worker log for broken app group 2: http://pastebin.com/shs4d8T4
Worker log for working app: available upon request.

The two different errors are the last lines of both groups and are:

 Received LaunchTask command but executor was null

 Slave registration failed: Duplicate executor ID: 4

tl;dr: We are unable to run more than one application in the same spark master
using different spark contexts. The only errors we see are broadcast errors.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Applications-Spark-Contexts-Concurrently-Fail-With-Broadcast-Error-tp18374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special maven target to build the
scaladocs, although an arguably very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
build/docs/api directory referenced in api.html is missing. How do I build
it?

Alex Baretta


jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). 
I'd like some of the json fields to be in a MapType rather than a sub
StructType, as the keys will be very sparse.

For example:
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 val jsonRdd = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRdd = sqlContext.jsonRDD(jsonRdd)
 schemaRdd.printSchema
root
 |-- attributes: struct (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- location: string (nullable = true)
 |-- key: string (nullable = true)
 schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
[[null,nyc],4321])


However this isn't what I want.  So I created my own StructType to pass to
the jsonRDD call:

 import org.apache.spark.sql._
 val st = StructType(Seq(StructField("key", StringType, false),
   StructField("attributes", MapType(StringType, StringType, false))))
 val jsonRddSt = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
 schemaRddSt.printSchema
root
 |-- key: string (nullable = false)
 |-- attributes: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 schemaRddSt.collect
***  Failure  ***
scala.MatchError: MapType(StringType,StringType,false) (of class
org.apache.spark.sql.catalyst.types.MapType)
at 
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
...

The schema of the schemaRDD is correct.  But it seems that the json cannot
be coerced to a MapType.  I can see at the line in the stack trace that
there is no case statement for MapType.  Is there something I'm missing?  Is
this a bug or decision to not support MapType with json?

Thanks,
Brian




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the
instructions here
https://github.com/apache/spark/blob/master/docs/README.md.

Did you give those a shot?

It's annoying to have a separate thing with new dependencies in order to
build the web docs, but that's how it is at the moment.

Nick

On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I finally came to realize that there is a special maven target to build
 the scaladocs, although an arguably very unintuitive one: mvn verify. So now
 I have scaladocs for each package, but not for the whole spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta



Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this.
Generally, terms like "subset" and "denormalization" sound generic, but such
computations tend to have very specific requirements, so it is hard to point
to a design pattern without more information about yours.

If you want to feed back to kafka, you can take a look at this pull request

https://github.com/apache/spark/pull/2994

On Thu, Nov 6, 2014 at 4:15 PM, bdev buntu...@gmail.com wrote:

 We are looking at consuming the kafka stream using Spark Streaming and
 transform into various subsets like applying some transformation or
 de-normalizing some fields, etc. and feed it back into Kafka as a different
 topic for downstream consumers.

 Wanted to know if there are any existing patterns for achieving this.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Any-patterns-for-multiplexing-the-streaming-data-tp18303.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian,

Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.

Thanks,

Yin

On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote:

 I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
 I'd like some of the json fields to be in a MapType rather than a sub
 StructType, as the keys will be very sparse.

 For example:
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD
  val jsonRdd = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
  val schemaRdd = sqlContext.jsonRDD(jsonRdd)
  schemaRdd.printSchema
 root
  |-- attributes: struct (nullable = true)
  |    |-- gender: string (nullable = true)
  |    |-- location: string (nullable = true)
  |-- key: string (nullable = true)
  schemaRdd.collect
 res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
 [[null,nyc],4321])


 However this isn't what I want.  So I created my own StructType to pass to
 the jsonRDD call:

  import org.apache.spark.sql._
  val st = StructType(Seq(StructField("key", StringType, false),
    StructField("attributes", MapType(StringType, StringType, false))))
  val jsonRddSt = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
  val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
  schemaRddSt.printSchema
 root
  |-- key: string (nullable = false)
  |-- attributes: map (nullable = true)
  |    |-- key: string
  |    |-- value: string (valueContainsNull = false)
  schemaRddSt.collect
 ***  Failure  ***
 scala.MatchError: MapType(StringType,StringType,false) (of class
 org.apache.spark.sql.catalyst.types.MapType)
 at
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
 ...

 The schema of the schemaRDD is correct.  But it seems that the json cannot
 be coerced to a MapType.  I can see at the line in the stack trace that
 there is no case statement for MapType.  Is there something I'm missing?
 Is
 this a bug or decision to not support MapType with json?

 Thanks,
 Brian




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: deploying a model built in mllib

2014-11-07 Thread Donald Szeto
Hi Chirag,

Could you please provide more information on your Java server environment?

Regards,
Donald

On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani chirag.lakh...@gmail.com
wrote:

 Thanks for letting me know about this, it looks pretty interesting.  From
 reading the documentation it seems that the server must be built on a Spark
 cluster, is that correct?  Is it possible to deploy it in on a Java
 server?  That is how we are currently running our web app.



 On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote:

 The latest version of PredictionIO, which is now under Apache 2 license,
 supports the deployment of MLlib models on production.

 The engine you build will include a few components, such as:
 - Data - includes Data Source and Data Preparator
 - Algorithm(s)
 - Serving
 I believe that you can do the feature vector creation inside the Data
 Preparator component.

 Currently, the package comes with two templates: 1)  Collaborative
 Filtering Engine Template - with MLlib ALS; 2) Classification Engine
 Template - with MLlib Naive Bayes. The latter one may be useful to you. And
 you can customize the Algorithm component, too.

 I have just created a doc: http://docs.prediction.io/0.8.1/templates/
 Love to hear your feedback!

 Regards,
 Simon



 On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani 
 chirag.lakh...@gmail.com wrote:

 Would pipelining include model export?  I didn't see that in the
 documentation.

 Are there ways that this is being done currently?



 On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com
 wrote:

 We are working on the pipeline features, which would make this
 procedure much easier in MLlib. This is still a WIP and the main JIRA
 is at:

 https://issues.apache.org/jira/browse/SPARK-1856

 Best,
 Xiangrui

 On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
 chirag.lakh...@gmail.com wrote:
  Hello,
 
  I have been prototyping a text classification model that my company
 would
  like to eventually put into production.  Our technology stack is
 currently
  Java based but we would like to be able to build our models in
 Spark/MLlib
  and then export something like a PMML file which can be used for model
  scoring in real-time.
 
  I have been using scikit-learn, where I am able to take the training
  data, convert the text data into a sparse format, and then take the
  other features and use the dictionary vectorizer to do one-hot encoding
  for the other categorical variables.  All of those things seem to be
  possible in mllib, but I am still puzzled about how that can be packaged
  in such a way that the incoming data can be first made into feature
  vectors and then evaluated as well.
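
For the text part of that workflow, a rough MLlib analogue of the vectorization step is a
hashing vectorizer; a hedged sketch is below, assuming labels and tokenization are prepared
upstream and treating the feature dimension as a free choice:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Turn (label, tokens) pairs into LabeledPoints with hashed term-frequency features.
def vectorize(docs: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
  val hashingTF = new HashingTF(1 << 18)
  docs.map { case (label, tokens) => LabeledPoint(label, hashingTF.transform(tokens)) }
}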
 
  Are there any best practices for this type of thing in Spark?  I hope
 this
  is clear but if there are any confusions then please let me know.
 
  Thanks,
 
  Chirag







-- 
Donald Szeto
PredictionIO


Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, 
 Don't be worried - you're not the only one to be bitten by this. A little
inspection of the Javadoc told me you have this other option: 

JavaRDD<Integer> distData = sc.parallelize(data, 100);

-- Now the RDD is split into 100 partitions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallelize-on-spark-context-tp18327p18381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient
amount of time, the stderr file grows until the disk is full at 100% and
crashes the cluster. I've read this

https://github.com/apache/spark/pull/895

and also read this

http://spark.apache.org/docs/latest/configuration.html#spark-streaming


So I've tried testing with this in an attempt to get the stderr log file to
roll.

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")


Yet it does not roll and continues to grow. Am I missing something obvious?


thanks,
Duc


Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi ,

I have been working on Spark SQL and want to expose this functionality to
other applications. The idea is to let other applications send SQL to be
executed on the spark cluster and get the result back. I looked at spark job
server (https://github.com/ooyala/spark-jobserver) but it provides a RESTful
interface. I am looking for something similar as
spring-hadoop(http://projects.spring.io/spring-hadoop/) to do a spark-submit
programmatically.
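
For reference, a hedged sketch of one alternative shape for this: embedding Spark directly
in the calling application instead of going through spark-submit. The master URL, file
path, and table name below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Long-lived context owned by the application that wants to run SQL.
val conf = new SparkConf().setAppName("sql-service").setMaster("spark://master:7077")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Register some data once, then serve ad-hoc queries from callers.
sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")

def runQuery(query: String): Array[org.apache.spark.sql.Row] =
  sqlContext.sql(query).collect()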

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
Hi,

I'm a committer on that spring-hadoop project and I'm also interested in
integrating Spark with other Java applications. I would love to see some
guidance from the Spark community for the best way to accomplish this. We
have plans to add features to work with Spark Apps in similar ways we now
support Hive and Pig jobs in the spring-hadoop project. In fact, I added a
spring-hadoop-spark sub-project earlier, but there is no real code there
yet. Hoping to get this added soon, so some helpful pointers would be great.

-Thomas

[1]
https://github.com/spring-projects/spring-hadoop/tree/master/spring-hadoop-spark/src/main/java/org/springframework/data/hadoop/spark

On Fri, Nov 7, 2014 at 5:42 PM, gtinside gtins...@gmail.com wrote:

 Hi ,

 I have been working on Spark SQL and want to expose this functionality to
 other applications. The idea is to let other applications send SQL to be
 executed on the spark cluster and get the result back. I looked at spark job
 server (https://github.com/ooyala/spark-jobserver) but it provides a
 RESTful
 interface. I am looking for something similar as
 spring-hadoop(http://projects.spring.io/spring-hadoop/) to do a
 spark-submit
 programmatically.

 Regards,
 Gaurav



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: error when importing HiveContext

2014-11-07 Thread Davies Liu
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to
set it up yourself:

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip

On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
 I’m getting this error when importing hive context



 from pyspark.sql import HiveContext

 Traceback (most recent call last):

   File "<stdin>", line 1, in <module>

   File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>

 from pyspark.context import SparkContext

   File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>

 from pyspark.java_gateway import launch_gateway

   File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in
 <module>

 from py4j.java_gateway import java_import, JavaGateway, GatewayClient

 ImportError: No module named py4j.java_gateway



 I cannot find py4j on my system. Where is it?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and 
hive 0.9.0.

When using python 2.7
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

I'm getting 'sc not defined'

On the other hand, I can see 'sc' from pyspark CLI.

Is there a way to fix it?


MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist MatrixFactorizationModel (Collaborative Filtering 
example) and use it in another script to evaluate/apply it.

This is the exception I get when I try to use a deserialized model instance:

Exception in thread main java.lang.NullPointerException
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:189)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
at 
org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at 
org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)

...

Is this model serializable at all? I noticed it has two RDDs inside
(user and product features).


Thanks,



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MatrixFactorizationModel serialization

2014-11-07 Thread Sean Owen
Serializable like a Java object? no, it's an RDD. A factored matrix
model is huge, unlike most models, and is not a local object. You can
of course persist the RDDs to storage manually and read them back.

On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz
darek.kobyl...@gmail.com wrote:
 I am trying to persist MatrixFactorizationModel (Collaborative Filtering
 example) and use it in another script to evaluate/apply it.
 This is the exception I get when I try to use a deserialized model instance:

 Exception in thread main java.lang.NullPointerException
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at
 org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
 at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
 at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
 at java.util.TimSort.sort(TimSort.java:189)
 at java.util.TimSort.sort(TimSort.java:173)
 at java.util.Arrays.sort(Arrays.java:659)
 at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
 at scala.collection.AbstractSeq.sorted(Seq.scala:40)
 at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
 at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
 at
 org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
 at
 org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
 at
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)
 ...

 Is this model serializable at all? I noticed it has two RDDs inside (user
 and product features).

 Thanks,



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output
after submitting the SparkPi example in yarn cluster mode
(http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html)
using:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster
--master yarn
$SPARK_HOME/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.3.jar 10

Output (repeated):

14/11/07 19:33:05 INFO Client: Application report from ASM: 
 application identifier: application_1415303569855_1100
 appId: 1100
 clientToAMToken: null
 appDiagnostics: 
 appMasterHost: N/A
 appQueue: root.yp
 appMasterRpcPort: -1
 appStartTime: 1415406486231
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED

I'll note that spark-submit is working correctly when running with master
local on the edge node.  

Any ideas how to solve this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-endlessly-in-yarnAppState-ACCEPTED-tp18391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread jayunit100
Sounds like no free yarn workers. i.e. try running:

hadoop jar hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1

We have some smoke tests which you might find particularly useful for yarn 
clusters as well in https://github.com/apache/bigtop, underneath 
bigtop-tests/smoke-tests which are generally good to 
run on any basic hadoop cluster when you first set it up.

Often if a Yarn job hangs in accepted state, it is waiting for resources to 
free up to start the tasks...

On Nov 7, 2014, at 7:40 PM, YaoPau jonrgr...@gmail.com wrote:

 appStartTime



Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-07 Thread Davies Liu
Could you tell us how large the data set is? It will help us debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
 I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
 the same bug running the 'sort.py' example. On a smaller data set, it worked
 fine. On a larger data set I got this error:

 Traceback (most recent call last):
   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
 <module>
  .sortByKey(lambda x: x)
   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
 bounds.append(samples[index])
 IndexError: list index out of range



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain
Spark.

14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)

Full stack below 

I tried many different things without luck (a related per-application setting
is sketched after this list):
* extracted the libsnappyjava.so from the Spark assembly and put it on the
  library path
* added -Djava.library.path=... to SPARK_MASTER_OPTS and SPARK_WORKER_OPTS
* added the library path to SPARK_LIBRARY_PATH
* added the hadoop library path to SPARK_LIBRARY_PATH
* rebuilt spark with different versions (previous and next) of Snappy
  (as seen when Google-ing)
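
A hedged sketch of that per-application variant; the CDH parcel path below is an
assumption, so use whatever directory actually holds libsnappy*.so and libhadoop.so:

import org.apache.spark.SparkConf

// Point the driver and executors at the Hadoop native libs that contain libsnappy.
val conf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")
  .set("spark.executor.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")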


Env :
   Centos 6.4
   Hadoop 2.3 (CDH5.1)
   Running in standalone/local mode


Any help would be appreciated

Thank you

Stephane


scala> import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.BytesWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.NullWritable

scala> var seq =
sc.sequenceFile[NullWritable,Text]("/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0").map(_._2.toString())
14/11/07 19:46:19 INFO MemoryStore: ensureFreeSpace(157973) called with
curMem=0, maxMem=278302556
14/11/07 19:46:19 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 154.3 KB, free 265.3 MB)
seq: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at console:15

scala> seq.collect().foreach(println)
14/11/07 19:46:35 INFO FileInputFormat: Total input paths to process : 1
14/11/07 19:46:35 INFO SparkContext: Starting job: collect at console:18
14/11/07 19:46:35 INFO DAGScheduler: Got job 0 (collect at console:18)
with 2 output partitions (allowLocal=false)
14/11/07 19:46:35 INFO DAGScheduler: Final stage: Stage 0(collect at
console:18)
14/11/07 19:46:35 INFO DAGScheduler: Parents of final stage: List()
14/11/07 19:46:35 INFO DAGScheduler: Missing parents: List()
14/11/07 19:46:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at
map at console:15), which has no missing parents
14/11/07 19:46:35 INFO MemoryStore: ensureFreeSpace(2928) called with
curMem=157973, maxMem=278302556
14/11/07 19:46:35 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 2.9 KB, free 265.3 MB)
14/11/07 19:46:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[2] at map at console:15)
14/11/07 19:46:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/07 19:46:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/07 19:46:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:6504064+6504065
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:0+6504064
14/11/07 19:46:36 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
14/11/07 19:46:36 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap
14/11/07 19:46:36 INFO deprecation: mapred.task.partition is deprecated.
Instead, use mapreduce.task.partition
14/11/07 19:46:36 INFO deprecation: mapred.job.id is deprecated. Instead,
use mapreduce.job.id
14/11/07 19:46:36 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
14/11/07 19:46:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
at
org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190)
at
org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
at
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:197)
at 

How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Fang


Fwd: How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Tim


Re: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread ll
hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of spark cores used you must set two parameters in the actual
spark-submit script. You must set --num-executors (the number of executors to launch)
and --executor-cores (the number of cores per executor). Please see the Spark
configuration and tuning pages for more details.


-Original Message-
From: ll [duy.huynh@gmail.commailto:duy.huynh@gmail.com]
Sent: Saturday, November 08, 2014 12:05 AM Eastern Standard Time
To: u...@spark.incubator.apache.org
Subject: Re: Fwd: Why is Spark not using all cores on a single machine?


hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way.
Following is the correct way:

val table = sc.textFile(args(1))
val histMap = table.map(x => (x.split('|')(0).toInt, 1))
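
A hedged follow-up sketch: the original mutable map stays empty because the loop body is a
closure that runs on the executors against serialized copies, so the driver-side map is
never updated. To get an actual histogram back on the driver, aggregate on the cluster and
collect the result:

val counts: scala.collection.Map[Int, Long] =
  table.map(_.split('|')(0).toInt).countByValue()

// Equivalent with an explicit reduce:
// table.map(x => (x.split('|')(0).toInt, 1L)).reduceByKey(_ + _).collectAsMap()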




-
Lalit Yadav
la...@sigmoidanalytics.com
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-elements-into-map-tp18395p18399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Viewing web UI after fact

2014-11-07 Thread Arun Ahuja
We are running our applications through YARN and are only sometimes seeing
them in the History Server.  Most do not seem to have the
APPLICATION_COMPLETE file.  Specifically any job that ends because of yarn
application -kill does not show up.  For other ones what would be a reason
for them not to appear in the Spark UI?  Is there any update on this?

Thanks,
Arun

On Mon, Sep 15, 2014 at 4:10 AM, Grzegorz Białek 
grzegorz.bia...@codilime.com wrote:

 Hi Andrew,

 sorry for late response. Thank you very much for solving my problem. There
 was no APPLICATION_COMPLETE file in log directory due to not calling
 sc.stop() at the end of program. With stopping spark context everything
 works correctly, so thank you again.

 Best regards,
 Grzegorz


 On Fri, Sep 5, 2014 at 8:06 PM, Andrew Or and...@databricks.com wrote:

 Hi Grzegorz,

 Can you verify that there are APPLICATION_COMPLETE files in the event
 log directories? E.g. Does
 file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If
 not, it could be that your application didn't call sc.stop(), so the
 ApplicationEnd event is not actually logged. The HistoryServer looks for
 this special file to identify applications to display. You could also try
 manually adding the APPLICATION_COMPLETE file to this directory; the
 HistoryServer should pick this up and display the application, though the
 information displayed will be incomplete because the log did not capture
 all the events (sc.stop() does a final close() on the file written).
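
 A minimal sketch of the application-side requirement described above (the job body is
 illustrative):

 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("my-app"))
 try {
   sc.parallelize(1 to 100).count()
 } finally {
   sc.stop()  // closes the event log so APPLICATION_COMPLETE gets written
 }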

 Andrew


 2014-09-05 1:50 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com:

 Hi Andrew,

 thank you very much for your answer. Unfortunately it still doesn't
 work. I'm using Spark 1.0.0, and I start history server running
 sbin/start-history-server.sh dir, although I also set
  SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in
 conf/spark-env.sh. I tried also other dir than /tmp/spark-events which
 have all possible permissions enabled. Also adding file: (and file://)
 didn't help - history server still shows:
 History Server
 Event Log Location: file:/tmp/spark-events/
 No Completed Applications Found.

 Best regards,
 Grzegorz


 On Thu, Sep 4, 2014 at 8:20 PM, Andrew Or and...@databricks.com wrote:

 Hi Grzegorz,

 Sorry for the late response. Unfortunately, if the Master UI doesn't
 know about your applications (they are completed with respect to a
 different Master), then it can't regenerate the UIs even if the logs exist.
 You will have to use the history server for that.

  How did you start the history server? If you are using Spark >= 1.0, you
 can pass the directory as an argument to the sbin/start-history-server.sh
 script. Otherwise, you may need to set the following in your
 conf/spark-env.sh to specify the log directory:

 export
 SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/tmp/spark-events

 It could also be a permissions thing. Make sure your logs in
 /tmp/spark-events are accessible by the JVM that runs the history server.
 Also, there's a chance that /tmp/spark-events is interpreted as an HDFS
 path depending on which Spark version you're running. To resolve any
 ambiguity, you may set the log path to file:/tmp/spark-events instead.
 But first verify whether they actually exist.

 Let me know if you get it working,
 -Andrew



 2014-08-19 8:23 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com
 :

 Hi,
 Is there any way view history of applications statistics in master ui
 after restarting master server? I have all logs ing /tmp/spark-events/ but
 when I start history server in this directory it says No Completed
 Applications Found. Maybe I could copy this logs to dir used by master
 server but I couldn't find any. Or maybe I'm doing something wrong
 launching history server.
 Do you have any idea how to solve it?

 Thanks,
 Grzegorz


 On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek 
 grzegorz.bia...@codilime.com wrote:

 Hi,

 Thank you both for your answers. Browsing using Master UI works fine.
 Unfortunately History Server shows No Completed Applications Found even
 if logs exists under given directory, but using Master UI is enough for 
 me.

 Best regards,
 Grzegorz



 On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or and...@databricks.com
 wrote:

 The Spark UI isn't available through the same address; otherwise new
 applications won't be able to bind to it. Once the old application
 finishes, the standalone Master renders the after-the-fact application 
 UI
 and exposes it under a different URL. To see this, go to the Master UI
 (master-url:8080) and click on your application in the Completed
 Applications table.


 2014-08-13 10:56 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

 Take a look at http://spark.apache.org/docs/latest/monitoring.html
 -- you need to launch a history server to serve the logs.

 Matei

 On August 13, 2014 at 2:03:08 AM, grzegorz-bialek (
 grzegorz.bia...@codilime.com) wrote:

 Hi,
 I wanted to access Spark web UI after application 

Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
- dev list  + user list
Shark is not officially supported anymore so you are better off moving to
Spark SQL.
Shark doesn't support Hive partitioning logic anyway; it has its own version of
partitioning on in-memory blocks that is independent of whether you
partition your data in hive or not.



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


On Fri, Nov 7, 2014 at 3:31 AM, Gordon Benjamin gordon.benjami...@gmail.com
 wrote:

 Hi All,

 I'm using Spark/Shark as the foundation for some reporting that I'm doing
 and have a customers table with approximately 3 million rows that I've
 cached in memory.

 I've also created a partitioned table that I've also cached in memory on a
 per day basis

 FROM
 customers_cached
 INSERT OVERWRITE TABLE
 part_customers_cached
 PARTITION(createday)
 SELECT id,email,dt_cr, to_date(dt_cr) as createday where
 dt_cr > unix_timestamp('2013-01-01 00:00:00') and
 dt_cr < unix_timestamp('2013-12-31 23:59:59');
 set exec.dynamic.partition=true;

 set exec.dynamic.partition.mode=nonstrict;

 however when I run the following basic tests I get this type of performance

 [localhost:1] shark> select count(*) from part_customers_cached where
  createday >= '2014-08-01' and createday <= '2014-12-06';
 37204
 Time taken (including network latency): 3.131 seconds

 [localhost:1] shark> SELECT count(*) from customers_cached where
 dt_cr > unix_timestamp('2013-08-01 00:00:00') and
 dt_cr < unix_timestamp('2013-12-06 23:59:59');
 37204
 Time taken (including network latency): 1.538 seconds

 I'm running this on a cluster with one master and two slaves and was hoping
 that the partitioned table would be noticeably faster but it looks as
 though the partitioning has slowed things down... Is this the case, or is
 there some additional configuration that I need to do to speed things up?

 Best Wishes,

 Gordon