Re: Bug in Accumulators...

2014-11-07 Thread Shixiong Zhu
Could you provide all the pieces of code that reproduce the bug? Here is
my test code:

import org.apache.spark._
import org.apache.spark.SparkContext._

object SimpleApp {

  def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SimpleApp")
val sc = new SparkContext(conf)

val accum = sc.accumulator(0)
for (i <- 1 to 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
}
sc.stop()
  }
}

It works fine in both client and cluster mode. Since this is a serialization
bug, the outer class does matter. Could you provide it? Is there
a SparkContext field in the outer class?
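
For illustration, this is the kind of pattern that triggers such a failure (a hypothetical outer class, not necessarily the reporter's code):

import org.apache.spark.{SparkConf, SparkContext}

// The foreach closure refers to an instance member, so `this` (including its
// SparkContext field, which is not serializable) is pulled into the task
// closure and the job fails with "Task not serializable".
class Outer(sc: SparkContext) {
  val accum = sc.accumulator(0)
  def run(): Unit = {
    // `accum` here is really `this.accum`
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  }
}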

Best Regards,
Shixiong Zhu

2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:

 I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
 if I
 run it in local mode! )

 If I put the accumulator inside the for loop, everything will work fine. I
 guess the bug is that an accumulator can be applied to JUST one RDD.

 Still another undocumented 'feature' of Spark that no one from the people
 who maintain Spark is willing to solve or at least to tell us about ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Bug in Accumulators...

2014-11-07 Thread Aaron Davidson
This may be due in part to Scala allocating an anonymous inner class in
order to execute the for loop. I would expect if you change it to a while
loop like

var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}

then the problem may go away. I am not super familiar with the closure
cleaner, but I believe that we cannot prune beyond 1 layer of references,
so the extra class of nesting may be screwing something up. If this is the
case, then I would also expect replacing the accumulator with any other
reference to the enclosing scope (such as a broadcast variable) would have
the same result.
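
Another way to sidestep the extra nesting, if that is indeed the cause, is to copy the accumulator into a local val so the task closure captures only that local reference rather than the enclosing class (a sketch, not verified against the original reporter's code):

val accum = sc.accumulator(0)
for (i <- 1 to 10) {
  // the foreach lambda now captures only `localAccum`, not the anonymous
  // class generated for the for loop body
  val localAccum = accum
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => localAccum += x)
}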

On Fri, Nov 7, 2014 at 12:03 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 Could you provide all pieces of codes which can reproduce the bug? Here is
 my test code:

 import org.apache.spark._
 import org.apache.spark.SparkContext._

 object SimpleApp {

   def main(args: Array[String]) {
 val conf = new SparkConf().setAppName("SimpleApp")
 val sc = new SparkContext(conf)

 val accum = sc.accumulator(0)
 for (i <- 1 to 10) {
   sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
 }
 sc.stop()
   }
 }

 It works fine both in client and cluster. Since this is a serialization
 bug, the outer class does matter. Could you provide it? Is there
 a SparkContext field in the outer class?

 Best Regards,
 Shixiong Zhu

 2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:

 I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works
 if I
 run it in local mode! )

 If I put the accumulator inside the for loop, everything will work fine. I
 guess the bug is that an accumulator can be applied to JUST one RDD.

 Still another undocumented 'feature' of Spark that no one from the people
 who maintain Spark is willing to solve or at least to tell us about ...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17372.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RE: CheckPoint Issue with JsonRDD

2014-11-07 Thread Jahagirdar, Madhu
Michael any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this a
bug?


Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at 
SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at 
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

=

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}


object SparkStreamingToParquet extends Logging {


  /**
   *
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
if (args.length < 4) {
  logInfo("Please provide valid parameters: hdfsFilesLocation: " +
    "hdfs://ip:8020/user/hdfs/--/ IMPALAtableloc hdfs://ip:8020/user/hive/--/ " +
    "tablename checkpointDir")
  logInfo("make sure you give the full folder path with '/' at the end, i.e. /user/hdfs/abc/")
  System.exit(1)
}
val HDFS_FILE_LOC = args(0)
val IMPALA_TABLE_LOC  = args(1)
val TEMP_TABLE_NAME = args(2)
val CHECKPOINT_DIR = args(3)

val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, 
()={
  createContext(args)
})

jssc.start
jssc.awaitTermination
  }


  def createContext(args:Array[String]): StreamingContext = {

val HDFS_FILE_LOC = args(0)
val IMPALA_TABLE_LOC  = args(1)
val TEMP_TABLE_NAME = args(2)
val CHECKPOINT_DIR = args(3)

val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")

val jssc: StreamingContext = new StreamingContext(sparkConf, new 
Duration(3))

val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)

hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC,true,org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME);

val schemaString = "name age"
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)

textFileStream.foreachRDD(rdd => {
  if (rdd != null && rdd.count() > 0) {
    val schRdd = hivecontext.jsonRDD(rdd, schema)
    logInfo("inserting into table: " + TEMP_TABLE_NAME)
    schRdd.insertInto(TEMP_TABLE_NAME)
  }
})
jssc.checkpoint(CHECKPOINT_DIR)
jssc
  }
}



case class Person(name:String, age:String) extends Serializable
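
A commonly suggested pattern for combining Spark SQL with a checkpointed StreamingContext, sketched here on the assumption that the NPE comes from the HiveContext reference captured in the checkpointed foreachRDD closure, is to create the context lazily from the RDD's own SparkContext instead:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Lazily-created singleton, so no HiveContext has to live inside the
// checkpointed DStream closure.
object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

// inside createContext:
textFileStream.foreachRDD(rdd => {
  if (rdd != null && rdd.count() > 0) {
    val hc = HiveContextSingleton.getInstance(rdd.sparkContext)
    val schRdd = hc.jsonRDD(rdd, schema)
    schRdd.insertInto(TEMP_TABLE_NAME)
  }
})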

Regards,
Madhu jahagirdar


The information contained in this message may be confidential and legally 
protected under applicable law. The message is intended solely for the 
addressee(s). If you are not the intended recipient, you are hereby notified 
that any use, forwarding, dissemination, or reproduction of this message is 
strictly prohibited and may be unlawful. If you are not the intended recipient, 
please contact the sender by return e-mail and destroy all copies of the 
original message.


sql - group by on UDF not working

2014-11-07 Thread Tridib Samanta
I am trying to group by on a calculated field. Is it supported in Spark SQL? I 
am running it on a nested JSON structure.
 
Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c 
group by YEAR(c.Patient.DOB)
 
Spark Version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4.
Error: 
 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not 
in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB 
AS DOB#191) AS c_0#185, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)], 
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS DOB#191) 
AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192, LongType)) AS 
c_1#186L]
 Subquery c
  Subquery claim
   LogicalRDD 
[AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
 MappedRDD[5] at map at JsonRDD.scala:43
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at .<init>(<console>:32)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at 

about write mongodb in mapPartitions

2014-11-07 Thread qinwei






Hi, everyone
    I have come across a problem with writing data to MongoDB in mapPartitions,
my code is as below:

        val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
        // some transformations
        val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
        val newRDD = rdd.mapPartitions(args => {
            val mongoClient = new MongoClient("host", port)
            val db = mongoClient.getDB("db")
            val coll = db.getCollection("collectionA")

            args.map(arg => {
                coll.insert(new BasicDBObject("pkg", arg))
                arg
            })

            mongoClient.close()
            args
        })

        newRDD.saveAsTextFile("hdfs://host:port/path")

    The application saved the data to HDFS correctly, but not to MongoDB. Is
there something wrong? I know that collecting newRDD to the driver and then
saving it to MongoDB will succeed, but will the following saveAsTextFile read
the filesystem once again?

    Thanks

qinwei
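
One likely reason the inserts above never reach MongoDB: mapPartitions hands the function a lazy iterator, so the args.map(...) that performs the inserts is never consumed (its result is discarded) before mongoClient.close() runs, while the untouched args iterator is what saveAsTextFile later consumes. A sketch that forces the writes before closing the client, with names mirroring the snippet above:

import com.mongodb.{BasicDBObject, MongoClient}

val newRDD = rdd.mapPartitions(args => {
  val mongoClient = new MongoClient("host", port)
  val coll = mongoClient.getDB("db").getCollection("collectionA")

  // materialize the mapped iterator so the inserts actually execute
  val written = args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
  }).toList

  mongoClient.close()
  written.iterator
})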



Re: about write mongodb in mapPartitions

2014-11-07 Thread Akhil Das
Why not saveAsNewAPIHadoopFile?


//Define your mongoDB confs

val config = new Configuration()

config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

//Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any],
classOf[Any], classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
config)


Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei wei@dewmobile.net wrote:

 Hi, everyone

 I come across with a prolem about writing data to mongodb in
 mapPartitions, my code is as below:

  val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
   // some transformations
 val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
 val newRDD = rdd.mapPartitions(args => {
 val mongoClient = new MongoClient("host", port)
 val db = mongoClient.getDB("db")
 val coll = db.getCollection("collectionA")

 args.map(arg => {
 coll.insert(new BasicDBObject("pkg", arg))
 arg
 })

 mongoClient.close()
 args
 })

 newRDD.saveAsTextFile("hdfs://host:port/path")

 The application saved data to HDFS correctly, but not mongodb, is
 there someting wrong?
 I know that collecting the newRDD to driver and then saving it to
 mongodb will success, but will the following saveAsTextFile read the
 filesystem once again?

 Thanks


 --
 qinwei



Re: multiple spark context in same driver program

2014-11-07 Thread Akhil Das
My bad, I just fired up a spark-shell and created a new sparkContext and it
was working fine. I basically did a parallelize and collect with both
sparkContexts.

Thanks
Best Regards

On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 That doc was created during the initial days (Spark 0.8.0), you can of
 course create multiple sparkContexts in the same driver program now.


 You sure about that? According to
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-context-in-local-mode-thread-safe-td7275.html
 (June 2014), you currently can’t have multiple SparkContext objects in the
 same JVM.

 Tobias





Native / C/C++ code integration

2014-11-07 Thread Paul Wais
Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs?  

I have done some work on this topic using JNA.  I wrote a FlatMapFunction
that processes all partition entries using a C++ library.  This approach
works well, but there are some tradeoffs:
 * Shipping the native dylib with the app jar and loading it at runtime
requires a bit of work (on top of normal JNA usage)
 * Native code doesn't respect the executor heap limits.  Under heavy memory
pressure, the native code can sporadically fail with ENOMEM.
 * While JNA can map Strings, structs, and Java primitive types, the user
still needs to deal with more complex objects.  E.g. re-serialize
protobuf/thrift objects, or provide some other encoding for moving data
between Java and C/C++.
 * C++ static initialization is not thread-safe before C++11, so the user sometimes needs
to take care when running inside multi-threaded executors
 * Avoiding memory copies can be a little tricky
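
For reference, a minimal JNA sketch of the approach described above, using a hypothetical native library "demo" that exposes int add(int, int); the shared library still has to be shipped with the job and be discoverable on each executor:

import com.sun.jna.{Library, Native}

trait DemoLib extends Library {
  def add(a: Int, b: Int): Int
}

object DemoLib {
  // loaded lazily, once per executor JVM, via jna.library.path / java.library.path
  lazy val INSTANCE: DemoLib =
    Native.loadLibrary("demo", classOf[DemoLib]).asInstanceOf[DemoLib]
}

// assuming an RDD[Int]; calling per partition lets the library handle be reused
val shifted = rdd.mapPartitions(iter => iter.map(x => DemoLib.INSTANCE.add(x, 1)))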

One other alternative approach comes to mind is pipe().  However, PipedRDD
requires copying data over pipes, does not support binary data (?), and
native code errors that crash the subprocess don't bubble up to the Spark
job as nicely as with JNA.

Is there a way to expose raw, in-memory partition/block data to native code?

Has anybody else attacked this problem a different way?

All the best,
-Paul 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sql - group by on UDF not working

2014-11-07 Thread Shixiong Zhu
Now it doesn't support such query. I can easily reproduce it. Created a
JIRA here: https://issues.apache.org/jira/browse/SPARK-4296
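
Until that is fixed, one pattern that may sidestep the check is to compute the UDF in an inner query and group by the resulting alias; a sketch (column aliases are illustrative and untested against this schema):

val result = hiveContext.sql("""
  SELECT dobYear, SUM(totalPay)
  FROM (
    SELECT YEAR(c.Patient.DOB) AS dobYear, c.ClaimPay.TotalPayAmnt AS totalPay
    FROM claim c
  ) t
  GROUP BY dobYear
""")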

Best Regards,
Shixiong Zhu

2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com:

 I am trying to group by on a calculated field. Is it supported on spark
 sql? I am running it on a nested json structure.

 Query: SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim
 c group by YEAR(c.Patient.DOB)

 Spark Version: spark-1.2.0-SNAPSHOT wit Hive and hadoop 2.4.
 Error:

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression
 not in GROUP BY:
 HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
 DOB#191) AS c_0#185, tree:
 Aggregate
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)],
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS
 DOB#191) AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192,
 LongType)) AS c_1#186L]
  Subquery c
   Subquery claim
LogicalRDD
 [AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14],
 MappedRDD[5] at map at JsonRDD.scala:43
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
 at
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 at
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
 at
 scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
 at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
 at $iwC$$iwC$$iwC$$iwC.init(console:17)
 at $iwC$$iwC$$iwC.init(console:22)
 at $iwC$$iwC.init(console:24)
 at $iwC.init(console:26)
 at init(console:28)
 at .init(console:32)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 

Re: LZO support in Spark 1.0.0 - nothing seems to work

2014-11-07 Thread Sree Harsha
@rogthefrog

Were you able to figure out how to fix this issue? 
Even I tried all combinations that possible but no luck yet.

Thanks,
Harsha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LZO-support-in-Spark-1-0-0-nothing-seems-to-work-tp14494p18349.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MESOS slaves shut down due to 'health check timed out'

2014-11-07 Thread Yangcheng Huang
Hi guys

Do you know how to handle the following case -

= From MESOS log file =
Slave asked to shut down by master@:5050 because 'health
check timed out'
I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework 
===

Any configurations to increase this timeout interval?

Thanks
YC

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Store DStreams into Hive using Hive Streaming

2014-11-07 Thread Luiz Geovani Vier
Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming (
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
that takes care of compaction and doesn't require any downtime for the
table. The data is immediately available and Hive will combine files in
background transparently. I was hoping to use this API from within Spark to
mitigate the issue with lots of small files...

Here's my equivalent code for Trident (work in progress):
https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident will coordinate the transaction and send all the tuples from each
server/partition to your component at once (Stream.partitionPersist). That
is very helpful since Hive expects batches of records instead of one call
for each record.
I had a look at foreachRDD but it seems to be invoked for each record. I'd
like to get all the Stream's records on each server/partition at once.
For example, if the stream was processed by 3 servers and resulted in 100
records on each server, I'd like to receive 3 calls (one on each server),
each with 100 records. Please let me know if I'm making any sense. I'm
fairly new to Spark.
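
For what it's worth, foreachRDD fires once per micro-batch (per RDD), not once per record; combining it with foreachPartition gives one call per partition with an iterator over all of that partition's records, which is closer to the batching Hive Streaming expects. A sketch, where writeBatchToHive is a hypothetical helper (not part of Spark) that would push one batch through the Hive Streaming API:

// assuming dstream: DStream[String]
def writeBatchToHive(records: Iterator[String]): Unit = ???  // hypothetical helper

dstream.foreachRDD { rdd =>
  // one call per partition, with all of that partition's records
  rdd.foreachPartition { (records: Iterator[String]) =>
    writeBatchToHive(records)
  }
}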

Thank you,
-Geovani



On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito 
silvio.fior...@granturing.com wrote:

  Geovani,

  You can use HiveContext to do inserts into a Hive table in a Streaming
 app just as you would a batch app. A DStream is really a collection of RDDs
 so you can run the insert from within the foreachRDD. You just have to be
 careful that you’re not creating large amounts of small files. So you may
 want to either increase the duration of your Streaming batches or
 repartition right before you insert. You’ll just need to do some testing
 based on your ingest volume. You may also want to consider streaming into
 another data store though.

  Thanks,
 Silvio

   From: Luiz Geovani Vier lgv...@gmail.com
 Date: Thursday, November 6, 2014 at 7:46 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Store DStreams into Hive using Hive Streaming

   Hello,

 Is there a built-in way or connector to store DStream results into an
 existing Hive ORC table using the Hive/HCatalog Streaming API?
 Otherwise, do you have any suggestions regarding the implementation of
 such component?

 Thank you,
  -Geovani



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works.

what is your suggestion on saving a distributed model?  so part of the
model is in one cluster, and some other parts of the model are in other
clusters.  during runtime, these sub-models run independently in their own
clusters (load, train, save).  and at some point during run time these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.

much appreciated.
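
for reference, a minimal sketch of the plain java serialization that works here (assuming the model object, e.g. mllib's Word2VecModel, is java.io.Serializable):

import java.io._

// write any serializable model to a local path
def saveModel(model: AnyRef, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

// read it back and cast to the expected type
def loadModel[T](path: String): T = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[T] finally in.close()
}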



On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to running
 save(), same with matlab. In python either pickling things or dumping to
 json seems pretty common. (even the scikit-learn docs recommend pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These all
 seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







error when importing HiveContext

2014-11-07 Thread Pagliari, Roberto
I'm getting this error when importing hive context

>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
    from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?


Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza.  i'm not familiar with the block matrix multiplication, but
is it a good fit for very large dimension, but extremely sparse matrix?

if not, what is your recommendation on implementing matrix multiplication
in spark on very large dimension, but extremely sparse matrix?




On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:

 See this thread for examples of sparse matrix x sparse matrix:
 https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

 We thought about providing matrix multiplies on CoordinateMatrix, however,
 the matrices have to be very dense for the overhead of having many little
 (i, j, value) objects to be worth it. For this reason, we are focused on
 doing block matrix multiplication first. The goal is version 1.3.

 Best,
 Reza

 On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implement certain aspect of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.

 -Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
Currently I see the word2vec model is collected onto the master, so the model 
itself is not distributed. 


I guess the question is why do you need  a distributed model? Is the vocab size 
so large that it's necessary? For model serving in general, unless the model is 
truly massive (ie cannot fit into memory on a modern high end box with 64, or 
128GB ram) then single instance is way faster and simpler (using a cluster of 
machines is more for load balancing / fault tolerance).




What is your use case for model serving?


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.
 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.
 much appreciated.
 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:
 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to running
 save(), same with matlab. In python either pickling things or dumping to
 json seems pretty common. (even the scikit-learn docs recommend pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These all
 seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Evan R. Sparks
There are a few examples where this is the case. Let's take ALS, where the
result is a MatrixFactorizationModel, which is assumed to be big - the
model consists of two matrices, one (users x k) and one (k x products).
These are represented as RDDs.

You can save these RDDs out to disk by doing something like

model.userFeatures.saveAsObjectFile(...) and
model.productFeatures.saveAsObjectFile(...)

to save out to HDFS or Tachyon or S3.

Then, when you want to reload you'd have to instantiate them into a class
of MatrixFactorizationModel. That class is package private to MLlib right
now, so you'd need to copy the logic over to a new class, but that's the
basic idea.
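
A minimal sketch of that save/reload flow (paths are illustrative):

// save the two factor RDDs
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// reload; the factors come back as RDD[(Int, Array[Double])]
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
// ...then wrap them in your own copy of the MatrixFactorizationModel logic,
// since that class's constructor is package private.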

That said - using spark to serve these recommendations on a point-by-point
basis might not be optimal. There's some work going on in the AMPLab to
address this issue.

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








where is the org.apache.spark.util package?

2014-11-07 Thread ll
i'm trying to compile some of the spark code directly from the source
(https://github.com/apache/spark).  it complains about the missing package
org.apache.spark.util.  it doesn't look like this package is part of the
source code on github. 

where can i find this package?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Nick Pentreath
For ALS if you want real time recs (and usually this is order 10s to a few 100s 
ms response), then Spark is not the way to go - a serving layer like Oryx, or 
prediction.io is what you want.


(At graphflow we've built our own).




You hold the factor matrices in memory and do the dot product in real time 
(with optional caching). Again, even for huge models (10s of millions 
users/items) this can be handled on a single, powerful instance. The issue at 
this scale is winnowing down the search space using LSH or similar approach to 
get to real time speeds.
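
A minimal sketch of that kind of in-memory serving, assuming the collected factor matrices fit on the serving instance:

val userFactors: Map[Int, Array[Double]] = model.userFeatures.collect().toMap
val itemFactors: Map[Int, Array[Double]] = model.productFeatures.collect().toMap

// plain dot product per (user, item) request
def score(user: Int, item: Int): Double = {
  val u = userFactors(user)
  val v = itemFactors(item)
  var s = 0.0
  var i = 0
  while (i < u.length) { s += u(i) * v(i); i += 1 }
  s
}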




For word2vec it's pretty much the same thing, since what you have is very similar 
to one of the ALS factor matrices.




One problem is that you can't access the word2vec vectors, as they are a private val. I 
think this should be changed, so that just the word vectors could be 
saved and used in a serving layer.


—
Sent from Mailbox

On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There are a few examples where this is the case. Let's take ALS, where the
 result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.
 You can save these RDDs out to disk by doing something like
 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)
 to save out to HDFS or Tachyon or S3.
 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.
 That said - using spark to serve these recommendations on a point-by-point
 basis might not be optimal. There's some work going on in the AMPLab to
 address this issue.
 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:
 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
 merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: where is the org.apache.spark.util package?

2014-11-07 Thread ll
i found the util package under the spark core package, but now i get this error:
"Symbol Utils is inaccessible from this place".

what does this error mean?

the org.apache.spark.util package and org.apache.spark.util.Utils are there now.

thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/where-is-the-org-apache-spark-util-package-tp18360p18361.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: deploying a model built in mllib

2014-11-07 Thread chirag lakhani
Thanks for letting me know about this, it looks pretty interesting.  From
reading the documentation it seems that the server must be built on a Spark
cluster, is that correct?  Is it possible to deploy it on a Java
server?  That is how we are currently running our web app.



On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote:

 The latest version of PredictionIO, which is now under Apache 2 license,
 supports the deployment of MLlib models on production.

 The engine you build will including a few components, such as:
 - Data - includes Data Source and Data Preparator
 - Algorithm(s)
 - Serving
 I believe that you can do the feature vector creation inside the Data
 Preparator component.

 Currently, the package comes with two templates: 1)  Collaborative
 Filtering Engine Template - with MLlib ALS; 2) Classification Engine
 Template - with MLlib Naive Bayes. The latter one may be useful to you. And
 you can customize the Algorithm component, too.

 I have just created a doc: http://docs.prediction.io/0.8.1/templates/
 Love to hear your feedback!

 Regards,
 Simon



 On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani chirag.lakh...@gmail.com
  wrote:

 Would pipelining include model export?  I didn't see that in the
 documentation.

 Are there ways that this is being done currently?



 On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com wrote:

 We are working on the pipeline features, which would make this
 procedure much easier in MLlib. This is still a WIP and the main JIRA
 is at:

 https://issues.apache.org/jira/browse/SPARK-1856

 Best,
 Xiangrui

 On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
 chirag.lakh...@gmail.com wrote:
  Hello,
 
  I have been prototyping a text classification model that my company
 would
  like to eventually put into production.  Our technology stack is
 currently
  Java based but we would like to be able to build our models in
 Spark/MLlib
  and then export something like a PMML file which can be used for model
  scoring in real-time.
 
  I have been using scikit learn where I am able to take the training
 data
  convert the text data into a sparse data format and then take the other
  features and use the dictionary vectorizer to do one-hot encoding for
 the
  other categorical variables.  All of those things seem to be possible
 in
  mllib but I am still puzzled about how that can be packaged in such a
 way
  that the incoming data can be first made into feature vectors and then
  evaluated as well.
 
  Are there any best practices for this type of thing in Spark?  I hope
 this
  is clear but if there are any confusions then please let me know.
 
  Thanks,
 
  Chirag






Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion.  originally i had a question
specifically about word2vec, but my follow up question on distributed model
is a more general question about saving different types of models.

on distributed model, i was hoping to implement a model parallelism, so
that different workers can work on different parts of the models, and then
merge the results at the end at the single master model.



On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Currently I see the word2vec model is collected onto the master, so the
 model itself is not distributed.

 I guess the question is why do you need  a distributed model? Is the vocab
 size so large that it's necessary? For model serving in general, unless the
 model is truly massive (ie cannot fit into memory on a modern high end box
 with 64, or 128GB ram) then single instance is way faster and simpler
 (using a cluster of machines is more for load balancing / fault tolerance).

 What is your use case for model serving?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org









Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs, which is much
more convenient for saving and loading.

my question is for the use case that they are not represented as RDDs yet.

then, do you think it makes sense to convert them into RDDs, just for the
convenience of saving and loading them in a distributed way?
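
if you do go that route, the conversion itself is cheap; a sketch with a toy local map and illustrative paths:

// hypothetical local map of word vectors pulled out of a trained model
val wordVectors: Map[String, Array[Float]] =
  Map("spark" -> Array(0.1f, 0.2f), "mllib" -> Array(0.3f, 0.4f))

// parallelize purely so the vectors can be saved/loaded through the RDD path
sc.parallelize(wordVectors.toSeq).saveAsObjectFile("hdfs:///models/word2vec")

val reloaded: Map[String, Array[Float]] =
  sc.objectFile[(String, Array[Float])]("hdfs:///models/word2vec").collect().toMap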

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 There are a few examples where this is the case. Let's take ALS, where the
 result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.

 You can save these RDDs out to disk by doing something like

 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)

 to save out to HDFS or Tachyon or S3.

 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.

 That said - using spark to serve these recommendations on a point-by-point
 basis might not be optimal. There's some work going on in the AMPLab to
 address this issue.

 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
 These all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if you're
 in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained and
 reload
 it in the future?  specifically, i'm using the mllib word2vec model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org









Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thansk nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into some compilation
issues now.  hopefully i can fix them soon, to get this thing going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 For ALS if you want real time recs (and usually this is order 10s to a few
 100s ms response), then Spark is not the way to go - a serving layer like
 Oryx, or prediction.io is what you want.

 (At graphflow we've built our own).

 You hold the factor matrices in memory and do the dot product in real time
 (with optional caching). Again, even for huge models (10s of millions
 users/items) this can be handled on a single, powerful instance. The issue
 at this scale is winnowing down the search space using LSH or similar
 approach to get to real time speeds.

 For word2vec it's pretty much the same thing as what you have is very
 similar to one of the ALS factor matrices.

 One problem is you can't access the wors2vec vectors as they are private
 val. I think this should be changed actually, so that just the word vectors
 could be saved and used in a serving layer.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There are a few examples where this is the case. Let's take ALS, where
 the result is a MatrixFactorizationModel, which is assumed to be big - the
 model consists of two matrices, one (users x k) and one (k x products).
 These are represented as RDDs.

 You can save these RDDs out to disk by doing something like

 model.userFeatures.saveAsObjectFile(...) and
 model.productFeatures.saveAsObjectFile(...)

 to save out to HDFS or Tachyon or S3.

 Then, when you want to reload you'd have to instantiate them into a class
 of MatrixFactorizationModel. That class is package private to MLlib right
 now, so you'd need to copy the logic over to a new class, but that's the
 basic idea.

 That said - using spark to serve these recommendations on a
 point-by-point basis might not be optimal. There's some work going on in
 the AMPLab to address this issue.

 On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com
 wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of the
 model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These
 all seem basically equivalent java serialization to me..

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the most
 common feature for any machine learning work - to be able to save your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 Plain old java serialization is one straightforward approach if
 you're in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained
 and reload
 it in the future?  specifically, i'm using the mllib word2vec
 model...
 thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org










Re: AVRO specific records

2014-11-07 Thread Simone Franzini
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that
I have not fully solved yet. I am able to run with Hadoop1 and AVRO in
standalone mode but not with Hadoop2 (even after trying to fix the
dependencies).

Anyway, I am now trying to write to AVRO, using a very similar snippet to
the one to read from AVRO:

val withValues : RDD[(AvroKey[Subscriber], NullWritable)] =
  records.map { s => (new AvroKey(s), NullWritable.get) }
val outPath = myOutputPath
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
classOf[AvroKey[Subscriber]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[Subscriber]],
writeJob.getConfiguration)

Now, my problem is that this writes to a plain text file. I need to write
to binary AVRO. What am I missing?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini captainfr...@gmail.com
wrote:

 Benjamin,

 Thanks for the snippet. I have tried using it, but unfortunately I get the
 following exception. I am clueless at what might be wrong. Any ideas?

 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)


 Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin 
 benjamin.la...@capitalone.com wrote:

 Something like this works and is how I create an RDD of specific records.

 val avroRdd = sc.newAPIHadoopFile("twitter.avro",
   classOf[AvroKeyInputFormat[twitter_schema]],
   classOf[AvroKey[twitter_schema]], classOf[NullWritable], conf)
 (From
 https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
 Keep in mind you'll need to use the kryo serializer as well.
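
 As a hedged reminder of what that Kryo setting looks like (only the serializer switch is
 shown; registering the generated Avro classes with a custom registrator may also be needed):

 import org.apache.spark.SparkConf

 // Switch Spark to the Kryo serializer before creating the SparkContext.
 val conf = new SparkConf()
   .setAppName("avro-specific-records")
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")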

 From: Frank Austin Nothaft fnoth...@berkeley.edu
 Date: Wednesday, November 5, 2014 at 5:06 PM
 To: Simone Franzini captainfr...@gmail.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: AVRO specific records

 Hi Simone,

 Matt Massie put together a good tutorial on his blog
 http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. If you’re
 looking for more code using Avro, we use it pretty extensively in our
 genomics project. Our Avro schemas are here
 https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl,
 and we have serialization code here
 https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization.
 We use Parquet for storing the Avro records, but there is also an Avro
 HadoopInputFormat.

 Regards,

 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466

 On Nov 5, 2014, at 1:25 PM, Simone Franzini captainfr...@gmail.com
 wrote:

 How can I read/write AVRO specific records?
 I found several snippets using generic records, but nothing with specific
 records so far.

 Thanks,
 Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini



 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.





Re: sparse x sparse matrix multiplication

2014-11-07 Thread Reza Zadeh
If you have a very large and very sparse matrix represented as (i, j,
value) entries, then you can try the algorithms mentioned in the post
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought
up earlier.

Reza

On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com wrote:

 thanks reza.  i'm not familiar with the block matrix multiplication, but
 is it a good fit for very large dimension, but extremely sparse matrix?

 if not, what is your recommendation on implementing matrix multiplication
 in spark on very large dimension, but extremely sparse matrix?




 On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:

 See this thread for examples of sparse matrix x sparse matrix:
 https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA

 We thought about providing matrix multiplies on CoordinateMatrix,
 however, the matrices have to be very dense for the overhead of having many
 little (i, j, value) objects to be worth it. For this reason, we are
 focused on doing block matrix multiplication first. The goal is version 1.3.

 Best,
 Reza

 On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:

 I think Xiangrui's ALS code implement certain aspect of it. You may want
 to check it out.
 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center



 From: Xiangrui Meng men...@gmail.com
 To: Duy Huynh duy.huynh@gmail.com
 Cc: user u...@spark.incubator.apache.org
 Date: 11/05/2014 01:13 PM
 Subject: Re: sparse x sparse matrix multiplication
 --



 You can use breeze for local sparse-sparse matrix multiplication and
 then define an RDD of sub-matrices

 RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)

 and then use join and aggregateByKey to implement this feature, which
 is the same as in MapReduce.
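
A hedged sketch of that join-based block multiply, assuming block ids and dimensions line
up and that the Breeze version in use supports local multiply and add on CSCMatrix;
reduceByKey is used here to sum partial products (aggregateByKey works as well):

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Multiply two block-partitioned sparse matrices given as (blockRowId, blockColId, sub-matrix).
def blockMultiply(
    a: RDD[(Int, Int, CSCMatrix[Double])],
    b: RDD[(Int, Int, CSCMatrix[Double])]): RDD[((Int, Int), CSCMatrix[Double])] = {
  // Key A's blocks by column id and B's blocks by row id so that blocks
  // sharing the inner index k meet in the join.
  val aByInner = a.map { case (i, k, block) => (k, (i, block)) }
  val bByInner = b.map { case (k, j, block) => (k, (j, block)) }
  aByInner.join(bByInner)
    .map { case (_, ((i, aBlock), (j, bBlock))) => ((i, j), aBlock * bBlock) } // local Breeze multiply
    .reduceByKey(_ + _)                                                        // sum partial products
}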

 -Xiangrui

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Simon Chan
Just want to elaborate more on Duy's suggestion on using PredictionIO.

PredictionIO will store the model automatically if you return it in the
training function.
An example using CF:

 def train(data: PreparedData): PersistentMatrixFactorizationModel = {
   val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda)
   new PersistentMatrixFactorizationModel(
     rank = m.rank,
     userFeatures = m.userFeatures,
     productFeatures = m.productFeatures)
 }


And the persisted model will be passed to the predict function when you
query for prediction:

def predict(
    model: PersistentMatrixFactorizationModel,
    query: Query): PredictedResult = {
  val productScores = model.recommendProducts(query.user, query.num)
    .map(r => ProductScore(r.product, r.rating))
  new PredictedResult(productScores)
}



Some templates and tutorials for MLlib are here:
http://docs.prediction.io/0.8.1/templates/

Simon


On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 Sure - in theory this sounds great. But in practice it's much faster and a
 whole lot simpler to just serve the model from a single instance in memory.
 Optionally you can multithread within that (as Oryx 1 does).

 There are very few real world use cases where the model is so large that
 it HAS to be distributed.

 Having said this, it's certainly possible to distribute model serving for
 factor-like models (like ALS). One idea I'm working on now is using
 Elasticsearch for exactly this purpose - but that's more because I'm using it
 for filtering of recommendation results and combining with search, so
 overall it's faster to do it this way.

 For the pure matrix algebra part, single instance in memory is way faster.

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh duy.huynh@gmail.com wrote:

 hi nick.. sorry about the confusion.  originally i had a question
 specifically about word2vec, but my follow up question on distributed model
 is a more general question about saving different types of models.

 on distributed model, i was hoping to implement a model parallelism, so
 that different workers can work on different parts of the models, and then
 merge the results at the end at the single master model.



 On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com
  wrote:

 Currently I see the word2vec model is collected onto the master, so the
 model itself is not distributed.

 I guess the question is why do you need  a distributed model? Is the
 vocab size so large that it's necessary? For model serving in general,
 unless the model is truly massive (ie cannot fit into memory on a modern
 high end box with 64, or 128GB ram) then single instance is way faster and
 simpler (using a cluster of machines is more for load balancing / fault
 tolerance).

 What is your use case for model serving?

 —
 Sent from Mailbox https://www.dropbox.com/mailbox


 On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 you're right, serialization works.

 what is your suggestion on saving a distributed model?  so part of
 the model is in one cluster, and some other parts of the model are in other
 clusters.  during runtime, these sub-models run independently in their own
 clusters (load, train, save).  and at some point during run time these
 sub-models merge into the master model, which also loads, trains, and saves
 at the master level.

 much appreciated.



 On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com
 wrote:

 There's some work going on to support PMML -
 https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
 been merged into master.

 What are you used to doing in other environments? In R I'm used to
 running save(), same with matlab. In python either pickling things or
 dumping to json seems pretty common. (even the scikit-learn docs recommend
 pickling -
 http://scikit-learn.org/stable/modules/model_persistence.html). These
 all seem basically equivalent to Java serialization to me.

 Would some helper functions (in, say, mllib.util.modelpersistence or
 something) make sense to add?

 On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com
 wrote:

 that works.  is there a better way in spark?  this seems like the
 most common feature for any machine learning work - to be able to save 
 your
 model after training it and load it later.

 On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
  wrote:

 Plain old java serialization is one straightforward approach if
 you're in java/scala.

 On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:

 what is the best way to save an mllib model that you just trained
 and reload
 it in the future?  specifically, i'm using the mllib word2vec
 model...
 thanks.



 --
 View this message in context:
 

Re: Dynamically InferSchema From Hive and Create parquet file

2014-11-07 Thread Michael Armbrust
Perhaps if you can describe what you are trying to accomplish at high level
it'll be easier to help.

On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu 
madhu.jahagir...@philips.com wrote:

 Any idea on this?
 
 From: Jahagirdar, Madhu
 Sent: Thursday, November 06, 2014 12:28 PM
 To: Michael Armbrust
 Cc: u...@spark.incubator.apache.org
 Subject: RE: Dynamically InferSchema From Hive and Create parquet file

 When I create a Hive table with Parquet format, it does not create any
 metadata until data is inserted. So the data needs to be there before I can
 infer the schema; otherwise it throws an error. Any workaround for this?
 
 From: Michael Armbrust [mich...@databricks.com]
 Sent: Thursday, November 06, 2014 12:27 AM
 To: Jahagirdar, Madhu
 Cc: u...@spark.incubator.apache.org
 Subject: Re: Dynamically InferSchema From Hive and Create parquet file

 That method is for creating a new directory to hold parquet data when
 there is no hive metastore available, thus you have to specify the schema.

 If you've already created the table in the metastore you can just query it
 using the sql method:

 javahiveContext.sql("SELECT * FROM parquetTable");

 You can also load the data as a SchemaRDD without using the metastore
 since parquet is self describing:


 javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData")

 On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu 
 madhu.jahagir...@philips.com wrote:
 Currently the createParquetFile method needs a bean class as one of the parameters:

 javahiveContext.createParquetFile(XBean.class,
     IMPALA_TABLE_LOC, true, new Configuration())
   .registerTempTable(TEMP_TABLE_NAME);

 Is it possible to dynamically infer the schema from Hive using the hive
 context and the table name, and then pass that schema in?


 Regards.

 Madhu Jahagirdar






 
 The information contained in this message may be confidential and legally
 protected under applicable law. The message is intended solely for the
 addressee(s). If you are not the intended recipient, you are hereby
 notified that any use, forwarding, dissemination, or reproduction of this
 message is strictly prohibited and may be unlawful. If you are not the
 intended recipient, please contact the sender by return e-mail and destroy
 all copies of the original message.




partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All,

I'm using Spark/Shark as the foundation for some reporting that I'm doing
and have a customers table with approximately 3 million rows that I've
cached in memory.

I've also created a partitioned table that I've also cached in memory on a
per day basis

FROM
customers_cached
INSERT OVERWRITE TABLE
part_customers_cached
PARTITION(createday)
SELECT id,email,dt_cr, to_date(dt_cr) as createday where
dt_cr > unix_timestamp('2013-01-01 00:00:00') and
dt_cr < unix_timestamp('2013-12-31 23:59:59');
set exec.dynamic.partition=true;

set exec.dynamic.partition.mode=nonstrict;

however when I run the following basic tests I get this type of performance

[localhost:1] shark> select count(*) from part_customers_cached where
 createday >= '2014-08-01' and createday <= '2014-12-06';
37204
Time taken (including network latency): 3.131 seconds

[localhost:1] shark> SELECT count(*) from customers_cached where
dt_cr > unix_timestamp('2013-08-01 00:00:00') and
dt_cr < unix_timestamp('2013-12-06 23:59:59');
37204
Time taken (including network latency): 1.538 seconds

I'm running this on a cluster with one master and two slaves and was hoping
that the partitioned table would be noticeably faster but it looks as
though the partitioning has slowed things down... Is this the case, or is
there some additional configuration that I need to do to speed things up?

Best Wishes,

Gordon


Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error

2014-11-07 Thread ryaminal
We are unable to run more than one application at a time using Spark 1.0.0 on
CDH5. We submit two applications using two different SparkContexts on the
same Spark Master. The Spark Master was started using the following command
and parameters and is running in standalone mode:

 /usr/java/jdk1.7.0_55-cloudera/bin/java   -XX:MaxPermSize=128m  
 -Djava.net.preferIPv4Stack=true   -Dspark.akka.logLifecycleEvents=true  
 -Xms8589934592   -Xmx8589934592   org.apache.spark.deploy.master.Master
 --ip ip-10-186-155-45.ec2.internal

When submitting this application by itself it finishes and all of the data
comes out happy. The problem occurs when trying to run another application
while an existing application is still processing, and we get an error
stating that the spark contexts were shut down prematurely. The errors can be
viewed in the following pastebins. All IP addresses have been changed to
1.1.1.1 for security reasons. Notice that at the top of the logs we have
printed out the spark config stuff for reference.

The working logs: http://pastebin.com/CnitnMhy
The broken logs: http://pastebin.com/VGs87bBZ

We have also included the worker logs. For the second app, we see in the
work/app/ directory 7 additional directories: `0/ 1/ 2/ 3/ 4/ 5/ 6/`. There
are then two different groups of errors. The first three are one group and
the other 4 are the other group of errors.

Worker log for broken app group 1: http://pastebin.com/7VwZ1Gwu
Worker log for broken app group 2: http://pastebin.com/shs4d8T4
Worker log for working app: available upon request.

The two different errors are the last lines of both groups and are:

 Received LaunchTask command but executor was null

 Slave registration failed: Duplicate executor ID: 4

tl;dr: We are unable to run more than one application in the same spark master
using different spark contexts. The only errors we see are broadcast errors.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Applications-Spark-Contexts-Concurrently-Fail-With-Broadcast-Error-tp18374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Still struggling with building documentation

2014-11-07 Thread Alessandro Baretta
I finally came to realize that there is a special maven target to build the
scaladocs, although an arguably very unintuitive one: mvn verify. So now I
have scaladocs for each package, but not for the whole spark project.
Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
build/docs/api directory referenced in api.html is missing. How do I build
it?

Alex Baretta


jsonRdd and MapType

2014-11-07 Thread boclair
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). 
I'd like some of the json fields to be in a MapType rather than a sub
StructType, as the keys will be very sparse.

For example:
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 val jsonRdd = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRdd = sqlContext.jsonRDD(jsonRdd)
 schemaRdd.printSchema
root
 |-- attributes: struct (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- location: string (nullable = true)
 |-- key: string (nullable = true)
 schemaRdd.collect
res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
[[null,nyc],4321])


However this isn't what I want.  So I created my own StructType to pass to
the jsonRDD call:

 import org.apache.spark.sql._
 val st = StructType(Seq(StructField("key", StringType, false),
   StructField("attributes", MapType(StringType, StringType, false))))
 val jsonRddSt = sc.parallelize(Seq(
   """{"key": "1234", "attributes": {"gender": "m"}}""",
   """{"key": "4321", "attributes": {"location": "nyc"}}"""))
 val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
 schemaRddSt.printSchema
root
 |-- key: string (nullable = false)
 |-- attributes: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 schemaRddSt.collect
***  Failure  ***
scala.MatchError: MapType(StringType,StringType,false) (of class
org.apache.spark.sql.catalyst.types.MapType)
at 
org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
...

The schema of the schemaRDD is correct.  But it seems that the json cannot
be coerced to a MapType.  I can see at the line in the stack trace that
there is no case statement for MapType.  Is there something I'm missing?  Is
this a bug or decision to not support MapType with json?

Thanks,
Brian




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Still struggling with building documentation

2014-11-07 Thread Nicholas Chammas
I believe the web docs need to be built separately according to the
instructions here
https://github.com/apache/spark/blob/master/docs/README.md.

Did you give those a shot?

It's annoying to have a separate thing with new dependencies in order to
build the web docs, but that's how it is at the moment.

Nick

On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com
wrote:

 I finally came to realize that there is a special maven target to build
 the scaladocs, although an arguably very unintuitive one: mvn verify. So now
 I have scaladocs for each package, but not for the whole spark project.
 Specifically, build/docs/api/scala/index.html is missing. Indeed the whole
 build/docs/api directory referenced in api.html is missing. How do I build
 it?

 Alex Baretta



Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this.
Generally, terms like "subset" and "denormalization" sound generic, but such
computations tend to have very specific requirements, so it is hard to point
to a design pattern without more information about yours.

If you want to feed back to kafka, you can take a look at this pull request

https://github.com/apache/spark/pull/2994

On Thu, Nov 6, 2014 at 4:15 PM, bdev buntu...@gmail.com wrote:

 We are looking at consuming the kafka stream using Spark Streaming and
 transform into various subsets like applying some transformation or
 de-normalizing some fields, etc. and feed it back into Kafka as a different
 topic for downstream consumers.

 Wanted to know if there are any existing patterns for achieving this.

 Thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Any-patterns-for-multiplexing-the-streaming-data-tp18303.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian,

Right now, MapType is not supported in the StructType provided to
jsonRDD/jsonFile. We will add the support. I have created
https://issues.apache.org/jira/browse/SPARK-4302 to track this issue.

Thanks,

Yin

On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote:

 I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)).
 I'd like some of the json fields to be in a MapType rather than a sub
 StructType, as the keys will be very sparse.

 For example:
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD
  val jsonRdd = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
  val schemaRdd = sqlContext.jsonRDD(jsonRdd)
  schemaRdd.printSchema
 root
  |-- attributes: struct (nullable = true)
  |    |-- gender: string (nullable = true)
  |    |-- location: string (nullable = true)
  |-- key: string (nullable = true)
  schemaRdd.collect
 res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234],
 [[null,nyc],4321])


 However this isn't what I want.  So I created my own StructType to pass to
 the jsonRDD call:

  import org.apache.spark.sql._
  val st = StructType(Seq(StructField("key", StringType, false),
    StructField("attributes", MapType(StringType, StringType, false))))
  val jsonRddSt = sc.parallelize(Seq(
    """{"key": "1234", "attributes": {"gender": "m"}}""",
    """{"key": "4321", "attributes": {"location": "nyc"}}"""))
  val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st)
  schemaRddSt.printSchema
 root
  |-- key: string (nullable = false)
  |-- attributes: map (nullable = true)
  |    |-- key: string
  |    |-- value: string (valueContainsNull = false)
  schemaRddSt.collect
 ***  Failure  ***
 scala.MatchError: MapType(StringType,StringType,false) (of class
 org.apache.spark.sql.catalyst.types.MapType)
 at
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397)
 ...

 The schema of the schemaRDD is correct.  But it seems that the json cannot
 be coerced to a MapType.  I can see at the line in the stack trace that
 there is no case statement for MapType.  Is there something I'm missing?
 Is
 this a bug or decision to not support MapType with json?

 Thanks,
 Brian




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: deploying a model built in mllib

2014-11-07 Thread Donald Szeto
Hi Chirag,

Could you please provide more information on your Java server environment?

Regards,
Donald

On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani chirag.lakh...@gmail.com
wrote:

 Thanks for letting me know about this, it looks pretty interesting.  From
 reading the documentation it seems that the server must be built on a Spark
 cluster, is that correct?  Is it possible to deploy it in on a Java
 server?  That is how we are currently running our web app.



 On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote:

 The latest version of PredictionIO, which is now under Apache 2 license,
 supports the deployment of MLlib models on production.

 The engine you build will include a few components, such as:
 - Data - includes Data Source and Data Preparator
 - Algorithm(s)
 - Serving
 I believe that you can do the feature vector creation inside the Data
 Preparator component.

 Currently, the package comes with two templates: 1)  Collaborative
 Filtering Engine Template - with MLlib ALS; 2) Classification Engine
 Template - with MLlib Naive Bayes. The latter one may be useful to you. And
 you can customize the Algorithm component, too.

 I have just created a doc: http://docs.prediction.io/0.8.1/templates/
 Love to hear your feedback!

 Regards,
 Simon



 On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani 
 chirag.lakh...@gmail.com wrote:

 Would pipelining include model export?  I didn't see that in the
 documentation.

 Are there ways that this is being done currently?



 On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com
 wrote:

 We are working on the pipeline features, which would make this
 procedure much easier in MLlib. This is still a WIP and the main JIRA
 is at:

 https://issues.apache.org/jira/browse/SPARK-1856

 Best,
 Xiangrui

 On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
 chirag.lakh...@gmail.com wrote:
  Hello,
 
  I have been prototyping a text classification model that my company
 would
  like to eventually put into production.  Our technology stack is
 currently
  Java based but we would like to be able to build our models in
 Spark/MLlib
  and then export something like a PMML file which can be used for model
  scoring in real-time.
 
  I have been using scikit-learn, where I am able to take the training
  data, convert the text data into a sparse format, and then take the
  other features and use the dictionary vectorizer to do one-hot encoding
  for the other categorical variables.  All of those things seem to be
  possible in mllib, but I am still puzzled about how that can be packaged
  in such a way that the incoming data can be first made into feature
  vectors and then evaluated as well.
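
For the text part of that workflow, a rough MLlib analogue of the vectorization step is a
hashing vectorizer; a hedged sketch is below, assuming labels and tokenization are prepared
upstream and treating the feature dimension as a free choice:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Turn (label, tokens) pairs into LabeledPoints with hashed term-frequency features.
def vectorize(docs: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
  val hashingTF = new HashingTF(1 << 18)
  docs.map { case (label, tokens) => LabeledPoint(label, hashingTF.transform(tokens)) }
}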
 
  Are there any best practices for this type of thing in Spark?  I hope
 this
  is clear but if there are any confusions then please let me know.
 
  Thanks,
 
  Chirag







-- 
Donald Szeto
PredictionIO


Re: Parallelize on spark context

2014-11-07 Thread _soumya_
Naveen, 
 Don't be worried - you're not the only one to be bitten by this. A little
inspection of the Javadoc told me you have this other option: 

JavaRDD<Integer> distData = sc.parallelize(data, 100);

-- Now the RDD is split into 100 partitions.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallelize-on-spark-context-tp18327p18381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming: stderr does not roll

2014-11-07 Thread Nguyen, Duc
We are running spark streaming jobs (version 1.1.0). After a sufficient
amount of time, the stderr file grows until the disk is full at 100% and
crashes the cluster. I've read this

https://github.com/apache/spark/pull/895

and also read this

http://spark.apache.org/docs/latest/configuration.html#spark-streaming


So I've tried testing with this in an attempt to get the stderr log file to
roll.

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")


Yet it does not roll and continues to grow. Am I missing something obvious?


thanks,
Duc


Integrating Spark with other applications

2014-11-07 Thread gtinside
Hi ,

I have been working on Spark SQL and want to expose this functionality to
other applications. The idea is to let other applications send SQL to be
executed on the spark cluster and get the result back. I looked at spark job
server (https://github.com/ooyala/spark-jobserver) but it provides a RESTful
interface. I am looking for something similar as
spring-hadoop(http://projects.spring.io/spring-hadoop/) to do a spark-submit
programmatically.
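
For reference, a hedged sketch of one alternative shape for this: embedding Spark directly
in the calling application instead of going through spark-submit. The master URL, file
path, and table name below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Long-lived context owned by the application that wants to run SQL.
val conf = new SparkConf().setAppName("sql-service").setMaster("spark://master:7077")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Register some data once, then serve ad-hoc queries from callers.
sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")

def runQuery(query: String): Array[org.apache.spark.sql.Row] =
  sqlContext.sql(query).collect()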

Regards,
Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Integrating Spark with other applications

2014-11-07 Thread Thomas Risberg
Hi,

I'm a committer on that spring-hadoop project and I'm also interested in
integrating Spark with other Java applications. I would love to see some
guidance from the Spark community for the best way to accomplish this. We
have plans to add features to work with Spark Apps in similar ways we now
support Hive and Pig jobs in the spring-hadoop project. In fact, I added a
spring-hadoop-spark sub-project earlier, but there is no real code there
yet. Hoping to get this added soon, so some helpful pointers would be great.

-Thomas

[1]
https://github.com/spring-projects/spring-hadoop/tree/master/spring-hadoop-spark/src/main/java/org/springframework/data/hadoop/spark

On Fri, Nov 7, 2014 at 5:42 PM, gtinside gtins...@gmail.com wrote:

 Hi ,

 I have been working on Spark SQL and want to expose this functionality to
 other applications. The idea is to let other applications send SQL to be
 executed on the spark cluster and get the result back. I looked at spark job
 server (https://github.com/ooyala/spark-jobserver) but it provides a
 RESTful
 interface. I am looking for something similar as
 spring-hadoop(http://projects.spring.io/spring-hadoop/) to do a
 spark-submit
 programmatically.

 Regards,
 Gaurav



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: error when importing HiveContext

2014-11-07 Thread Davies Liu
bin/pyspark will set up the PYTHONPATH for py4j for you; otherwise you need to
set it up yourself:

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip

On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto
rpagli...@appcomsci.com wrote:
 I’m getting this error when importing hive context



 from pyspark.sql import HiveContext

 Traceback (most recent call last):

   File "<stdin>", line 1, in <module>

   File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>

 from pyspark.context import SparkContext

   File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>

 from pyspark.java_gateway import launch_gateway

   File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in
 <module>

 from py4j.java_gateway import java_import, JavaGateway, GatewayClient

 ImportError: No module named py4j.java_gateway



 I cannot find py4j on my system. Where is it?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark context not defined

2014-11-07 Thread Pagliari, Roberto
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and 
hive 0.9.0.

When using python 2.7
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

I'm getting 'sc not defined'

On the other hand, I can see 'sc' from pyspark CLI.

Is there a way to fix it?


MatrixFactorizationModel serialization

2014-11-07 Thread Dariusz Kobylarz
I am trying to persist MatrixFactorizationModel (Collaborative Filtering 
example) and use it in another script to evaluate/apply it.

This is the exception I get when I try to use a deserialized model instance:

Exception in thread main java.lang.NullPointerException
at 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at 
org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at 
org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
at java.util.TimSort.sort(TimSort.java:189)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
at scala.collection.AbstractSeq.sorted(Seq.scala:40)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
at 
org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
at 
org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)

...

Is this model serializable at all? I noticed it has two RDDs inside
(user and product features).


Thanks,



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MatrixFactorizationModel serialization

2014-11-07 Thread Sean Owen
Serializable like a Java object? no, it's an RDD. A factored matrix
model is huge, unlike most models, and is not a local object. You can
of course persist the RDDs to storage manually and read them back.

On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz
darek.kobyl...@gmail.com wrote:
 I am trying to persist MatrixFactorizationModel (Collaborative Filtering
 example) and use it in another script to evaluate/apply it.
 This is the exception I get when I try to use a deserialized model instance:

 Exception in thread main java.lang.NullPointerException
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at
 org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at
 org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
 at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58)
 at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324)
 at java.util.TimSort.sort(TimSort.java:189)
 at java.util.TimSort.sort(TimSort.java:173)
 at java.util.Arrays.sort(Arrays.java:659)
 at scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
 at scala.collection.AbstractSeq.sorted(Seq.scala:40)
 at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594)
 at scala.collection.AbstractSeq.sortBy(Seq.scala:40)
 at
 org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58)
 at
 org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536)
 at
 org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57)
 ...

 Is this model serializable at all? I noticed it has two RDDs inside (user
 and product features).

 Thanks,



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread YaoPau
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output
after submitting the SparkPi example in yarn cluster mode
(http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html)
using:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster
--master yarn
$SPARK_HOME/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.3.jar 10

Output (repeated):

14/11/07 19:33:05 INFO Client: Application report from ASM: 
 application identifier: application_1415303569855_1100
 appId: 1100
 clientToAMToken: null
 appDiagnostics: 
 appMasterHost: N/A
 appQueue: root.yp
 appMasterRpcPort: -1
 appStartTime: 1415406486231
 yarnAppState: ACCEPTED
 distributedFinalState: UNDEFINED

I'll note that spark-submit is working correctly when running with master
local on the edge node.  

Any ideas how to solve this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-endlessly-in-yarnAppState-ACCEPTED-tp18391.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkPi endlessly in yarnAppState: ACCEPTED

2014-11-07 Thread jayunit100
Sounds like no free yarn workers. i.e. try running:

hadoop jar hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1

We have some smoke tests which you might find particularly useful for yarn 
clusters as well in https://github.com/apache/bigtop, underneath 
bigtop-tests/smoke-tests which are generally good to 
run on any basic hadoop cluster when you first set it up.

Often if a Yarn job hangs in accepted state, it is waiting for resources to 
free up to start the tasks...

On Nov 7, 2014, at 7:40 PM, YaoPau jonrgr...@gmail.com wrote:

 appStartTime



Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-07 Thread Davies Liu
Could you tell us how large the data set is? It will help us debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
 I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
 the same bug running the 'sort.py' example. On a smaller data set, it worked
 fine. On a larger data set I got this error:

 Traceback (most recent call last):
   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
 <module>
  .sortByKey(lambda x: x)
   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
 bounds.append(samples[index])
 IndexError: list index out of range



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 1.1.0 Can not read snappy compressed sequence file

2014-11-07 Thread Stéphane Verlet
I first saw this using SparkSQL but the result is the same with plain
Spark.

14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)

Full stack below 

I tried many different things without luck (a related per-application setting
is sketched after this list):
* extracted the libsnappyjava.so from the Spark assembly and put it on the
  library path
* added -Djava.library.path=... to SPARK_MASTER_OPTS and SPARK_WORKER_OPTS
* added the library path to SPARK_LIBRARY_PATH
* added the hadoop library path to SPARK_LIBRARY_PATH
* rebuilt spark with different versions (previous and next) of Snappy
  (as seen when Google-ing)
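
A hedged sketch of that per-application variant; the CDH parcel path below is an
assumption, so use whatever directory actually holds libsnappy*.so and libhadoop.so:

import org.apache.spark.SparkConf

// Point the driver and executors at the Hadoop native libs that contain libsnappy.
val conf = new SparkConf()
  .set("spark.driver.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")
  .set("spark.executor.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")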


Env :
   Centos 6.4
   Hadoop 2.3 (CDH5.1)
   Running in standalone/local mode


Any help would be appreciated

Thank you

Stephane


scala> import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.io.BytesWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.NullWritable

scala> var seq =
sc.sequenceFile[NullWritable,Text]("/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0").map(_._2.toString())
14/11/07 19:46:19 INFO MemoryStore: ensureFreeSpace(157973) called with
curMem=0, maxMem=278302556
14/11/07 19:46:19 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 154.3 KB, free 265.3 MB)
seq: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at console:15

scala> seq.collect().foreach(println)
14/11/07 19:46:35 INFO FileInputFormat: Total input paths to process : 1
14/11/07 19:46:35 INFO SparkContext: Starting job: collect at console:18
14/11/07 19:46:35 INFO DAGScheduler: Got job 0 (collect at console:18)
with 2 output partitions (allowLocal=false)
14/11/07 19:46:35 INFO DAGScheduler: Final stage: Stage 0(collect at
console:18)
14/11/07 19:46:35 INFO DAGScheduler: Parents of final stage: List()
14/11/07 19:46:35 INFO DAGScheduler: Missing parents: List()
14/11/07 19:46:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at
map at console:15), which has no missing parents
14/11/07 19:46:35 INFO MemoryStore: ensureFreeSpace(2928) called with
curMem=157973, maxMem=278302556
14/11/07 19:46:35 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 2.9 KB, free 265.3 MB)
14/11/07 19:46:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage
0 (MappedRDD[2] at map at console:15)
14/11/07 19:46:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/11/07 19:46:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
1, localhost, PROCESS_LOCAL, 1243 bytes)
14/11/07 19:46:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/07 19:46:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:6504064+6504065
14/11/07 19:46:36 INFO HadoopRDD: Input split:
file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:0+6504064
14/11/07 19:46:36 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
14/11/07 19:46:36 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap
14/11/07 19:46:36 INFO deprecation: mapred.task.partition is deprecated.
Instead, use mapreduce.task.partition
14/11/07 19:46:36 INFO deprecation: mapred.job.id is deprecated. Instead,
use mapreduce.job.id
14/11/07 19:46:36 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
14/11/07 19:46:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.UnsatisfiedLinkError:
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native
Method)
at
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
at
org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190)
at
org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915)
at
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:197)
at 

How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Fang


Fwd: How to add elements into map?

2014-11-07 Thread Tim Chou
Here is the code I run in spark-shell:

val table = sc.textFile(args(1))
val histMap = collection.mutable.Map[Int,Int]()
for (x <- table) {
  val tuple = x.split('|')
  histMap.put(tuple(0).toInt, 1)
}

Why is histMap still null?
Is there something wrong with my code?

Thanks,
Tim


Re: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread ll
hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Fwd: Why is Spark not using all cores on a single machine?

2014-11-07 Thread Ganelin, Ilya
To set the number of spark cores used you must set two parameters in the actual
spark-submit script. You must set --num-executors (the number of executors to launch)
and --executor-cores (the number of cores per executor). Please see the Spark
configuration and tuning pages for more details.


-Original Message-
From: ll [duy.huynh@gmail.commailto:duy.huynh@gmail.com]
Sent: Saturday, November 08, 2014 12:05 AM Eastern Standard Time
To: u...@spark.incubator.apache.org
Subject: Re: Fwd: Why is Spark not using all cores on a single machine?


hi.  i did use local[8] as below, but it still ran on only 1 core.

val sc = new SparkContext(new
SparkConf().setMaster("local[8]").setAppName("abc"))

any advice is much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Re: How to add elements into map?

2014-11-07 Thread lalit1303
It doesn't work that way.
Following is the correct way:

val table = sc.textFile(args(1))
val histMap = table.map(x => (x.split('|')(0).toInt, 1))
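
A hedged follow-up sketch: the original mutable map stays empty because the loop body is a
closure that runs on the executors against serialized copies, so the driver-side map is
never updated. To get an actual histogram back on the driver, aggregate on the cluster and
collect the result:

val counts: scala.collection.Map[Int, Long] =
  table.map(_.split('|')(0).toInt).countByValue()

// Equivalent with an explicit reduce:
// table.map(x => (x.split('|')(0).toInt, 1L)).reduceByKey(_ + _).collectAsMap()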




-
Lalit Yadav
la...@sigmoidanalytics.com
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-elements-into-map-tp18395p18399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Viewing web UI after fact

2014-11-07 Thread Arun Ahuja
We are running our applications through YARN and are only sometimes seeing
them in the History Server.  Most do not seem to have the
APPLICATION_COMPLETE file.  Specifically any job that ends because of yarn
application -kill does not show up.  For other ones what would be a reason
for them not to appear in the Spark UI?  Is there any update on this?

Thanks,
Arun

On Mon, Sep 15, 2014 at 4:10 AM, Grzegorz Białek 
grzegorz.bia...@codilime.com wrote:

 Hi Andrew,

 sorry for late response. Thank you very much for solving my problem. There
 was no APPLICATION_COMPLETE file in log directory due to not calling
 sc.stop() at the end of program. With stopping spark context everything
 works correctly, so thank you again.

 Best regards,
 Grzegorz


 On Fri, Sep 5, 2014 at 8:06 PM, Andrew Or and...@databricks.com wrote:

 Hi Grzegorz,

 Can you verify that there are APPLICATION_COMPLETE files in the event
 log directories? E.g. Does
 file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If
 not, it could be that your application didn't call sc.stop(), so the
 ApplicationEnd event is not actually logged. The HistoryServer looks for
 this special file to identify applications to display. You could also try
 manually adding the APPLICATION_COMPLETE file to this directory; the
 HistoryServer should pick this up and display the application, though the
 information displayed will be incomplete because the log did not capture
 all the events (sc.stop() does a final close() on the file written).
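
 A minimal sketch of the application-side requirement described above (the job body is
 illustrative):

 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("my-app"))
 try {
   sc.parallelize(1 to 100).count()
 } finally {
   sc.stop()  // closes the event log so APPLICATION_COMPLETE gets written
 }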

 Andrew


 2014-09-05 1:50 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com:

 Hi Andrew,

 thank you very much for your answer. Unfortunately it still doesn't
 work. I'm using Spark 1.0.0, and I start history server running
 sbin/start-history-server.sh dir, although I also set
  SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in
 conf/spark-env.sh. I tried also other dir than /tmp/spark-events which
 have all possible permissions enabled. Also adding file: (and file://)
 didn't help - history server still shows:
 History Server
 Event Log Location: file:/tmp/spark-events/
 No Completed Applications Found.

 Best regards,
 Grzegorz


 On Thu, Sep 4, 2014 at 8:20 PM, Andrew Or and...@databricks.com wrote:

 Hi Grzegorz,

 Sorry for the late response. Unfortunately, if the Master UI doesn't
 know about your applications (they are completed with respect to a
 different Master), then it can't regenerate the UIs even if the logs exist.
 You will have to use the history server for that.

  How did you start the history server? If you are using Spark >= 1.0, you
 can pass the directory as an argument to the sbin/start-history-server.sh
 script. Otherwise, you may need to set the following in your
 conf/spark-env.sh to specify the log directory:

 export
 SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/tmp/spark-events

 It could also be a permissions thing. Make sure your logs in
 /tmp/spark-events are accessible by the JVM that runs the history server.
 Also, there's a chance that /tmp/spark-events is interpreted as an HDFS
 path depending on which Spark version you're running. To resolve any
 ambiguity, you may set the log path to file:/tmp/spark-events instead.
 But first verify whether they actually exist.

 Let me know if you get it working,
 -Andrew



 2014-08-19 8:23 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com
 :

 Hi,
 Is there any way view history of applications statistics in master ui
 after restarting master server? I have all logs ing /tmp/spark-events/ but
 when I start history server in this directory it says No Completed
 Applications Found. Maybe I could copy this logs to dir used by master
 server but I couldn't find any. Or maybe I'm doing something wrong
 launching history server.
 Do you have any idea how to solve it?

 Thanks,
 Grzegorz


 On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek 
 grzegorz.bia...@codilime.com wrote:

 Hi,

 Thank you both for your answers. Browsing using Master UI works fine.
 Unfortunately History Server shows No Completed Applications Found even
 if logs exists under given directory, but using Master UI is enough for 
 me.

 Best regards,
 Grzegorz



 On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or and...@databricks.com
 wrote:

 The Spark UI isn't available through the same address; otherwise new
 applications won't be able to bind to it. Once the old application
 finishes, the standalone Master renders the after-the-fact application 
 UI
 and exposes it under a different URL. To see this, go to the Master UI
 (master-url:8080) and click on your application in the Completed
 Applications table.


 2014-08-13 10:56 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

 Take a look at http://spark.apache.org/docs/latest/monitoring.html
 -- you need to launch a history server to serve the logs.

 Matei

 On August 13, 2014 at 2:03:08 AM, grzegorz-bialek (
 grzegorz.bia...@codilime.com) wrote:

 Hi,
 I wanted to access Spark web UI after application 

Re: Using partitioning to speed up queries in Shark

2014-11-07 Thread Mayur Rustagi
- dev list  + user list
Shark is not officially supported anymore so you are better off moving to
Spark SQL.
Shark doesn't support Hive partitioning logic anyway; it has its own version of
partitioning on in-memory blocks that is independent of whether you
partition your data in hive or not.



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


On Fri, Nov 7, 2014 at 3:31 AM, Gordon Benjamin gordon.benjami...@gmail.com
 wrote:

 Hi All,

 I'm using Spark/Shark as the foundation for some reporting that I'm doing
 and have a customers table with approximately 3 million rows that I've
 cached in memory.

 I've also created a partitioned table that I've also cached in memory on a
 per day basis

 FROM
 customers_cached
 INSERT OVERWRITE TABLE
 part_customers_cached
 PARTITION(createday)
 SELECT id,email,dt_cr, to_date(dt_cr) as createday where
 dt_cr > unix_timestamp('2013-01-01 00:00:00') and
 dt_cr < unix_timestamp('2013-12-31 23:59:59');
 set exec.dynamic.partition=true;

 set exec.dynamic.partition.mode=nonstrict;

 however when I run the following basic tests I get this type of performance

 [localhost:1] shark> select count(*) from part_customers_cached where
  createday >= '2014-08-01' and createday <= '2014-12-06';
 37204
 Time taken (including network latency): 3.131 seconds

 [localhost:1] shark> SELECT count(*) from customers_cached where
 dt_cr > unix_timestamp('2013-08-01 00:00:00') and
 dt_cr < unix_timestamp('2013-12-06 23:59:59');
 37204
 Time taken (including network latency): 1.538 seconds

 I'm running this on a cluster with one master and two slaves and was hoping
 that the partitioned table would be noticeably faster but it looks as
 though the partitioning has slowed things down... Is this the case, or is
 there some additional configuration that I need to do to speed things up?

 Best Wishes,

 Gordon