Re: Bug in Accumulators...
Could you provide all the pieces of code that can reproduce the bug? Here is my test code:

import org.apache.spark._
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0)
    for (i <- 1 to 10) {
      sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
    }
    sc.stop()
  }
}

It works fine both in client and cluster mode. Since this is a serialization bug, the outer class does matter. Could you provide it? Is there a SparkContext field in the outer class?

Best Regards,
Shixiong Zhu

2014-10-28 0:28 GMT+08:00 octavian.ganea octavian.ga...@inf.ethz.ch:
I am also using spark 1.1.0 and I ran it on a cluster of nodes (it works if I run it in local mode!). If I put the accumulator inside the for loop, everything works fine. I guess the bug is that an accumulator can be applied to JUST one RDD. Still another undocumented 'feature' of Spark that no one from the people who maintain Spark is willing to solve or at least tell us about ...
Re: Bug in Accumulators...
This may be due in part to Scala allocating an anonymous inner class in order to execute the for loop. I would expect that if you change it to a while loop, like

var i = 0
while (i < 10) {
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
  i += 1
}

then the problem may go away. I am not super familiar with the closure cleaner, but I believe that we cannot prune beyond one layer of references, so the extra class of nesting may be screwing something up. If this is the case, then I would also expect that replacing the accumulator with any other reference to the enclosing scope (such as a broadcast variable) would have the same result.

On Fri, Nov 7, 2014 at 12:03 AM, Shixiong Zhu zsxw...@gmail.com wrote:
Could you provide all the pieces of code that can reproduce the bug?
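For convenience, here is that suggestion spliced into the earlier test program. This is a sketch only, not verified against the original reporter's outer class:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))
    val accum = sc.accumulator(0)
    // A while loop generates no extra anonymous class, unlike the
    // for-comprehension, so only one level of closure nesting remains.
    var i = 0
    while (i < 10) {
      sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
      i += 1
    }
    println(accum.value)
    sc.stop()
  }
}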
RE: CheckPoint Issue with JsonRDD
Michael, any idea on this?

From: Jahagirdar, Madhu
Sent: Thursday, November 06, 2014 2:36 PM
To: mich...@databricks.com; user
Subject: CheckPoint Issue with JsonRDD

When we enable checkpointing and use jsonRDD we get the following error. Is this a bug?

Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
at org.apache.spark.sql.SQLContext.applySchema(SQLContext.scala:132)
at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:194)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:69)
at SparkStreamingToParquet$$anonfun$createContext$1.apply(SparkStreamingToParquet.scala:63)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

==========

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{Logging, SparkConf}
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.hive.api.java.JavaHiveContext
import org.apache.spark.streaming.api.java.JavaStreamingContext
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

object SparkStreamingToParquet extends Logging {

  /**
   * @param args
   * @throws Exception
   */
  def main(args: Array[String]) {
    if (args.length < 4) {
      logInfo("Please provide valid parameters: hdfsFilesLocation hdfs://ip:8020/user/hdfs/--/ IMPALAtableloc hdfs://ip:8020/user/hive/--/ tablename checkpointDir")
      logInfo("make sure you give the full folder path with '/' at the end, i.e. /user/hdfs/abc/")
      System.exit(1)
    }
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val jssc: StreamingContext = StreamingContext.getOrCreate(CHECKPOINT_DIR, () => {
      createContext(args)
    })
    jssc.start()
    jssc.awaitTermination()
  }

  def createContext(args: Array[String]): StreamingContext = {
    val HDFS_FILE_LOC = args(0)
    val IMPALA_TABLE_LOC = args(1)
    val TEMP_TABLE_NAME = args(2)
    val CHECKPOINT_DIR = args(3)
    val sparkConf: SparkConf = new SparkConf().setAppName("Json to Parquet").set("spark.cores.max", "3")
    val jssc: StreamingContext = new StreamingContext(sparkConf, new Duration(3))
    val hivecontext: HiveContext = new HiveContext(jssc.sparkContext)
    hivecontext.createParquetFile[Person](IMPALA_TABLE_LOC, true, org.apache.spark.deploy.SparkHadoopUtil.get.conf).registerTempTable(TEMP_TABLE_NAME)
    val schemaString = "name age"
    val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    val textFileStream = jssc.textFileStream(HDFS_FILE_LOC)
    textFileStream.foreachRDD(rdd => {
      if (rdd != null && rdd.count() > 0) {
        val schRdd = hivecontext.jsonRDD(rdd, schema)
        logInfo("inserting into table: " + TEMP_TABLE_NAME)
        schRdd.insertInto(TEMP_TABLE_NAME)
      }
    })
    jssc.checkpoint(CHECKPOINT_DIR)
    jssc
  }
}

case class Person(name: String, age: String) extends Serializable

Regards,
Madhu Jahagirdar
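One pattern that often avoids this class of NPE, offered as a guess and not verified against this exact program, is to build the HiveContext lazily from the RDD's own SparkContext inside foreachRDD, so the restored closure never captures a context created before checkpoint recovery. SQLContextSingleton is a hypothetical helper; schema, textFileStream and TEMP_TABLE_NAME are the definitions from the program above:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper: lazily create one HiveContext per JVM so the
// foreachRDD closure does not hold a context that is null after the
// streaming context is restored from a checkpoint.
object SQLContextSingleton {
  @transient private var instance: HiveContext = null
  def getInstance(sc: SparkContext): HiveContext = {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

textFileStream.foreachRDD(rdd => {
  if (rdd != null && rdd.count() > 0) {
    val hivecontext = SQLContextSingleton.getInstance(rdd.sparkContext)
    val schRdd = hivecontext.jsonRDD(rdd, schema)
    schRdd.insertInto(TEMP_TABLE_NAME)
  }
})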
sql - group by on UDF not working
I am trying to group by on a calculated field. Is it supported in Spark SQL? I am running it on a nested JSON structure.

Query:
SELECT YEAR(c.Patient.DOB), sum(c.ClaimPay.TotalPayAmnt) FROM claim c GROUP BY YEAR(c.Patient.DOB)

Spark version: spark-1.2.0-SNAPSHOT with Hive and Hadoop 2.4.

Error:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS DOB#191) AS c_0#185, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB)], [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(Patient#8.DOB AS DOB#191) AS c_0#185,SUM(CAST(ClaimPay#5.TotalPayAmnt AS TotalPayAmnt#192, LongType)) AS c_1#186L]
 Subquery c
  Subquery claim
   LogicalRDD [AttendPhysician#0,BillProv#1,Claim#2,ClaimClinic#3,ClaimInfo#4,ClaimPay#5,ClaimTL#6,OpPhysician#7,Patient#8,PayToPhysician#9,Payer#10,Physician#11,RefProv#12,Services#13,Subscriber#14], MappedRDD[5] at map at JsonRDD.scala:43

at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:423)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:17)
at $iwC$$iwC$$iwC.<init>(<console>:22)
at $iwC$$iwC.<init>(<console>:24)
at $iwC.<init>(<console>:26)
at <init>(<console>:28)
at .<init>(<console>:32)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
about write mongodb in mapPartitions
Hi, everyone

I've come across a problem writing data to MongoDB in mapPartitions. My code is as below:

val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
// some transformations
val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
val newRDD = rdd.mapPartitions(args => {
  val mongoClient = new MongoClient(host, port)
  val db = mongoClient.getDB("db")
  val coll = db.getCollection("collectionA")
  args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
  })
  mongoClient.close()
  args
})
newRDD.saveAsTextFile("hdfs://host:port/path")

The application saved the data to HDFS correctly, but not to MongoDB. Is there something wrong? I know that collecting newRDD to the driver and then saving it to MongoDB will succeed, but will the following saveAsTextFile read the filesystem once again?

Thanks
qinwei
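The likely culprit is that Iterator.map is lazy: nothing forces the inserts to run before mongoClient.close(), and the partition function returns the original args iterator rather than the mapped one. A sketch of one possible fix, materializing the inserts before the client is closed (host, port and the collection names are the placeholders from the question):

import com.mongodb.{BasicDBObject, MongoClient}

val newRDD = rdd.mapPartitions(args => {
  val mongoClient = new MongoClient(host, port)
  val coll = mongoClient.getDB("db").getCollection("collectionA")
  // Force the lazy iterator so every insert actually runs
  // before the client is closed.
  val inserted = args.map(arg => {
    coll.insert(new BasicDBObject("pkg", arg))
    arg
  }).toList
  mongoClient.close()
  inserted.iterator
})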
Re: about write mongodb in mapPartitions
Why not saveAsNewAPIHadoopFile?

// Define your MongoDB conf
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Write everything to Mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random",
  classOf[Any],
  classOf[Any],
  classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
  config)

Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei wei@dewmobile.net wrote:
Hi, everyone. I've come across a problem writing data to MongoDB in mapPartitions; my code is as below:
Re: multiple spark context in same driver program
My bad, I just fired up a spark-shell and created a new sparkContext and it was working fine. I basically did a parallelize and collect with both sparkContexts. Thanks Best Regards On Fri, Nov 7, 2014 at 3:17 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Nov 7, 2014 at 4:58 PM, Akhil Das ak...@sigmoidanalytics.com wrote: That doc was created during the initial days (Spark 0.8.0), you can of course create multiple sparkContexts in the same driver program now. You sure about that? According to http://apache-spark-user-list.1001560.n3.nabble.com/Is-spark-context-in-local-mode-thread-safe-td7275.html (June 2014), you currently can’t have multiple SparkContext objects in the same JVM. Tobias
Native / C/C++ code integration
Dear List,

Has anybody had experience integrating C/C++ code into Spark jobs? I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs:
* Shipping the native dylib with the app jar and loading it at runtime requires a bit of work (on top of normal JNA usage)
* Native code doesn't respect the executor heap limits. Under heavy memory pressure, the native code can sometimes ENOMEM sporadically.
* While JNA can map Strings, structs, and Java primitive types, the user still needs to deal with more complex objects. E.g. re-serialize protobuf/thrift objects, or provide some other encoding for moving data between Java and C/C++.
* C++ statics are not thread-safe before C++11, so the user sometimes needs to take care when running inside multi-threaded executors
* Avoiding memory copies can be a little tricky

One other alternative approach that comes to mind is pipe(). However, PipedRDD requires copying data over pipes, does not support binary data (?), and native code errors that crash the subprocess don't bubble up to the Spark job as nicely as with JNA.

Is there a way to expose raw, in-memory partition/block data to native code? Has anybody else attacked this problem a different way?

All the best,
-Paul
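For reference, a bare-bones sketch of the JNA-per-partition pattern described above, assuming an RDD[Double] and a hypothetical native library exposing double score(double x):

import com.sun.jna.{Library, Native}

// Hypothetical native library interface; JNA proxies calls to
// libscore (score.dll / libscore.so / libscore.dylib).
trait ScoreLib extends Library {
  def score(x: Double): Double
}

val scored = rdd.mapPartitions { iter =>
  // Bind the native library once per partition, not once per record.
  val lib = Native.loadLibrary("score", classOf[ScoreLib]).asInstanceOf[ScoreLib]
  iter.map(lib.score)
}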
Re: sql - group by on UDF not working
Now it doesn't support such a query. I can easily reproduce it. Created a JIRA here: https://issues.apache.org/jira/browse/SPARK-4296

Best Regards,
Shixiong Zhu

2014-11-07 16:44 GMT+08:00 Tridib Samanta tridib.sama...@live.com:
I am trying to group by on a calculated field. Is it supported in Spark SQL? I am running it on a nested JSON structure.
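Until that JIRA is resolved, a possible workaround (untested against this schema; hiveContext is whatever SQL context the query above ran in) is to compute the UDF in a subquery and group on its alias:

val result = hiveContext.sql("""
  SELECT t.dobYear, SUM(t.totalPay)
  FROM (SELECT YEAR(c.Patient.DOB) AS dobYear,
               c.ClaimPay.TotalPayAmnt AS totalPay
        FROM claim c) t
  GROUP BY t.dobYear
""")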
Re: LZO support in Spark 1.0.0 - nothing seems to work
@rogthefrog Were you able to figure out how to fix this issue? I have tried all the combinations possible, but no luck yet.

Thanks,
Harsha
MESOS slaves shut down due to 'health check timed out'
Hi guys,

Do you know how to handle the following case?

===== From MESOS log file =====
Slave asked to shut down by master@:5050 because 'health check timed out'
I1107 17:33:20.860988 27573 slave.cpp:1337] Asked to shut down framework
===============================

Any configurations to increase this timeout interval?

Thanks,
YC
Re: Store DStreams into Hive using Hive Streaming
Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) that takes care of compaction and doesn't require any downtime for the table. The data is immediately available and Hive will combine files in the background transparently. I was hoping to use this API from within Spark to mitigate the issue with lots of small files...

Here's my equivalent code for Trident (work in progress): https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident will coordinate the transaction and send all the tuples from each server/partition to your component at once (Stream.partitionPersist). That is very helpful since Hive expects batches of records instead of one call for each record. I had a look at foreachRDD but it seems to be invoked for each record. I'd like to get all the Stream's records on each server/partition at once. For example, if the stream was processed by 3 servers and resulted in 100 records on each server, I'd like to receive 3 calls (one on each server), each with 100 records.

Please let me know if I'm making any sense. I'm fairly new to Spark.
Thank you,
-Geovani

On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito silvio.fior...@granturing.com wrote:
Geovani,
You can use HiveContext to do inserts into a Hive table in a Streaming app just as you would a batch app. A DStream is really a collection of RDDs, so you can run the insert from within the foreachRDD. You just have to be careful that you’re not creating large amounts of small files. So you may want to either increase the duration of your Streaming batches or repartition right before you insert. You’ll just need to do some testing based on your ingest volume. You may also want to consider streaming into another data store though.
Thanks,
Silvio

From: Luiz Geovani Vier lgv...@gmail.com
Date: Thursday, November 6, 2014 at 7:46 PM
To: user@spark.apache.org
Subject: Store DStreams into Hive using Hive Streaming

Hello,
Is there a built-in way or connector to store DStream results into an existing Hive ORC table using the Hive/HCatalog Streaming API? Otherwise, do you have any suggestions regarding the implementation of such a component?
Thank you,
-Geovani
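For what it's worth, foreachRDD fires once per micro-batch (once per RDD), not once per record; pairing it with foreachPartition gives per-partition batching much like Trident's partitionPersist. A minimal sketch, where HiveStreamingWriter stands in for a hypothetical wrapper around the Hive Streaming API (the real calls would come from org.apache.hive.hcatalog.streaming):

// Hypothetical wrapper; write/commit would delegate to the Hive
// Streaming API's TransactionBatch calls.
class HiveStreamingWriter {
  def write(record: String): Unit = {}
  def commit(): Unit = {}
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One writer and one transaction batch per partition: each
    // executor receives all of its partition's records in one call.
    val writer = new HiveStreamingWriter
    records.foreach(writer.write)
    writer.commit()
  }
}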
Re: word2vec: how to save an mllib model and reload it?
you're right, serialization works.

what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level.

much appreciated.

On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with matlab. In python either pickling things or dumping to json seems pretty common. (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent to java serialization to me. Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add?

On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote:
that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later.

On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
Plain old java serialization is one straightforward approach if you're in java/scala.

On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec model... thanks.
error when importing HiveContext
I'm getting this error when importing HiveContext:

>>> from pyspark.sql import HiveContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module>
    from pyspark.context import SparkContext
  File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module>
    from pyspark.java_gateway import launch_gateway
  File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module>
    from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway

I cannot find py4j on my system. Where is it?
Re: sparse x sparse matrix multiplication
thanks reza. i'm not familiar with block matrix multiplication, but is it a good fit for a very large dimension, but extremely sparse matrix? if not, what is your recommendation on implementing matrix multiplication in spark for a very large dimension, but extremely sparse matrix?

On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote:
See this thread for examples of sparse matrix x sparse matrix: https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA
We thought about providing matrix multiplies on CoordinateMatrix; however, the matrices have to be very dense for the overhead of having many little (i, j, value) objects to be worth it. For this reason, we are focused on doing block matrix multiplication first. The goal is version 1.3.
Best,
Reza

On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote:
I think Xiangrui's ALS code implements a certain aspect of it. You may want to check it out.
Best regards,
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center

From: Xiangrui Meng men...@gmail.com
To: Duy Huynh duy.huynh@gmail.com
Cc: user u...@spark.incubator.apache.org
Date: 11/05/2014 01:13 PM
Subject: Re: sparse x sparse matrix multiplication

You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix), and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce.
-Xiangrui
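A rough sketch of the recipe Xiangrui describes, assuming the two matrices are block-partitioned compatibly, and using reduceByKey in place of aggregateByKey for brevity:

import breeze.linalg.CSCMatrix
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Blocks are (blockRowId, blockColId, sub-matrix).
def multiply(a: RDD[(Int, Int, CSCMatrix[Double])],
             b: RDD[(Int, Int, CSCMatrix[Double])]): RDD[((Int, Int), CSCMatrix[Double])] = {
  // Key A's blocks by column and B's blocks by row so matching pairs meet.
  val aByCol = a.map { case (i, k, m) => (k, (i, m)) }
  val bByRow = b.map { case (k, j, m) => (k, (j, m)) }
  aByCol.join(bByRow)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) } // local breeze multiply
    .reduceByKey(_ + _) // sum partial products for each output block
}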
Re: word2vec: how to save an mllib model and reload it?
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed.

I guess the question is why do you need a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (i.e. cannot fit into memory on a modern high-end box with 64 or 128GB of RAM) then a single instance is way faster and simpler (using a cluster of machines is more for load balancing / fault tolerance).

What is your use case for model serving?

—
Sent from Mailbox

On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote:
you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated.
Re: word2vec: how to save an mllib model and reload it?
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs.

You can save these RDDs out to disk by doing something like model.userFeatures.saveAsObjectFile(...) and model.productFeatures.saveAsObjectFile(...) to save out to HDFS or Tachyon or S3.

Then, when you want to reload, you'd have to instantiate them into a class of MatrixFactorizationModel. That class is package private to MLlib right now, so you'd need to copy the logic over to a new class, but that's the basic idea.

That said - using spark to serve these recommendations on a point-by-point basis might not be optimal. There's some work going on in the AMPLab to address this issue.

On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote:
you're right, serialization works. what is your suggestion on saving a distributed model?
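A sketch of that save/reload flow, with placeholder paths, based on the description above:

// Save the two factor RDDs (paths are placeholders):
model.userFeatures.saveAsObjectFile("hdfs://.../userFeatures")
model.productFeatures.saveAsObjectFile("hdfs://.../productFeatures")

// Reload them later; the rank is assumed to be known from training.
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs://.../userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs://.../productFeatures")
// Rebuilding a MatrixFactorizationModel from (rank, userFeatures,
// productFeatures) then requires a copy of that class, since its
// constructor is package-private to MLlib.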
where is the org.apache.spark.util package?
i'm trying to compile some of the spark code directly from the source (https://github.com/apache/spark). it complains about the missing package org.apache.spark.util. it doesn't look like this package is part of the source code on github. where can i find this package?
Re: word2vec: how to save an mllib model and reload it?
For ALS, if you want real-time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx or prediction.io is what you want. (At Graphflow we've built our own). You hold the factor matrices in memory and do the dot product in real time (with optional caching). Again, even for huge models (10s of millions of users/items) this can be handled on a single, powerful instance. The issue at this scale is winnowing down the search space using LSH or a similar approach to get to real-time speeds.

For word2vec it's pretty much the same thing, as what you have is very similar to one of the ALS factor matrices. One problem is you can't access the word2vec vectors, as they are a private val. I think this should be changed actually, so that just the word vectors could be saved and used in a serving layer.

—
Sent from Mailbox

On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs.
Re: where is the org.apache.spark.util package?
i found the util package under the spark core package, but now i get this error: "Symbol Utils is inaccessible from this place". what does this error mean? org.apache.spark.util and org.apache.spark.util.Utils are there now. thanks.
Re: deploying a model built in mllib
Thanks for letting me know about this, it looks pretty interesting. From reading the documentation it seems that the server must be built on a Spark cluster, is that correct? Is it possible to deploy it in on a Java server? That is how we are currently running our web app. On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote: The latest version of PredictionIO, which is now under Apache 2 license, supports the deployment of MLlib models on production. The engine you build will including a few components, such as: - Data - includes Data Source and Data Preparator - Algorithm(s) - Serving I believe that you can do the feature vector creation inside the Data Preparator component. Currently, the package comes with two templates: 1) Collaborative Filtering Engine Template - with MLlib ALS; 2) Classification Engine Template - with MLlib Naive Bayes. The latter one may be useful to you. And you can customize the Algorithm component, too. I have just created a doc: http://docs.prediction.io/0.8.1/templates/ Love to hear your feedback! Regards, Simon On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Would pipelining include model export? I didn't see that in the documentation. Are there ways that this is being done currently? On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com wrote: We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I have been prototyping a text classification model that my company would like to eventually put into production. Our technology stack is currently Java based but we would like to be able to build our models in Spark/MLlib and then export something like a PMML file which can be used for model scoring in real-time. I have been using scikit learn where I am able to take the training data convert the text data into a sparse data format and then take the other features and use the dictionary vectorizer to do one-hot encoding for the other categorical variables. All of those things seem to be possible in mllib but I am still puzzled about how that can be packaged in such a way that the incoming data can be first made into feature vectors and then evaluated as well. Are there any best practices for this type of thing in Spark? I hope this is clear but if there are any confusions then please let me know. Thanks, Chirag
Re: word2vec: how to save an mllib model and reload it?
hi nick.. sorry about the confusion. originally i had a question specifically about word2vec, but my follow-up question on the distributed model is a more general question about saving different types of models.

on the distributed model, i was hoping to implement model parallelism, so that different workers can work on different parts of the model, and then merge the results at the end into the single master model.

On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. I guess the question is why do you need a distributed model? Is the vocab size so large that it's necessary?
Re: word2vec: how to save an mllib model and reload it?
yep, but that's only if they are already represented as RDDs, which is much more convenient for saving and loading. my question is for the use case where they are not represented as RDDs yet. then, do you think it makes sense to convert them into RDDs, just for the convenience of saving and loading them distributedly?

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs.
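As a sketch of that conversion, assuming the model's vectors were exposed as a local Map[String, Array[Float]] named wordVectors (hypothetical, since the field is currently private):

// Save: turn the local map into an RDD and write it out (paths are placeholders).
sc.parallelize(wordVectors.toSeq).saveAsObjectFile("hdfs://.../word2vec")

// Load: read the pairs back and collect into a local map again.
val restored: Map[String, Array[Float]] =
  sc.objectFile[(String, Array[Float])]("hdfs://.../word2vec").collect().toMap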
Re: word2vec: how to save an mllib model and reload it?
thanks nick. i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed it in the word2vec source code. but i'm running into some compilation issues now. hopefully i can fix them soon, to get things going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
For ALS, if you want real-time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx or prediction.io is what you want.
Re: AVRO specific records
Ok, that turned out to be a dependency issue with Hadoop1 vs. Hadoop2 that I have not fully solved yet. I am able to run with Hadoop1 and AVRO in standalone mode, but not with Hadoop2 (even after trying to fix the dependencies).

Anyway, I am now trying to write to AVRO, using a very similar snippet to the one used to read from AVRO:

val withValues: RDD[(AvroKey[Subscriber], NullWritable)] = records.map{s => (new AvroKey(s), NullWritable.get)}
val outPath = "myOutputPath"
val writeJob = new Job()
FileOutputFormat.setOutputPath(writeJob, new Path(outPath))
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
writeJob.setOutputFormatClass(classOf[AvroKeyOutputFormat[Any]])
records.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)

Now, my problem is that this writes to a plain text file. I need to write to binary AVRO. What am I missing?

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Thu, Nov 6, 2014 at 3:15 PM, Simone Franzini captainfr...@gmail.com wrote:
Benjamin,
Thanks for the snippet. I have tried using it, but unfortunately I get the following exception. I am clueless at what might be wrong. Any ideas?

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Wed, Nov 5, 2014 at 4:24 PM, Laird, Benjamin benjamin.la...@capitalone.com wrote:
Something like this works and is how I create an RDD of specific records:

val avroRdd = sc.newAPIHadoopFile("twitter.avro",
  classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[twitter_schema]],
  classOf[NullWritable], conf)

(From https://github.com/julianpeeters/avro-scala-macro-annotation-examples/blob/master/spark/src/main/scala/AvroSparkScala.scala)
Keep in mind you'll need to use the kryo serializer as well.

From: Frank Austin Nothaft fnoth...@berkeley.edu
Date: Wednesday, November 5, 2014 at 5:06 PM
To: Simone Franzini captainfr...@gmail.com
Cc: user@spark.apache.org
Subject: Re: AVRO specific records

Hi Simone,
Matt Massie put together a good tutorial on his blog: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. If you’re looking for more code using Avro, we use it pretty extensively in our genomics project.
Our Avro schemas are here https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl, and we have serialization code here https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization. We use Parquet for storing the Avro records, but there is also an Avro HadoopInputFormat. Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Nov 5, 2014, at 1:25 PM, Simone Franzini captainfr...@gmail.com wrote: How can I read/write AVRO specific records? I found several snippets using generic records, but nothing with specific records so far. Thanks, Simone Franzini, PhD http://www.linkedin.com/in/simonefranzini
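A hedged guess at the plain-text output problem above: the snippet calls saveAsNewAPIHadoopFile on the original records RDD rather than on the keyed withValues pair RDD, so the output format never sees AvroKey/NullWritable pairs. A minimal sketch of the corrected write, reusing the names (records, Subscriber, outPath) from the snippet in the thread:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val withValues = records.map { s => (new AvroKey(s), NullWritable.get) }
val writeJob = new Job()
AvroJob.setOutputKeySchema(writeJob, Subscriber.getClassSchema())
// save the keyed pair RDD, not the original records RDD
withValues.saveAsNewAPIHadoopFile(outPath,
  classOf[AvroKey[Subscriber]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[Subscriber]],
  writeJob.getConfiguration)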
Re: sparse x sparse matrix multiplication
If you have a very large and very sparse matrix represented as (i, j, value) entries, then you can try the algorithms mentioned in the post https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA brought up earlier. Reza On Fri, Nov 7, 2014 at 8:31 AM, Duy Huynh duy.huynh@gmail.com wrote: thanks reza. i'm not familiar with the block matrix multiplication, but is it a good fit for very large dimension, but extremely sparse matrix? if not, what is your recommendation on implementing matrix multiplication in spark on very large dimension, but extremely sparse matrix? On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh r...@databricks.com wrote: See this thread for examples of sparse matrix x sparse matrix: https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA We thought about providing matrix multiplies on CoordinateMatrix, however, the matrices have to be very dense for the overhead of having many little (i, j, value) objects to be worth it. For this reason, we are focused on doing block matrix multiplication first. The goal is version 1.3. Best, Reza On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan w...@us.ibm.com wrote: I think Xiangrui's ALS code implements certain aspects of it. You may want to check it out. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center From: Xiangrui Meng men...@gmail.com To: Duy Huynh duy.huynh@gmail.com Cc: user u...@spark.incubator.apache.org Date: 11/05/2014 01:13 PM Subject: Re: sparse x sparse matrix multiplication -- You can use breeze for local sparse-sparse matrix multiplication and then define an RDD of sub-matrices RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix) and then use join and aggregateByKey to implement this feature, which is the same as in MapReduce. -Xiangrui - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
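A minimal sketch of the block approach Xiangrui describes, assuming breeze's CSCMatrix supports local sparse multiply and add (the block keying, the names, and the use of reduceByKey in place of aggregateByKey are illustrative choices, not the eventual MLlib implementation):

import breeze.linalg.CSCMatrix
import org.apache.spark.rdd.RDD

type Block = ((Int, Int), CSCMatrix[Double])  // ((blockRowId, blockColId), sub-matrix)

def blockMultiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
  // key A by its block column and B by its block row so compatible blocks meet in the join
  val aByCol = a.map { case ((i, k), m) => (k, (i, m)) }
  val bByRow = b.map { case ((k, j), m) => (k, (j, m)) }
  aByCol.join(bByRow)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }  // local sparse multiply
    .reduceByKey(_ + _)  // sum the partial products for each output block
}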
Re: word2vec: how to save an mllib model and reload it?
Just want to elaborate more on Duy's suggestion on using PredictionIO. PredictionIO will store the model automatically if you return it in the training function. An example using CF: def train(data: PreparedData): PersistentMatrixFactorizationModel = { val m = ALS.train(data.ratings, ap.rank, ap.numIterations, ap.lambda) new PersistentMatrixFactorizationModel( rank = m.rank, userFeatures = m.userFeatures, productFeatures = m.productFeatures) } And the persisted model will be passed to the predict function when you query for prediction: def predict( model: PersistentMatrixFactorizationModel, query: Query): PredictedResult = { val productScores = model.recommendProducts(query.user, query.num) .map(r => ProductScore(r.product, r.rating)) new PredictedResult(productScores) } Some templates and tutorials for MLlib are here: http://docs.prediction.io/0.8.1/templates/ Simon On Fri, Nov 7, 2014 at 10:11 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Sure - in theory this sounds great. But in practice it's much faster and a whole lot simpler to just serve the model from a single instance in memory. Optionally you can multithread within that (as Oryx 1 does). There are very few real world use cases where the model is so large that it HAS to be distributed. Having said this, it's certainly possible to distribute model serving for factor-like models (like ALS). One idea I'm working on now is using Elasticsearch for exactly this purpose - but that's more because I'm using it for filtering of recommendation results and combining with search, so overall it's faster to do it this way. For the pure matrix algebra part, single instance in memory is way faster. — Sent from Mailbox https://www.dropbox.com/mailbox On Fri, Nov 7, 2014 at 8:00 PM, Duy Huynh duy.huynh@gmail.com wrote: hi nick.. sorry about the confusion. originally i had a question specifically about word2vec, but my follow up question on distributed model is a more general question about saving different types of models. on distributed model, i was hoping to implement a model parallelism, so that different workers can work on different parts of the models, and then merge the results at the end into the single master model. On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. I guess the question is why do you need a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (ie cannot fit into memory on a modern high end box with 64, or 128GB ram) then single instance is way faster and simpler (using a cluster of machines is more for load balancing / fault tolerance). What is your use case for model serving? — Sent from Mailbox https://www.dropbox.com/mailbox On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated. On Fri, Nov 7, 2014 at 2:53 AM, Evan R.
Sparks evan.spa...@gmail.com wrote: There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with matlab. In python either pickling things or dumping to json seems pretty common. (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent to java serialization to me... Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add? On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote: that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Plain old java serialization is one straightforward approach if you're in java/scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec model... thanks. -- View this message in context:
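A minimal sketch of the "plain old java serialization" approach for a model that lives on the driver (as the word2vec model does, per the discussion above); it assumes the model class is java-serializable, which should be checked for the specific Spark version:

import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

def saveModel(model: java.io.Serializable, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel[M](path: String): M = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[M] finally in.close()
}

Note this only works for local (driver-side) models; RDD-backed models such as MatrixFactorizationModel are a different story, as the serialization thread later in this digest discusses.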
Re: Dynamically InferSchema From Hive and Create parquet file
Perhaps if you can describe what you are trying to accomplish at a high level it'll be easier to help. On Fri, Nov 7, 2014 at 12:28 AM, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: Any idea on this? From: Jahagirdar, Madhu Sent: Thursday, November 06, 2014 12:28 PM To: Michael Armbrust Cc: u...@spark.incubator.apache.org Subject: RE: Dynamically InferSchema From Hive and Create parquet file When I create a Hive table with Parquet format, it does not create any metadata until data is inserted. So the data needs to be there before I infer the schema, otherwise it throws an error. Any workaround for this? From: Michael Armbrust [mich...@databricks.com] Sent: Thursday, November 06, 2014 12:27 AM To: Jahagirdar, Madhu Cc: u...@spark.incubator.apache.org Subject: Re: Dynamically InferSchema From Hive and Create parquet file That method is for creating a new directory to hold parquet data when there is no hive metastore available, thus you have to specify the schema. If you've already created the table in the metastore you can just query it using the sql method: javahiveContext.sql("SELECT * FROM parquetTable"); You can also load the data as a SchemaRDD without using the metastore since parquet is self describing: javahiveContext.parquetFile(".../path/to/parquetFiles").registerTempTable("parquetData") On Wed, Nov 5, 2014 at 2:15 AM, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: Currently the createParquetFile method needs a BeanClass as one of the parameters. javahiveContext.createParquetFile(XBean.class, IMPALA_TABLE_LOC, true, new Configuration()) .registerTempTable(TEMP_TABLE_NAME); Is it possible that we dynamically infer the schema from Hive using the hive context and the table name, and then give that schema? Regards, Madhu Jahagirdar
partitioning to speed up queries
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday) SELECT id,email,dt_cr, to_date(dt_cr) as createday where dt_cr > unix_timestamp('2013-01-01 00:00:00') and dt_cr < unix_timestamp('2013-12-31 23:59:59'); set exec.dynamic.partition=true; set exec.dynamic.partition.mode=nonstrict; however when I run the following basic tests I get this type of performance [localhost:1] shark select count(*) from part_customers_cached where createday >= '2014-08-01' and createday <= '2014-12-06'; 37204 Time taken (including network latency): 3.131 seconds [localhost:1] shark SELECT count(*) from customers_cached where dt_cr > unix_timestamp('2013-08-01 00:00:00') and dt_cr < unix_timestamp('2013-12-06 23:59:59'); 37204 Time taken (including network latency): 1.538 seconds I'm running this on a cluster with one master and two slaves and was hoping that the partitioned table would be noticeably faster but it looks as though the partitioning has slowed things down... Is this the case, or is there some additional configuration that I need to do to speed things up? Best Wishes, Gordon
Multiple Applications(Spark Contexts) Concurrently Fail With Broadcast Error
We are unable to run more than one application at a time using Spark 1.0.0 on CDH5. We submit two applications using two different SparkContexts on the same Spark Master. The Spark Master was started using the following command and parameters and is running in standalone mode: /usr/java/jdk1.7.0_55-cloudera/bin/java -XX:MaxPermSize=128m -Djava.net.preferIPv4Stack=true -Dspark.akka.logLifecycleEvents=true -Xms8589934592 -Xmx8589934592 org.apache.spark.deploy.master.Master --ip ip-10-186-155-45.ec2.internal When submitting this application by itself it finishes and all of the data comes out happy. The problem occurs when trying to run another application while an existing application is still processing, and we get an error stating that the spark contexts were shut down prematurely. The errors can be viewed in the following pastebins. All IP addresses have been changed to 1.1.1.1 for security reasons. Notice that on the top of the logs we have printed out the spark config stuff for reference. The working logs: Working Pastebin http://pastebin.com/CnitnMhy The broken logs: Broken Pastebin http://pastebin.com/VGs87bBZ We have also included the worker logs. For the second app, we see in the work/app/ directory 7 additional directories: `0/ 1/ 2/ 3/ 4/ 5/ 6/`. There are then two different groups of errors. The first three are one group and the other 4 are the other group of errors. Worker log for broken app group 1: Broken App Group 1 http://pastebin.com/7VwZ1Gwu Worker log for broken app group 2: Broken App Group 2 http://pastebin.com/shs4d8T4 Worker log for working app: available upon request. The two different errors are the last lines of both groups and are: Received LaunchTask command but executor was null Slave registration failed: Duplicate executor ID: 4 tl;dr We are unable to run more than one application in the same spark master using different spark contexts. The only errors we see are broadcast errors. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-Applications-Spark-Contexts-Concurrently-Fail-With-Broadcast-Error-tp18374.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Still struggling with building documentation
I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive one: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta
jsonRdd and MapType
I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). I'd like some of the json fields to be in a MapType rather than a sub StructType, as the keys will be very sparse. For example: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD val jsonRdd = sc.parallelize(Seq("""{"key": 1234, "attributes": {"gender": "m"}}""", """{"key": 4321, "attributes": {"location": "nyc"}}""")) val schemaRdd = sqlContext.jsonRDD(jsonRdd) schemaRdd.printSchema root |-- attributes: struct (nullable = true) | |-- gender: string (nullable = true) | |-- location: string (nullable = true) |-- key: string (nullable = true) schemaRdd.collect res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234], [[null,nyc],4321]) However this isn't what I want. So I created my own StructType to pass to the jsonRDD call: import org.apache.spark.sql._ val st = StructType(Seq(StructField("key", StringType, false), StructField("attributes", MapType(StringType, StringType, false)))) val jsonRddSt = sc.parallelize(Seq("""{"key": 1234, "attributes": {"gender": "m"}}""", """{"key": 4321, "attributes": {"location": "nyc"}}""")) val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st) schemaRddSt.printSchema root |-- key: string (nullable = false) |-- attributes: map (nullable = true) | |-- key: string | |-- value: string (valueContainsNull = false) schemaRddSt.collect *** Failure *** scala.MatchError: MapType(StringType,StringType,false) (of class org.apache.spark.sql.catalyst.types.MapType) at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397) ... The schema of the schemaRDD is correct. But it seems that the json cannot be coerced to a MapType. I can see at the line in the stack trace that there is no case statement for MapType. Is there something I'm missing? Is this a bug or a decision to not support MapType with json? Thanks, Brian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Still struggling with building documentation
I believe the web docs need to be built separately according to the instructions here https://github.com/apache/spark/blob/master/docs/README.md. Did you give those a shot? It's annoying to have a separate thing with new dependencies in order to build the web docs, but that's how it is at the moment. Nick On Fri, Nov 7, 2014 at 3:39 PM, Alessandro Baretta alexbare...@gmail.com wrote: I finally came to realize that there is a special maven target to build the scaladocs, although arguably a very unintuitive on: mvn verify. So now I have scaladocs for each package, but not for the whole spark project. Specifically, build/docs/api/scala/index.html is missing. Indeed the whole build/docs/api directory referenced in api.html is missing. How do I build it? Alex Baretta
Re: Any patterns for multiplexing the streaming data
I am not aware of any obvious existing pattern that does exactly this. Generally these sorts of computations (subsetting, denormalization) sound like generic terms but actually have very specific requirements, so it is hard to point to a design pattern without more requirement info. If you want to feed back to kafka, you can take a look at this pull request https://github.com/apache/spark/pull/2994 On Thu, Nov 6, 2014 at 4:15 PM, bdev buntu...@gmail.com wrote: We are looking at consuming the kafka stream using Spark Streaming, transforming it into various subsets (applying some transformation, de-normalizing some fields, etc.) and feeding it back into Kafka as a different topic for downstream consumers. Wanted to know if there are any existing patterns for achieving this. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-patterns-for-multiplexing-the-streaming-data-tp18303.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
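A rough sketch of the consume-transform-republish loop with the Kafka 0.8 producer API; the zookeeper/broker addresses, topic names, and the denormalize function are placeholders, and the pull request referenced above may offer a cleaner writer:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
val in = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("in-topic" -> 1)).map(_._2)
val out = in.map(denormalize)  // placeholder transformation

out.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // create one producer per partition, on the worker side
    val props = new Properties()
    props.put("metadata.broker.list", "broker:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    records.foreach(r => producer.send(new KeyedMessage[String, String]("out-topic", r)))
    producer.close()
  }
}
ssc.start()
ssc.awaitTermination()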
Re: jsonRdd and MapType
Hello Brian, Right now, MapType is not supported in the StructType provided to jsonRDD/jsonFile. We will add the support. I have created https://issues.apache.org/jira/browse/SPARK-4302 to track this issue. Thanks, Yin On Fri, Nov 7, 2014 at 3:41 PM, boclair bocl...@gmail.com wrote: I'm loading json into spark to create a schemaRDD (sqlContext.jsonRDD(..)). I'd like some of the json fields to be in a MapType rather than a sub StructType, as the keys will be very sparse. For example: val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD val jsonRdd = sc.parallelize(Seq("""{"key": 1234, "attributes": {"gender": "m"}}""", """{"key": 4321, "attributes": {"location": "nyc"}}""")) val schemaRdd = sqlContext.jsonRDD(jsonRdd) schemaRdd.printSchema root |-- attributes: struct (nullable = true) | |-- gender: string (nullable = true) | |-- location: string (nullable = true) |-- key: string (nullable = true) schemaRdd.collect res1: Array[org.apache.spark.sql.Row] = Array([[m,null],1234], [[null,nyc],4321]) However this isn't what I want. So I created my own StructType to pass to the jsonRDD call: import org.apache.spark.sql._ val st = StructType(Seq(StructField("key", StringType, false), StructField("attributes", MapType(StringType, StringType, false)))) val jsonRddSt = sc.parallelize(Seq("""{"key": 1234, "attributes": {"gender": "m"}}""", """{"key": 4321, "attributes": {"location": "nyc"}}""")) val schemaRddSt = sqlContext.jsonRDD(jsonRddSt, st) schemaRddSt.printSchema root |-- key: string (nullable = false) |-- attributes: map (nullable = true) | |-- key: string | |-- value: string (valueContainsNull = false) schemaRddSt.collect *** Failure *** scala.MatchError: MapType(StringType,StringType,false) (of class org.apache.spark.sql.catalyst.types.MapType) at org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:397) ... The schema of the schemaRDD is correct. But it seems that the json cannot be coerced to a MapType. I can see at the line in the stack trace that there is no case statement for MapType. Is there something I'm missing? Is this a bug or a decision to not support MapType with json? Thanks, Brian -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-tp18376.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
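Until SPARK-4302 lands, one hedged workaround is to parse the sparse attributes outside of jsonRDD, for example with Scala's built-in JSON parser (slow, but enough to illustrate the shape of the data):

val pairs = jsonRdd.flatMap { s =>
  scala.util.parsing.json.JSON.parseFull(s) match {
    case Some(m: Map[_, _]) =>
      val obj = m.asInstanceOf[Map[String, Any]]
      Some((obj("key").toString, obj("attributes").asInstanceOf[Map[String, String]]))
    case _ => None
  }
}
// pairs: RDD[(String, Map[String, String])]; note the parser reads JSON numbers
// as Double, so "key" comes back as "1234.0" in this sketch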
Re: deploying a model built in mllib
Hi Chirag, Could you please provide more information on your Java server environment? Regards, Donald On Fri, Nov 7, 2014 at 9:57 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Thanks for letting me know about this, it looks pretty interesting. From reading the documentation it seems that the server must be built on a Spark cluster, is that correct? Is it possible to deploy it on a Java server? That is how we are currently running our web app. On Tue, Nov 4, 2014 at 7:57 PM, Simon Chan simonc...@gmail.com wrote: The latest version of PredictionIO, which is now under Apache 2 license, supports the deployment of MLlib models on production. The engine you build will include a few components, such as: - Data - includes Data Source and Data Preparator - Algorithm(s) - Serving I believe that you can do the feature vector creation inside the Data Preparator component. Currently, the package comes with two templates: 1) Collaborative Filtering Engine Template - with MLlib ALS; 2) Classification Engine Template - with MLlib Naive Bayes. The latter one may be useful to you. And you can customize the Algorithm component, too. I have just created a doc: http://docs.prediction.io/0.8.1/templates/ Love to hear your feedback! Regards, Simon On Mon, Oct 27, 2014 at 11:03 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Would pipelining include model export? I didn't see that in the documentation. Are there ways that this is being done currently? On Mon, Oct 27, 2014 at 12:39 PM, Xiangrui Meng men...@gmail.com wrote: We are working on the pipeline features, which would make this procedure much easier in MLlib. This is still a WIP and the main JIRA is at: https://issues.apache.org/jira/browse/SPARK-1856 Best, Xiangrui On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani chirag.lakh...@gmail.com wrote: Hello, I have been prototyping a text classification model that my company would like to eventually put into production. Our technology stack is currently Java based but we would like to be able to build our models in Spark/MLlib and then export something like a PMML file which can be used for model scoring in real-time. I have been using scikit learn where I am able to take the training data convert the text data into a sparse data format and then take the other features and use the dictionary vectorizer to do one-hot encoding for the other categorical variables. All of those things seem to be possible in mllib but I am still puzzled about how that can be packaged in such a way that the incoming data can be first made into feature vectors and then evaluated as well. Are there any best practices for this type of thing in Spark? I hope this is clear but if there are any confusions then please let me know. Thanks, Chirag -- Donald Szeto PredictionIO
Re: Parallelize on spark context
Naveen, Don't be worried - you're not the only one to be bitten by this. A little inspection of the Javadoc told me you have this other option: JavaRDD<Integer> distData = sc.parallelize(data, 100); -- Now the RDD is split into 100 partitions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parallelize-on-spark-context-tp18327p18381.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark streaming: stderr does not roll
We are running spark streaming jobs (version 1.1.0). After a sufficient amount of time, the stderr file grows until the disk is full at 100% and crashes the cluster. I've read this https://github.com/apache/spark/pull/895 and also read this http://spark.apache.org/docs/latest/configuration.html#spark-streaming So I've tried testing with this in an attempt to get the stderr log file to roll. sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") Yet it does not roll and continues to grow. Am I missing something obvious? thanks, Duc
Integrating Spark with other applications
Hi, I have been working on Spark SQL and want to expose this functionality to other applications. The idea is to let other applications send SQL to be executed on the Spark cluster and get the result back. I looked at the spark job server (https://github.com/ooyala/spark-jobserver) but it provides a RESTful interface. I am looking for something similar to spring-hadoop (http://projects.spring.io/spring-hadoop/) to do a spark-submit programmatically. Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Integrating Spark with other applications
Hi, I'm a committer on that spring-hadoop project and I'm also interested in integrating Spark with other Java applications. I would love to see some guidance from the Spark community for the best way to accomplish this. We have plans to add features to work with Spark Apps in similar ways we now support Hive and Pig jobs in the spring-hadoop project. In fact, I added a spring-hadoop-spark sub-project earlier, but there is no real code there yet. Hoping to get this added soon, so some helpful pointers would be great. -Thomas [1] https://github.com/spring-projects/spring-hadoop/tree/master/spring-hadoop-spark/src/main/java/org/springframework/data/hadoop/spark On Fri, Nov 7, 2014 at 5:42 PM, gtinside gtins...@gmail.com wrote: Hi , I have been working on Spark SQL and want to expose this functionality to other applications. Idea is to let other applications to send sql to be executed on spark cluster and get the result back. I looked at spark job server (https://github.com/ooyala/spark-jobserver) but it provides a RESTful interface. I am looking for something similar as spring-hadoop(http://projects.spring.io/spring-hadoop/) to do a spark-submit programmatically. Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Integrating-Spark-with-other-applications-tp18383.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
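Until there is first-class integration, a common approach is simply to create the contexts inside the host JVM rather than going through spark-submit. A minimal sketch, where the master URL and app name are assumptions and the application's jars would still need to be shipped to the cluster (e.g. via SparkConf.setJars):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setMaster("spark://master:7077")  // assumed standalone cluster URL
  .setAppName("embedded-sql-service")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// expose something like this to the rest of the application
def runQuery(query: String): Array[org.apache.spark.sql.Row] =
  sqlContext.sql(query).collect()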
Re: error when importing HiveContext
bin/pyspark will set up the PYTHONPATH of py4j for you, or you need to set it up yourself. export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip On Fri, Nov 7, 2014 at 8:15 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: I’m getting this error when importing hive context from pyspark.sql import HiveContext Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/spark-1.1.0/python/pyspark/__init__.py", line 63, in <module> from pyspark.context import SparkContext File "/path/spark-1.1.0/python/pyspark/context.py", line 30, in <module> from pyspark.java_gateway import launch_gateway File "/path/spark-1.1.0/python/pyspark/java_gateway.py", line 26, in <module> from py4j.java_gateway import java_import, JavaGateway, GatewayClient ImportError: No module named py4j.java_gateway I cannot find py4j on my system. Where is it? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark context not defined
I'm running the latest version of spark with Hadoop 1.x and scala 2.9.3 and hive 0.9.0. When using python 2.7 from pyspark.sql import HiveContext sqlContext = HiveContext(sc) I'm getting 'sc not defined' On the other hand, I can see 'sc' from pyspark CLI. Is there a way to fix it?
MatrixFactorizationModel serialization
I am trying to persist MatrixFactorizationModel (Collaborative Filtering example) and use it in another script to evaluate/apply it. This is the exception I get when I try to use a deserialized model instance: Exception in thread "main" java.lang.NullPointerException at org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58) at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58) at scala.math.Ordering$$anon$5.compare(Ordering.scala:122) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324) at java.util.TimSort.sort(TimSort.java:189) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at scala.collection.SeqLike$class.sorted(SeqLike.scala:615) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594) at scala.collection.AbstractSeq.sortBy(Seq.scala:40) at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58) at org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536) at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57) ... Is this model serializable at all? I noticed it has two RDDs inside (user and product features). Thanks, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: MatrixFactorizationModel serialization
Serializable like a Java object? No, it's an RDD. A factored matrix model is huge, unlike most models, and is not a local object. You can of course persist the RDDs to storage manually and read them back. On Fri, Nov 7, 2014 at 11:33 PM, Dariusz Kobylarz darek.kobyl...@gmail.com wrote: I am trying to persist MatrixFactorizationModel (Collaborative Filtering example) and use it in another script to evaluate/apply it. This is the exception I get when I try to use a deserialized model instance: Exception in thread "main" java.lang.NullPointerException at org.apache.spark.rdd.CoGroupedRDD$$anonfun$getPartitions$1.apply$mcVI$sp(CoGroupedRDD.scala:103) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.rdd.CoGroupedRDD.getPartitions(CoGroupedRDD.scala:101) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedValuesRDD.getPartitions(MappedValuesRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.FlatMappedValuesRDD.getPartitions(FlatMappedValuesRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58) at org.apache.spark.Partitioner$$anonfun$2.apply(Partitioner.scala:58) at scala.math.Ordering$$anon$5.compare(Ordering.scala:122) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:324) at java.util.TimSort.sort(TimSort.java:189) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at scala.collection.SeqLike$class.sorted(SeqLike.scala:615) at scala.collection.AbstractSeq.sorted(Seq.scala:40) at scala.collection.SeqLike$class.sortBy(SeqLike.scala:594) at scala.collection.AbstractSeq.sortBy(Seq.scala:40) at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:58) at org.apache.spark.rdd.PairRDDFunctions.join(PairRDDFunctions.scala:536) at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:57) ... Is this model serializable at all? I noticed it has two RDDs inside (user and product features). Thanks, - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
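A sketch of the manual route Sean mentions: save the two factor RDDs as object files and rebuild the model on load. The paths are illustrative, and depending on the Spark version the MatrixFactorizationModel constructor may be package-private, in which case a small wrapper placed in the same package is needed:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def saveFactors(model: MatrixFactorizationModel, dir: String): Unit = {
  model.userFeatures.saveAsObjectFile(dir + "/userFeatures")
  model.productFeatures.saveAsObjectFile(dir + "/productFeatures")
}

def loadFactors(sc: SparkContext, dir: String, rank: Int): MatrixFactorizationModel = {
  val userFeatures = sc.objectFile[(Int, Array[Double])](dir + "/userFeatures")
  val productFeatures = sc.objectFile[(Int, Array[Double])](dir + "/productFeatures")
  // assumes the constructor is accessible in your Spark version
  new MatrixFactorizationModel(rank, userFeatures, productFeatures)
}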
SparkPi endlessly in yarnAppState: ACCEPTED
I'm using Cloudera 5.1.3, and I'm repeatedly getting the following output after submitting the SparkPi example in yarn cluster mode (http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_running_spark_apps.html) using: spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn $SPARK_HOME/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.3.jar 10 Output (repeated): 14/11/07 19:33:05 INFO Client: Application report from ASM: application identifier: application_1415303569855_1100 appId: 1100 clientToAMToken: null appDiagnostics: appMasterHost: N/A appQueue: root.yp appMasterRpcPort: -1 appStartTime: 1415406486231 yarnAppState: ACCEPTED distributedFinalState: UNDEFINED I'll note that spark-submit is working correctly when running with master local on the edge node. Any ideas how to solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkPi-endlessly-in-yarnAppState-ACCEPTED-tp18391.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkPi endlessly in yarnAppState: ACCEPTED
Sounds like no free yarn workers. i.e. try running: hadoop jar hadoop-mapreduce-examples-2.1.0-beta.jar pi 1 1 We have some smoke tests which you might find particularly useful for yarn clusters as well in https://github.com/apache/bigtop, underneath bigtop-tests/smoke-tests, which are generally good to run on any basic hadoop cluster when you first set it up. Often if a Yarn job hangs in accepted state, it is waiting for resources to free up to start the tasks... On Nov 7, 2014, at 7:40 PM, YaoPau jonrgr...@gmail.com wrote: appStartTime
Re: PySpark issue with sortByKey: IndexError: list index out of range
Could you tell us how large the data set is? It will help us debug this issue. On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote: I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it worked fine. On a larger data set I got this error: Traceback (most recent call last): File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module> .sortByKey(lambda x: x) File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey bounds.append(samples[index]) IndexError: list index out of range -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark 1.1.0 Can not read snappy compressed sequence file
I first saw this using SparkSQL but the result is the same with plain Spark. 14/11/07 19:46:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1) java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method) at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63) Full stack below. I tried many different things without luck: * extract the libsnappyjava.so from the Spark assembly and put it on the library path * Added -Djava.library.path=... to SPARK_MASTER_OPTS and SPARK_WORKER_OPTS * added library path to SPARK_LIBRARY_PATH * added hadoop library path to SPARK_LIBRARY_PATH * Rebuilt spark with different versions (previous and next) of Snappy (as seen when Google-ing) Env : Centos 6.4 Hadoop 2.3 (CDH5.1) Running in standalone/local mode Any help would be appreciated. Thank you Stephane scala> import org.apache.hadoop.io.BytesWritable import org.apache.hadoop.io.BytesWritable scala> import org.apache.hadoop.io.Text import org.apache.hadoop.io.Text scala> import org.apache.hadoop.io.NullWritable import org.apache.hadoop.io.NullWritable scala> var seq = sc.sequenceFile[NullWritable,Text]("/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0").map(_._2.toString()) 14/11/07 19:46:19 INFO MemoryStore: ensureFreeSpace(157973) called with curMem=0, maxMem=278302556 14/11/07 19:46:19 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 154.3 KB, free 265.3 MB) seq: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at <console>:15 scala> seq.collect().foreach(println) 14/11/07 19:46:35 INFO FileInputFormat: Total input paths to process : 1 14/11/07 19:46:35 INFO SparkContext: Starting job: collect at <console>:18 14/11/07 19:46:35 INFO DAGScheduler: Got job 0 (collect at <console>:18) with 2 output partitions (allowLocal=false) 14/11/07 19:46:35 INFO DAGScheduler: Final stage: Stage 0(collect at <console>:18) 14/11/07 19:46:35 INFO DAGScheduler: Parents of final stage: List() 14/11/07 19:46:35 INFO DAGScheduler: Missing parents: List() 14/11/07 19:46:35 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at map at <console>:15), which has no missing parents 14/11/07 19:46:35 INFO MemoryStore: ensureFreeSpace(2928) called with curMem=157973, maxMem=278302556 14/11/07 19:46:35 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.9 KB, free 265.3 MB) 14/11/07 19:46:36 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[2] at map at <console>:15) 14/11/07 19:46:36 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/11/07 19:46:36 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1243 bytes) 14/11/07 19:46:36 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1243 bytes) 14/11/07 19:46:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) 14/11/07 19:46:36 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 14/11/07 19:46:36 INFO HadoopRDD: Input split: file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:6504064+6504065 14/11/07 19:46:36 INFO HadoopRDD: Input split: file:/home/lfs/warehouse/base.db/mytable/event_date=2014-06-01/00_0:0+6504064 14/11/07 19:46:36 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/11/07 19:46:36 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap 14/11/07 19:46:36 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition 14/11/07 19:46:36 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/11/07 19:46:36 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/11/07 19:46:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z at org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method) at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63) at org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190) at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1915) at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759) at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773) at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49) at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197) at
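One thing that often resolves this on CDH is pointing Spark's library path at the Hadoop native libraries explicitly; the parcel path below is an assumption and should be adjusted to the local install:

// executor side: point Spark at the Hadoop native libs (assumed CDH parcel layout)
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraLibraryPath", "/opt/cloudera/parcels/CDH/lib/hadoop/lib/native")
// driver side: the JVM is already running when the app starts, so pass the
// equivalent path to spark-submit via --driver-library-path instead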
How to add elements into map?
Here is the code I run in spark-shell: val table = sc.textFile(args(1)) val histMap = collection.mutable.Map[Int,Int]() for (x <- table) { val tuple = x.split('|') histMap.put(tuple(0).toInt, 1) } Why is histMap still empty? Is there something wrong with my code? Thanks, Fang
Fwd: How to add elements into map?
Here is the code I run in spark-shell: val table = sc.textFile(args(1)) val histMap = collection.mutable.Map[Int,Int]() for (x <- table) { val tuple = x.split('|') histMap.put(tuple(0).toInt, 1) } Why is histMap still empty? Is there something wrong with my code? Thanks, Tim
Re: Fwd: Why is Spark not using all cores on a single machine?
hi. i did use local[8] as below, but it still ran on only 1 core. val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("abc")) any advice is much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Fwd: Why is Spark not using all cores on a single machine?
To set the number of spark cores used you must set two parameters in the actual spark-submit script. You must set num-executors (the number of nodes to have) and executor-cores (the number of cores per machine). Please see the Spark configuration and tuning pages for more details. -Original Message- From: ll [duy.huynh@gmail.com] Sent: Saturday, November 08, 2014 12:05 AM Eastern Standard Time To: u...@spark.incubator.apache.org Subject: Re: Fwd: Why is Spark not using all cores on a single machine? hi. i did use local[8] as below, but it still ran on only 1 core. val sc = new SparkContext(new SparkConf().setMaster("local[8]").setAppName("abc")) any advice is much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Why-is-Spark-not-using-all-cores-on-a-single-machine-tp1638p18397.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to add elements into map?
It doesn't work that way: the closure runs on the executors, so it mutates a serialized copy of histMap, not the map on the driver. Following is the correct way: val table = sc.textFile(args(1)) val histMap = table.map(x => (x.split('|')(0).toInt, 1)) - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-elements-into-map-tp18395p18399.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
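A minimal sketch that also aggregates the counts and brings the result back to the driver (counting occurrences per key is an assumption about the intent; the original only flagged each key with a constant 1):

val histMap = table
  .map(x => (x.split('|')(0).toInt, 1))
  .reduceByKey(_ + _)   // counts per key, computed on the cluster
  .collectAsMap()       // bring the final histogram back to the driver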
Re: Viewing web UI after fact
We are running our applications through YARN and are only sometimes seeing them in the History Server. Most do not seem to have the APPLICATION_COMPLETE file. Specifically, any job that ends because of yarn application -kill does not show up. For other ones, what would be a reason for them not to appear in the Spark UI? Is there any update on this? Thanks, Arun On Mon, Sep 15, 2014 at 4:10 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi Andrew, sorry for the late response. Thank you very much for solving my problem. There was no APPLICATION_COMPLETE file in the log directory due to not calling sc.stop() at the end of the program. With stopping the spark context everything works correctly, so thank you again. Best regards, Grzegorz On Fri, Sep 5, 2014 at 8:06 PM, Andrew Or and...@databricks.com wrote: Hi Grzegorz, Can you verify that there are APPLICATION_COMPLETE files in the event log directories? E.g. Does file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If not, it could be that your application didn't call sc.stop(), so the ApplicationEnd event is not actually logged. The HistoryServer looks for this special file to identify applications to display. You could also try manually adding the APPLICATION_COMPLETE file to this directory; the HistoryServer should pick this up and display the application, though the information displayed will be incomplete because the log did not capture all the events (sc.stop() does a final close() on the file written). Andrew 2014-09-05 1:50 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com: Hi Andrew, thank you very much for your answer. Unfortunately it still doesn't work. I'm using Spark 1.0.0, and I start the history server running sbin/start-history-server.sh <dir>, although I also set SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in conf/spark-env.sh. I tried also other dirs than /tmp/spark-events which have all possible permissions enabled. Also adding file: (and file://) didn't help - the history server still shows: History Server Event Log Location: file:/tmp/spark-events/ No Completed Applications Found. Best regards, Grzegorz On Thu, Sep 4, 2014 at 8:20 PM, Andrew Or and...@databricks.com wrote: Hi Grzegorz, Sorry for the late response. Unfortunately, if the Master UI doesn't know about your applications (they are completed with respect to a different Master), then it can't regenerate the UIs even if the logs exist. You will have to use the history server for that. How did you start the history server? If you are using Spark >= 1.0, you can pass the directory as an argument to the sbin/start-history-server.sh script. Otherwise, you may need to set the following in your conf/spark-env.sh to specify the log directory: export SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/tmp/spark-events It could also be a permissions thing. Make sure your logs in /tmp/spark-events are accessible by the JVM that runs the history server. Also, there's a chance that /tmp/spark-events is interpreted as an HDFS path depending on which Spark version you're running. To resolve any ambiguity, you may set the log path to file:/tmp/spark-events instead. But first verify whether they actually exist. Let me know if you get it working, -Andrew 2014-08-19 8:23 GMT-07:00 Grzegorz Białek grzegorz.bia...@codilime.com : Hi, Is there any way to view history of application statistics in the master ui after restarting the master server? I have all logs in /tmp/spark-events/ but when I start the history server in this directory it says No Completed Applications Found.
Maybe I could copy these logs to the dir used by the master server, but I couldn't find any. Or maybe I'm doing something wrong launching the history server. Do you have any idea how to solve it? Thanks, Grzegorz On Thu, Aug 14, 2014 at 10:53 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, Thank you both for your answers. Browsing using the Master UI works fine. Unfortunately the History Server shows No Completed Applications Found even if logs exist under the given directory, but using the Master UI is enough for me. Best regards, Grzegorz On Wed, Aug 13, 2014 at 8:09 PM, Andrew Or and...@databricks.com wrote: The Spark UI isn't available through the same address; otherwise new applications won't be able to bind to it. Once the old application finishes, the standalone Master renders the after-the-fact application UI and exposes it under a different URL. To see this, go to the Master UI (<master-url>:8080) and click on your application in the Completed Applications table. 2014-08-13 10:56 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com: Take a look at http://spark.apache.org/docs/latest/monitoring.html -- you need to launch a history server to serve the logs. Matei On August 13, 2014 at 2:03:08 AM, grzegorz-bialek ( grzegorz.bia...@codilime.com) wrote: Hi, I wanted to access Spark web UI after application
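For completeness, a minimal sketch of the settings this thread circles around, with the local log directory as an assumption; the final sc.stop() is what writes the APPLICATION_COMPLETE marker:

val conf = new org.apache.spark.SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:/tmp/spark-events")
val sc = new org.apache.spark.SparkContext(conf)
// ... run the job ...
sc.stop()  // flushes the event log and writes APPLICATION_COMPLETE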
Re: Using partitioning to speed up queries in Shark
- dev list + user list Shark is not officially supported anymore so you are better off moving to Spark SQL. Shark doesn't support Hive partitioning logic anyway; it has its own version of partitioning on in-memory blocks, but that is independent of whether you partition your data in hive or not. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Nov 7, 2014 at 3:31 AM, Gordon Benjamin gordon.benjami...@gmail.com wrote: Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table that I've also cached in memory on a per day basis FROM customers_cached INSERT OVERWRITE TABLE part_customers_cached PARTITION(createday) SELECT id,email,dt_cr, to_date(dt_cr) as createday where dt_cr > unix_timestamp('2013-01-01 00:00:00') and dt_cr < unix_timestamp('2013-12-31 23:59:59'); set exec.dynamic.partition=true; set exec.dynamic.partition.mode=nonstrict; however when I run the following basic tests I get this type of performance [localhost:1] shark select count(*) from part_customers_cached where createday >= '2014-08-01' and createday <= '2014-12-06'; 37204 Time taken (including network latency): 3.131 seconds [localhost:1] shark SELECT count(*) from customers_cached where dt_cr > unix_timestamp('2013-08-01 00:00:00') and dt_cr < unix_timestamp('2013-12-06 23:59:59'); 37204 Time taken (including network latency): 1.538 seconds I'm running this on a cluster with one master and two slaves and was hoping that the partitioned table would be noticeably faster but it looks as though the partitioning has slowed things down... Is this the case, or is there some additional configuration that I need to do to speed things up? Best Wishes, Gordon