Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError
Try adding all the jars in your $HIVE/lib directory. If you want the specific jar, you could look for jackson or the json serde in it. Thanks Best Regards

On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote: I have a feeling I'm missing a Jar that provides the support, or this may be related to https://issues.apache.org/jira/browse/SPARK-5792. If it is a Jar, where would I find it? I would have thought in the $HIVE/lib folder, but I am not sure which jar contains it. Error:

Create Metric Temporary Table for querying
15/04/01 14:41:44 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/04/01 14:41:44 INFO ObjectStore: ObjectStore, initialize called
15/04/01 14:41:45 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/04/01 14:41:45 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/04/01 14:41:45 INFO BlockManager: Removing broadcast 0
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0 of size 1272 dropped from memory (free 278018571)
15/04/01 14:41:45 INFO BlockManager: Removing block broadcast_0_piece0
15/04/01 14:41:45 INFO MemoryStore: Block broadcast_0_piece0 of size 869 dropped from memory (free 278019440)
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63230 in memory (size: 869.0 B, free: 265.1 MB)
15/04/01 14:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/04/01 14:41:45 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.5:63278 in memory (size: 869.0 B, free: 530.0 MB)
15/04/01 14:41:45 INFO ContextCleaner: Cleaned broadcast 0
15/04/01 14:41:46 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:46 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/04/01 14:41:47 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/04/01 14:41:47 INFO ObjectStore: Initialized ObjectStore
15/04/01 14:41:47 INFO HiveMetaStore: Added admin role in metastore
15/04/01 14:41:47 INFO HiveMetaStore: Added public role in metastore
15/04/01 14:41:48 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/04/01 14:41:48 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/04/01 14:41:49 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/01 14:41:49 INFO ParseDriver: Parse Completed
Exception in thread "main" java.lang.ClassNotFoundException: json_tuple
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:141)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:261)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:267)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:272)
        at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:272)
        at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:278)
        at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60)
        at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
        at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50)
        at scala.Option.map(Option.scala:145)
        at
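For what it's worth, json_tuple is implemented by Hive's GenericUDTFJSONTuple, which ships in the hive-exec jar, so a minimal sketch of the suggested fix looks like the following (the jar path and version are assumptions for illustration; since the lookup above fails on the driver side, the jar may also need to go on the driver classpath, e.g. via --driver-class-path):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// substitute the hive-exec jar from your own $HIVE/lib here
val conf = new SparkConf()
  .setAppName("json-tuple-check")
  .setJars(Seq("/usr/lib/hive/lib/hive-exec-0.13.1.jar"))
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)

hc.sql("""SELECT path, name, value, v1.peName, v1.peValue
          FROM metric
          LATERAL VIEW json_tuple(pathElements, 'name', 'value') v1 AS peName, peValue""")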
Re: StackOverflow Problem with 1.3 mllib ALS
I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
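A minimal sketch of the suggested fix (data loading and parameter values are illustrative assumptions); note that setCheckpointInterval only takes effect once a checkpoint directory is set on the SparkContext:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // required for checkpointing

val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(u, p, r) = line.split(",")
  Rating(u.toInt, p.toInt, r.toDouble)
}

val model = new ALS()
  .setRank(10)
  .setIterations(80)          // deep iteration counts are safe once checkpointing kicks in
  .setCheckpointInterval(10)  // truncate the lineage every 10 iterations (1.3.1+)
  .run(ratings)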
Re: Spark, snappy and HDFS
Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro, or you can do it manually like this. This is not Spark-specific. The second question also isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing, since it is not BytesWritable / Text.

On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote: I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can get past the previous error. The issue now seems to be with what is being returned.

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

returns the following error: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text

On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote: Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether the libsnappy.so is in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote: Thanks for the super quick response! I can read the file just fine in hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing.

On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote: Can you read snappy compressed files in hdfs? Looks like the libsnappy.so is not in the hadoop native lib path.

On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? *java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z* The following works for me when the file is uncompressed:

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

but fails when the encoding is Snappy. I've seen some stuff floating around on the web about having to explicitly enable support for Snappy in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-snappy-and-HDFS-tp22349.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
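A minimal sketch of the fix Sean describes, reading both sides as BytesWritable and converting explicitly afterwards (the UTF-8 decode of the value bytes is an assumption about the payload):

import org.apache.hadoop.io.BytesWritable

val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable, BytesWritable](hdfsPath)
// copyBytes returns a copy trimmed to the writable's actual length,
// which also sidesteps Hadoop's Writable object reuse
val decoded = file.map { case (k, v) => (k.copyBytes, new String(v.copyBytes, "UTF-8")) }
decoded.count()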
How to learn Spark ?
Hi all, I am new here. Could you give me some suggestions on how to learn Spark? Thanks. Best Regards, Star Guo
Support for Data flow graphs and not DAG only
Hey, I didn't find any documentation regarding support for cycles in a Spark topology, although Storm supports this using manual configuration in the acker function logic (setting it to a particular count). By cycles I don't mean infinite loops. -- Thanks Regards, Anshu Shukla
Re: StackOverflow Problem with 1.3 mllib ALS
Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
how to find near duplicate items from given dataset using spark
Hi All, I want to find near duplicate items from a given dataset. For e.g. consider the data set:

1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps
3. Football,goalie,midfielder,goal
4. Football,refree,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 is different) and 3 and 4 are near duplicates (only field 2 is different). This is what I did: I created an Article class and implemented the equals and hashCode methods (my hashCode method returns a constant (1) for all objects), and in Spark I am using Article as a key, doing a group by on the article. Is this approach correct, or is there a better approach? This is how my code looks.

Article class:

public class Article implements Serializable {
    private static final long serialVersionUID = 1L;
    private String first;
    private String second;
    private String third;
    private String fourth;

    public Article() {
        set("", "", "", "");
    }

    public Article(String first, String second, String third, String fourth) {
        // super();
        set(first, second, third, fourth);
    }

    @Override
    public int hashCode() {
        int result = 1;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null) return false;
        if (getClass() != obj.getClass()) return false;
        Article other = (Article) obj;
        if ((first.equals(other.first) || second.equals(other.second)
                || third.equals(other.third) || fourth.equals(other.fourth))) {
            return true;
        } else {
            return false;
        }
    }

    private void set(String first, String second, String third, String fourth) {
        this.first = first;
        this.second = second;
        this.third = third;
        this.fourth = fourth;
    }
}

Spark code:

public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile("data1/*");
    JavaRDD<Article> articles = lines.map(new Function<String, Article>() {
        private static final long serialVersionUID = 1L;

        public Article call(String line) throws Exception {
            String[] words = line.split(",");
            // System.out.println(line);
            Article article = new Article(words[0], words[1], words[2], words[3]);
            return article;
        }
    });
    JavaPairRDD<Article, String> articlePair = lines
            .mapToPair(new PairFunction<String, Article, String>() {
                public Tuple2<Article, String> call(String line) throws Exception {
                    String[] words = line.split(",");
                    // System.out.println(line);
                    Article article = new Article(words[0], words[1], words[2], words[3]);
                    return new Tuple2<Article, String>(article, line);
                }
            });
    JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair.groupByKey();
    Map<Article, Iterable<String>> dupArticles = articlePairs.collectAsMap();
    System.out.println("size {}" + dupArticles.size());
    Set<Article> uniqueArticle = dupArticles.keySet();
    for (Article article : uniqueArticle) {
        Iterable<String> temps = dupArticles.get(article);
        System.out.println("keys " + article);
        for (String string : temps) {
            System.out.println(string);
        }
        System.out.println("==");
    }
    ctx.close();
    ctx.stop();
}
Re: pyspark hbase range scan
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen

On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range scan.

conf = {
    "hbase.zookeeper.quorum": host,
    "hbase.mapreduce.inputtable": table,
    "hbase.mapreduce.scan": scan
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=conf)

If i jump over to scala or java and generate a base64 encoded protobuf scan object and convert it to a string, i can use that value for hbase.mapreduce.scan and everything works: the rdd will correctly perform the range scan and I am happy. The problem is that I can not find any reasonable way to generate that range scan string in python. The scala code required is:

import org.apache.hadoop.hbase.util.Base64
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put, Result => HBaseResult, Scan}

val scan = new Scan()
scan.setStartRow("test_domain\0email".getBytes)
scan.setStopRow("test_domain\0email~".getBytes)
def scanToString(scan: Scan): String = {
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray())
}
scanToString(scan)

Is there another way to perform an hbase range scan from pyspark, or is that functionality something that might be supported in the future?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Streaming anomaly detection using ARIMA
This inside out parallelization has been a way people have used R with MapReduce for a long time. Run N copies of an R script on the cluster, on different subsets of the data, babysat by Mappers. You just need R installed on the cluster. Hadoop Streaming makes this easy, and things like RDD.pipe in Spark make it easier. So it may be just that simple, and so there's not much to say about it. I haven't tried this with Spark Streaming but imagine it would also work. Have you tried this? Within a window you would probably take the first x% as training and the rest as test. I don't think there's a question of looking across windows.

On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote: Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the other way: what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster. Also, has anyone used Scalation for ARIMA models?

On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote: Taking out the complexity of the ARIMA models to simplify things: I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API.

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote: I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several years). I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time streaming and I'm having a tad bit of trouble gaining insight into how exactly the windows are managed inside of DStreams. Let's say I have a simple dataset that is marked by a key/value tuple where the key is the name of the component whose metrics I want to run the algorithm against and the value is a metric (a value representing a sum for the time bucket). I want to create histograms of the time series data for each key in the windows in which they reside, so I can use that histogram vector to generate my ARIMA prediction (actually, it seems like this doesn't just apply to ARIMA but could apply to any sliding average). I *think* my prediction code may look something like this:

val predictionAverages = dstream
  .groupByKeyAndWindow(60*60*24, 60*60*24)
  .mapValues(applyARIMAFunction)

That is, keep 24 hours worth of metrics in each window and use that for the ARIMA prediction. The part I'm struggling with is how to join together the actual values so that I can do my comparison against the prediction model. Let's say dstream contains the actual values. For any time window, I should be able to take a previous set of windows and use the model to compare against the current values.

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
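To make the RDD.pipe route concrete, here is a minimal sketch (the script path, record layout, and output format are assumptions; the R script must exist on every worker and speak line-oriented stdin/stdout):

// each partition is streamed through an external R process, one record per line
val metrics = sc.textFile("hdfs:///metrics/2015-04-01")
// arima.R reads series from stdin and writes one forecast per series to stdout
val forecasts = metrics.pipe("Rscript /opt/jobs/arima.R")
forecasts.saveAsTextFile("hdfs:///forecasts/2015-04-01")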
JAVA_HOME problem
spark 1.3.0

spark@pc-zjqdyyn1:~ tail /etc/profile
export JAVA_HOME=/usr/jdk64/jdk1.7.0_45
export PATH=$PATH:$JAVA_HOME/bin
#
# End of /etc/profile
#

But the ERROR LOG shows:

Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454
LogType: stderr
LogLength: 61
Log Contents:
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

LogType: stdout
LogLength: 0
Log Contents:
Re: Spark throws rsync: change_dir errors on startup
Hi, Verbose output showed no additional information about the origin of the error:

rsync from right
sending incremental file list
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.master.Master-1-cl-pc6.out
left: rsync from right
left: rsync: change_dir /usr/local/spark130//right failed: No such file or directory (2)
left: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]
left: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.worker.Worker-1-cl-pc5.out
right: rsync from right
right: sending incremental file list
right: rsync: change_dir /usr/local/spark130//right failed: No such file or directory (2)
right:
right: sent 20 bytes received 12 bytes 64.00 bytes/sec
right: total size is 0 speedup is 0.00
right: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]
right: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark130/sbin/../logs/spark-huser-org.apache.spark.deploy.worker.Worker-1-cl-pc6.out

I also edited the script to remove the additional slash, but this did not help either. The workers are basically started by the script; it is just this error message that is thrown. Now, funny thing: I was so brave as to create the folder //right that Spark is desperately looking for. Guess what, this caused a complete wipe of my local Spark installation. /usr/local/spark130 was cleaned completely except for the logs folder. Any suggestions what is happening here?

From: Akhil Das ak...@sigmoidanalytics.com Date: Thursday, 2 April 2015 07:51 To: Tobias Horsmann tobias.horsm...@uni-due.de Cc: user@spark.apache.org Subject: Re: Spark throws rsync: change_dir errors on startup

Error 23 is defined as a partial transfer and might be caused by filesystem incompatibilities, such as different character sets or access control lists. In this case it could be caused by the double slashes (// at the end of sbin). You could try editing your sbin/spark-daemon.sh file, look for rsync inside the file, and add -v along with that command to see what exactly is going wrong. Thanks Best Regards

On Wed, Apr 1, 2015 at 7:25 PM, Horsmann, Tobias tobias.horsm...@uni-due.de wrote: Hi, I am trying to set up a minimal 2-node spark cluster for testing purposes. When I start the cluster with start-all.sh I get an rsync error message:

rsync: change_dir /usr/local/spark130/sbin//right failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]

(For clarification, my 2 nodes are called 'right' and 'left', referring to the physical machines standing in front of me.) It seems that a file named after my master node 'right' is expected to exist, and the synchronisation with it fails as it does not exist. I don't understand what Spark is trying to do here. Why does it expect this file to exist, and what content should it have? I assume I did something wrong in my configuration setup. Can someone interpret this error message and has an idea where this error is coming from? Regards, Tobias
Re: StackOverflow Problem with 1.3 mllib ALS
Fair enough but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote: Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Starting httpd: http: Syntax error on line 154
I’m unable to access ganglia. It looks like it is due to the web server not starting, as I receive this error when I launch spark:

Starting httpd: http: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so

This occurs when I’m using the vanilla script. I’ve also tried modifying my spark-ec2 script in various ways in an effort to correct this problem, including using different instance types and modifying the instance virtualization types. Thanks for any help!
StackOverflow Problem with 1.3 mllib ALS
Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin
Setup Spark jobserver for Spark SQL
Hi, I am trying to use Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) for running Spark SQL jobs. I was able to start the server, but when I run my application (my Scala class which extends SparkSqlJob), I am getting the following as response:

{
  "status": "ERROR",
  "result": "Invalid job type for this context"
}

Can anyone suggest what is going wrong, or provide a detailed procedure for setting up jobserver for SparkSQL?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Setup-Spark-jobserver-for-Spark-SQL-tp22352.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame
I reproduced the bug on master and submitted a patch for it: https://github.com/apache/spark/pull/5329. It may get into Spark 1.3.1. Thanks for reporting the bug! -Xiangrui

On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, I got the same error with the master. Here is another test example that fails. Here, I explicitly create a Row RDD, which corresponds to the use case I am in:

object TestDataFrame {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestDataFrame").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = Seq(LabeledPoint(1, Vectors.zeros(10)))
    val dataDF = sc.parallelize(data).toDF
    dataDF.printSchema()
    dataDF.save("test1.parquet") // OK
    val dataRow = data.map { case LabeledPoint(l: Double, f: mllib.linalg.Vector) => Row(l, f) }
    val dataRowRDD = sc.parallelize(dataRow)
    val dataDF2 = sqlContext.createDataFrame(dataRowRDD, dataDF.schema)
    dataDF2.printSchema()
    dataDF2.saveAsParquetFile("test3.parquet") // FAIL !!!
  }
}

On Tue, Mar 31, 2015 at 11:18 PM, Xiangrui Meng men...@gmail.com wrote: I cannot reproduce this error on master, but I'm not aware of any recent bug fixes that are related. Could you build and try the current master? -Xiangrui

On Tue, Mar 31, 2015 at 4:10 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, a DataFrame with a user defined type (here mllib.Vector) created with sqlContext.createDataFrame can't be saved to a parquet file, and raises a ClassCastException: org.apache.spark.mllib.linalg.DenseVector cannot be cast to org.apache.spark.sql.Row error. Here is an example of code to reproduce this error:

object TestDataFrame {
  def main(args: Array[String]): Unit = {
    //System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
    val conf = new SparkConf().setAppName("RankingEval").setMaster("local[8]")
      .set("spark.executor.memory", "6g")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val data = sc.parallelize(Seq(LabeledPoint(1, Vectors.zeros(10))))
    val dataDF = data.toDF
    dataDF.save("test1.parquet")
    val dataDF2 = sqlContext.createDataFrame(dataDF.rdd, dataDF.schema)
    dataDF2.save("test2.parquet")
  }
}

Is this related to https://issues.apache.org/jira/browse/SPARK-5532, and how can it be solved? Cheers, Jao

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: HiveContext setConf seems not stable
Hi, Jira created: https://issues.apache.org/jira/browse/SPARK-6675 Thank you.

On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com wrote: Can you open a JIRA please?

On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote: Hi, I find HiveContext.setConf does not work correctly. Here are some code snippets showing the problem:

snippet 1:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
  val conf = new SparkConf()
    .setAppName("context-test")
    .setMaster("local[8]")
  val sc = new SparkContext(conf)
  val hc = new HiveContext(sc)
  hc.setConf("spark.sql.shuffle.partitions", "10")
  hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
  hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
  hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
}

Results:

(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)
(spark.sql.shuffle.partitions,10)

snippet 2:

...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("spark.sql.shuffle.partitions", "10")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
hc.getAllConfs filter(_._1.contains("shuffle.partitions")) foreach println
...

Results:

(hive.metastore.warehouse.dir,/user/hive/warehouse)
(spark.sql.shuffle.partitions,10)

You can see that I just permuted the two setConf calls, and that leads to two different Hive configurations. It seems that HiveContext cannot set a new value on the hive.metastore.warehouse.dir key in one or the first setConf call. You need another setConf call before changing hive.metastore.warehouse.dir. For example, set hive.metastore.warehouse.dir twice and you get the result of snippet 1:

snippet 3:

...
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.setConf("hive.metastore.warehouse.dir", "/home/spark/hive/warehouse_test")
hc.getAllConfs filter(_._1.contains("warehouse.dir")) foreach println
...

Results:

(hive.metastore.warehouse.dir,/home/spark/hive/warehouse_test)

You can reproduce this if you move to the latest branch-1.3 (1.3.1-snapshot, htag = 7d029cb1eb6f1df1bce1a3f5784fb7ce2f981a33). I have also tested the released 1.3.0 (htag = 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc). It has the same problem. Please tell me if I am missing something. Any help is highly appreciated.

-- Hao Ren {Data, Software} Engineer @ ClaraVista Paris, France
Re: From DataFrame to LabeledPoint
Peter's suggestion sounds good, but watch out for the match case, since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) =>

On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, try next code:

val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map {
  case Row(feature1, feature2, ..., label) =>
    LabeledPoint(label, Vectors.dense(feature1, feature2, ...))
}

Thanks, Peter Rudenko

On 2015-04-02 17:17, drarse wrote: Hello! I have a question from days ago. I am working with DataFrame and with Spark SQL. I imported a jsonFile:

val df = sqlContext.jsonFile("file.json")

In this json I have the label and the features. I selected them:

val features = df.select("feature1", "feature2", "feature3", ...)
val labels = df.select("classification")

But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/From-DataFrame-to-LabeledPoint-tp22354.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
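Putting the two suggestions together, a minimal end-to-end sketch (column names and the all-Double typing are assumptions; jsonFile may infer bigint/Long for integral fields, so adjust the pattern's types to the printed schema). Selecting the label and features in one pass avoids the zip entirely:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val df = sqlContext.jsonFile("file.json")
val labeledPoints: RDD[LabeledPoint] =
  df.select("classification", "feature1", "feature2", "feature3").map {
    case Row(label: Double, f1: Double, f2: Double, f3: Double) =>
      LabeledPoint(label, Vectors.dense(f1, f2, f3))
  }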
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
Do you have a full stack trace?

On Thu, Apr 2, 2015 at 11:45 AM, ogoh oke...@gmail.com wrote: Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically, since the schema can only be known at runtime. With spark 1.2.1, it worked fine (I followed https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema). I am trying to migrate to spark 1.3.0, but the APIs are confusing. I am not sure if the example of https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema is still valid on Spark 1.3.0. For example, DataType.StringType is not there any more; instead, I found DataTypes.StringType etc. So, I migrated as below and it builds fine. But at runtime, it throws an exception. I appreciate any help. Thanks, Okehee

== Exception thrown:

java.lang.reflect.InvocationTargetException
java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String;

== my code's snippet:

import org.apache.spark.sql.types.DataTypes;
DataTypes.createStructField("property", DataTypes.IntegerType, true)

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-schema-in-Spark-1-3-failed-while-using-DataTypes-tp22362.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: persist(MEMORY_ONLY) takes lot of time
+1. Caching is way too slow.

On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a parquet dataset of 550 MB (9 blocks) in HDFS. I want to run SQL queries repetitively. Few questions:

1. When I do the below (persist to memory after reading from disk), it takes a lot of time to persist to memory. Any suggestions on how to tune this?

val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")
inputP.persist(MEMORY_ONLY)
val result = sqlContext.sql("some sql query")
result.count

Note: Once the data is persisted to memory, it takes a fraction of a second to return query results from the second query onwards. So my concern is how to reduce the time when the data is first loaded to cache.

2. I have observed that if I omit the line inputP.persist(MEMORY_ONLY), the first query execution is comparatively quick (say it takes 1 min), as the load-to-memory time is saved, but to my surprise the second time I run the same query it takes 30 sec, as inputP is not reconstructed from disk (checked from UI). So my question is: does Spark use some kind of internal caching for inputP in this scenario?

Thanks in advance. Regards, Sam

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/persist-MEMORY-ONLY-takes-lot-of-time-tp22343.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

-- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd

- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
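Not from the thread, but a variant worth trying on the first question: Spark SQL's columnar in-memory cache via cacheTable, which compresses columns and is usually a better match for repeated SQL than RDD-level persist on the row objects. A minimal sketch (paths and query are placeholders):

val inputP = sqlContext.parquetFile("some HDFS path")
inputP.registerTempTable("sample_table")
sqlContext.cacheTable("sample_table")   // lazy: materialized by the first query
val result = sqlContext.sql("some sql query")
result.count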
Re: Data locality across jobs
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records corresponding to a particular partition at the end of the first job can end up split across multiple partitions in the second job. -Sandy On Wed, Apr 1, 2015 at 9:09 PM, kjsingh kanwaljit.si...@guavus.com wrote: Hi, We are running an hourly job using Spark 1.2 on Yarn. It saves an RDD of Tuple2. At the end of day, a daily job is launched, which works on the outputs of the hourly jobs. For data locality and speed, we wish that when the daily job launches, it finds all instances of a given key at a single executor rather than fetching it from others during shuffle. Is it possible to maintain key partitioning across jobs? We can control partitioning in one job. But how do we send keys to the executors of same node manager across jobs? And while saving data to HDFS, are the blocks allocated to the same data node machine as the executor for a partition? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-across-jobs-tp22351.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Reading a large file (binary) into RDD
Hm, that will indeed be trickier, because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once read in, that would be useful!

- jeremyfreeman.net @thefreemanlab

On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to read the entire file.

On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help?

- jeremyfreeman.net @thefreemanlab

On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
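For the structured mixed-type case, binaryRecords still applies as long as every record has the same total byte size. A hedged Scala sketch, assuming a layout of one short followed by one long per record (layout, endianness, and path are assumptions for illustration):

import java.nio.{ByteBuffer, ByteOrder}

val recordSize = 2 + 8   // bytes per record: one short + one long
val records = sc.binaryRecords("hdfs:///data/mixed.bin", recordSize)
val parsed = records.map { bytes =>
  // each element is one record's raw bytes; decode the fields in order
  val bb = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
  (bb.getShort, bb.getLong)
}
parsed.first()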
RE: Date and decimal datatype not working
Thanks all. Finally I am able to run my code successfully. It is running in Spark 1.2.1. I will try it on Spark 1.3 too. The major cause of all the errors I faced was that the delimiter was not correctly declared: String.split takes a regular expression, and an unescaped | is regex alternation, so it must be escaped. I was using:

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

Now I am using the following, and that solved most of the issues:

val Delimeter = "\\|"
val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split(Delimeter)).map(p => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

Thanks again. My first code ran successfully, giving me some confidence; now I will explore more. Regards, Ananda

From: BASAK, ANANDA Sent: Thursday, March 26, 2015 4:55 PM To: Dean Wampler Cc: Yin Huai; user@spark.apache.org Subject: RE: Date and decimal datatype not working

Thanks all. I am installing Spark 1.3 now. Thought that I should better sync with the daily evolution of this new technology. So once I install that, I will try to use the Spark-CSV library. Regards, Ananda

From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, March 25, 2015 1:17 PM To: BASAK, ANANDA Cc: Yin Huai; user@spark.apache.org Subject: Re: Date and decimal datatype not working

Recall that the input isn't actually read until you do something that forces evaluation, like call saveAsTextFile. You didn't show the whole stack trace here, but it probably occurred while parsing an input line where one of your long fields is actually an empty string. Because this is such a common problem, I usually define a parse method that converts input text to the desired schema. It catches parse exceptions like this and reports the bad line at least. If you can return a default long in this case, say 0, that makes it easier to return something. dean

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (http://shop.oreilly.com/product/0636920033073.do) (O'Reilly) Typesafe (http://typesafe.com) @deanwampler (http://twitter.com/deanwampler) http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 11:48 AM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This library is only available with Spark 1.3. I am using version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1. So I am using the following:

val MyDataset = sqlContext.sql("my select query")
MyDataset.map(t => t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")

But it is giving the following error:

15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106)
java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:453)
        at java.lang.Long.parseLong(Long.java:483)
        at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)

Is there something wrong with the TSTAMP field, which is Long datatype? Thanks Regards --- Ananda Basak

From: Yin Huai [mailto:yh...@databricks.com] Sent: Monday, March 23, 2015 8:55 PM To: BASAK, ANANDA Cc: user@spark.apache.org Subject: Re: Date and decimal datatype not working

To store to a csv file, you can use the Spark-CSV library (https://github.com/databricks/spark-csv).

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA ab9...@att.com wrote: Thanks. This worked well as per your suggestions.
I had to run the following:

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6))))

Now I am stuck at another step. I have run a SQL query where I am selecting all the fields with some where clause, with TSTAMP filtered by a date range and an order by TSTAMP clause. That is running fine. Then I am trying to store the output in a CSV file. I am using the saveAsTextFile("filename") function. But it is giving an error. Can you please help me to write a proper syntax to store output in a CSV file?

Thanks Regards --- Ananda Basak

From: BASAK, ANANDA Sent: Tuesday, March 17, 2015 3:08 PM To: Yin Huai Cc: user@spark.apache.org Subject: RE: Date and decimal datatype not working

Ok, thanks for the suggestions. Let me try and will confirm all. Regards, Ananda

From: Yin Huai [mailto:yh...@databricks.com] Sent: Tuesday, March 17, 2015 3:04 PM To: BASAK, ANANDA Cc: user@spark.apache.org Subject: Re: Date and decimal datatype not working

p(0) is a String. So, you need to explicitly convert it to a Long, e.g. p(0).trim.toLong. You also need to do it for p(2). For those BigDecimal values, you need to create BigDecimal objects from your String values.

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA
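As an aside on the CSV-writing step, not from the thread: since each query result is a Row, and Row exposes a Seq-like mkString in these versions, the manual +"|"+ chaining (and its per-column indexing) can be avoided. A minimal sketch (query and path are placeholders):

val MyDataset = sqlContext.sql("my select query")
// join every column with the delimiter, whatever the column count
MyDataset.map(_.mkString("|")).saveAsTextFile("/my_destination_path")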
Need a spark mllib tutorial
Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced to the spark documentation example. But, I do not find any. Need help. Regards Phani Kumar
Re: Need a spark mllib tutorial
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced to the spark documentation example. But, I do not find any. Need help. Regards Phani Kumar
Re: Submitting to a cluster behind a VPN, configuring different IP address
I was able to hack this on my similar setup by running (on the driver):

$ sudo hostname ip

where ip is the same value set in the spark.driver.host property. This isn't a solution I would use universally, and I hope someone can fix this bug in the distribution. Regards, Mike

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22363.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Mllib kmeans #iteration
Check out the Spark docs for that parameter, maxIterations: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote: Hello, I am running the Kmeans algorithm in cluster mode from MLlib, and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
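A minimal sketch of using it (input path, k, and the bound are illustrative assumptions); note that k-means may converge and stop earlier, but it never runs past maxIterations:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("hdfs:///kmeans/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val model = new KMeans()
  .setK(8)
  .setMaxIterations(20)   // the fixed iteration cap the question asks about
  .run(points)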
Re: input size too large | Performance issues with Spark
To Akhil's point, see Tuning Data structures. Avoid standard collection hashmap. With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Ought to reduce shuffle performance hit (someone else confirm?) #7 see default.shuffle.partitions (default: 200) On Sun, Mar 29, 2015 at 7:57 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Go through this once, if you haven't read it already. https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Sat, Mar 28, 2015 at 7:33 PM, nsareen nsar...@gmail.com wrote: Hi All, I'm facing performance issues with spark implementation, and was briefly investigating on WebUI logs, i noticed that my RDD size is 55GB the Shuffle Write is 10 GB Input Size is 200GB. Application is a web application which does predictive analytics, so we keep most of our data in memory. This observation was only for 30mins usage of the application on a single user. We anticipate atleast 10-15 users of the application sending requests in parallel, which makes me a bit nervous. One constraint we have is that we do not have too many nodes in a cluster, we may end up with 3-4 machines at best, but they can be scaled up vertically each having 24 cores / 512 GB ram etc. which can allow us to make a virtual 10-15 node cluster. Even then the input size shuffle write is too high for my liking. Any suggestions in this regard will be greatly appreciated as there aren't much resource on the net for handling performance issues such as these. Some pointers on my application's data structures design 1) RDD is a JavaPairRDD with Key / Value as CustomPOJO containing 3-4 Hashmaps Value containing 1 Hashmap 2) Data is loaded via JDBCRDD during application startup, which also tends to take a lot of time, since we massage the data once it is fetched from DB and then save it as JavaPairRDD. 3) Most of the data is structured, but we are still using JavaPairRDD, have not explored the option of Spark SQL though. 4) We have only one SparkContext which caters to all the requests coming into the application from various users. 5) During a single user session user can send 3-4 parallel stages consisting of Map / Group By / Join / Reduce etc. 6) We have to change the RDD structure using different types of group by operations since the user can do drill down drill up of the data ( aggregation at a higher / lower level). This is where we make use of Groupby's but there is a cost associated with this. 7) We have observed, that the initial RDD's we create have 40 odd partitions, but post some stage executions like groupby's the partitions increase to 200 or so, this was odd, and we havn't figured out why this happens. In summary we wan to use Spark to provide us the capability to process our in-memory data structure very fast as well as scale to a larger volume when required in the future. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/input-size-too-large-Performance-issues-with-Spark-tp22270.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Christian Perez Silicon Valley Data Science Data Analyst christ...@svds.com @cp_phd - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
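A hedged sketch of what that executor shape might look like in configuration (every value here is an assumption to tune against the 24-core / 512 GB nodes described, not a prescription):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("predictive-analytics")
  .set("spark.executor.cores", "5")            // a few fat executors per node
  .set("spark.executor.memory", "24g")
  .set("spark.default.parallelism", "48")      // partitions for RDD shuffles
  .set("spark.sql.shuffle.partitions", "200")  // partitions for Spark SQL shuffles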
Re: Submitting to a cluster behind a VPN, configuring different IP address
yup a related JIRA is here https://issues.apache.org/jira/browse/SPARK-5113 which you might want to leave a comment in. This can be quite tricky we found ! but there are a host of env variable hacks you can use when launching spark masters/slaves. On Thu, Apr 2, 2015 at 5:18 PM, Michael Quinlan mq0...@gmail.com wrote: I was able to hack this on my similar setup issue by running (on the driver) $ sudo hostname ip Where ip is the same value set in the spark.driver.host property. This isn't a solution I would use universally and hope the someone can fix this bug in the distribution. Regards, Mike -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22363.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- jay vyas
RE: Spark SQL. Memory consumption
It is hard to say what could be the reason without more detailed information. If you provide some more information, maybe people here can help you better.

1) What is your worker's memory setting? It looks like your nodes have 128G physical memory each, but what do you specify for the worker's heap size? If you can paste your spark-env.sh and spark-defaults.conf content here, it will be helpful.
2) You are doing a join with 2 tables. 8G of parquet files is small, compared to the heap you gave. But is it for one table? 2 tables? Is the data compressed?
3) Your join key is different from your grouping keys, so my assumption is that this query should lead to 4 stages (I could be wrong, as I am kind of new to Spark SQL too). Is that right? If so, on what stage did the OOM happen? With this information, it can help us to better judge which part caused the OOM.
4) When you set spark.shuffle.partitions to 1024, did stages 3 and 4 really create 1024 tasks?
5) When the OOM happens, at least you can paste the stack trace of the OOM, as it will help people here to guess which part of Spark led to the OOM, and so give you better suggestions.

Thanks, Yong

Date: Thu, 2 Apr 2015 17:46:48 +0200 Subject: Spark SQL. Memory consumption From: masfwo...@gmail.com To: user@spark.apache.org

Hi. I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT field1
,field2
,field3
,field4
,field5
,COUNT(1) AS field6
,MAX(field7)
,MIN(field8)
,SUM(field9 / 100)
,COUNT(field10)
,SUM(IF(field11 -500, 1, 0))
,MAX(field12)
,SUM(IF(field13 = 1, 1, 0))
,SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0))
,SUM(IF(field13 = 2012 , 1, 0))
,SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE AND field3 IS NOT NULL
AND field4 IS NOT NULL
AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g --class GMain project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error

Input data is 8GB in parquet format. Many times it crashes with GC overhead. I've fixed spark.shuffle.partitions to 1024, but my worker nodes (with 128GB RAM/node) collapse. Is this query too difficult for Spark SQL? Would it be better to do it in plain Spark? Am I doing something wrong?

Thanks. -- Regards. Miguel Ángel
RE: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));
Michael, thanks for the response, and looking forward to trying 1.3.1.

From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

Thanks for reporting. The root cause is SPARK-5632 (https://issues.apache.org/jira/browse/SPARK-5632), which is actually pretty hard to fix. Fortunately, for this particular case there is an easy workaround: https://github.com/apache/spark/pull/5337 We can try to include this in 1.3.1.

On Thu, Apr 2, 2015 at 3:29 AM, Haopu Wang hw...@qilinsoft.com wrote: Hi, I want to rename an aggregation field using the DataFrame API. The aggregation is done on a nested field. But I got the below exception. Do you see the same issue, and is there any workaround? Thank you very much!

==

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
        at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
        at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
        at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:244)
        at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:243)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.sql.DataFrame.toDF(DataFrame.scala:243)

==

And this code can be used to reproduce the issue:

case class ChildClass(q: Long)
case class ParentClass(k: String, p: ChildClass)

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
  val ctx = new SparkContext(conf)
  val sqlCtx = new HiveContext(ctx)
  import sqlCtx.implicits._
  val source = ctx.makeRDD(Seq(ParentClass("c1", ChildClass(100)))).toDF
  import org.apache.spark.sql.functions._
  val target = source.groupBy('k).agg('k, sum("p.q"))
  // This line prints the correct contents
  // k SUM('p.q)
  // c1 100
  target.show
  // But this line triggers the exception
  target.toDF("key", "total")
}

==
Re: Cannot run the example in the Spark 1.3.0 following the document
Hi there, you may need to add:

import sqlContext.implicits._

Best, Sun fightf...@163.com

From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document

I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or does my build have some problem? Thanks
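For reference, the failing call mixes a String argument with a Column argument, so neither select overload applies; with the implicits in scope, the $-interpolator (or df("name")) makes every argument a Column, and the Column* overload matches. A minimal sketch:

import sqlContext.implicits._

// every argument is a Column, so select(cols: Column*) applies
df.select($"name", df("age") + 1).show()

// equivalent, without the implicits import:
df.select(df("name"), df("age") + 1).show()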
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote: yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.init(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.init(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.init(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.init(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.init(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.init(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.init(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.init(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.init(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.clinit(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.clinit(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at 
org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at com.sun.proxy.$Proxy2.stop(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.stop(TestWorker.java:115) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355) at
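For anyone debugging a mismatch like this, a small diagnostic sketch (mine, not from the thread) to confirm which Scala runtime and scala-reflect jar actually got loaded:

// Should print the 2.10.x version that Spark was built against.
println(scala.util.Properties.versionString)

// Locate the jar that scala-reflect was loaded from
// (getCodeSource can be null for boot-classpath jars).
val src = classOf[scala.reflect.internal.SymbolTable]
  .getProtectionDomain.getCodeSource
println(Option(src).map(_.getLocation).getOrElse("boot classpath"))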
Cannot run the example in the Spark 1.3.0 following the document
I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
RE: ArrayBuffer within a DataFrame
Hint: DF.rdd.map{} Mohammed From: Denny Lee [mailto:denny.g@gmail.com] Sent: Thursday, April 2, 2015 7:10 PM To: user@spark.apache.org Subject: ArrayBuffer within a DataFrame Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
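Expanding on the DF.rdd.map{} hint, one possible sketch (the column positions and element type here are assumptions, since the original schema isn't shown):

// Each row looks like [month: String, values: Seq[String]];
// emit one (month, value) pair per element of the buffer.
val exploded = df.rdd.flatMap { row =>
  val month = row.getString(0)
  row.getAs[Seq[String]](1).map(v => (month, v))
}
exploded.collect().foreach { case (m, v) => println(s"$m, $v") }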
Re: Spark Streaming Worker runs out of inodes
Yes, with spark.cleaner.ttl set there is no cleanup. We pass --properties-file spark-dev.conf to spark-submit, where spark-dev.conf contains:

spark.master spark://10.250.241.66:7077
spark.logConf true
spark.cleaner.ttl 1800
spark.executor.memory 10709m
spark.cores.max 4
spark.shuffle.consolidateFiles true

On Thu, Apr 2, 2015 at 7:12 PM, Tathagata Das t...@databricks.com wrote: Are you saying that even with spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state mgmt is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it.

SPARK_WORKER_OPTS in spark-env.sh on every worker node:
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval
spark.worker.cleanup.appDataTtl

Also tried on the driver side:
spark.cleaner.ttl
spark.shuffle.consolidateFiles true

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
Michael, You are right. The build brought in org.scala-lang:scala-library:2.10.1 from another package (as below). It works fine after excluding the old scala version. Thanks a lot, Okehee

==
dependency:
|    +--- org.apache.kafka:kafka_2.10:0.8.1.1
|    |    +--- com.yammer.metrics:metrics-core:2.2.0
|    |    |    \--- org.slf4j:slf4j-api:1.7.2 -> 1.7.7
|    |    +--- org.xerial.snappy:snappy-java:1.0.5
|    |    +--- org.apache.zookeeper:zookeeper:3.3.4 -> 3.4.5
|    |    |    +--- org.slf4j:slf4j-api:1.6.1 -> 1.7.7
|    |    |    +--- log4j:log4j:1.2.15
|    |    |    +--- jline:jline:0.9.94
|    |    |    \--- org.jboss.netty:netty:3.2.2.Final
|    |    +--- net.sf.jopt-simple:jopt-simple:3.2 -> 4.6
|    |    +--- org.scala-lang:scala-library:2.10.1

On Thu, Apr 2, 2015 at 4:45 PM, Michael Armbrust mich...@databricks.com wrote: This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote: yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.<init>(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.<init>(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.<init>(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.<init>(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.<init>(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.<init>(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.<init>(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.<init>(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.<init>(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.<clinit>(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.<clinit>(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at
org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at
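For reference, a sketch of the same exclusion in sbt (the tree above is from Gradle; these sbt coordinates are my assumption, not Okehee's actual build file):

// build.sbt: stop kafka from dragging in its own, older scala-library
libraryDependencies +=
  "org.apache.kafka" %% "kafka" % "0.8.1.1" exclude("org.scala-lang", "scala-library")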
maven compile error
Hi all, I just checked out spark-1.2 from GitHub and wanted to build it with Maven; however, I encountered an error during compilation: [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84) at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59) at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183) at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161) at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320) at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156) at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537) at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196) at org.apache.maven.cli.MavenCli.main(MavenCli.java:141) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290) at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230) at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409) at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352) Caused by: org.apache.maven.plugin.MojoExecutionException: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:490) at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:101) at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209) ... 19 more Caused by: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172) at scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:183) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:184) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1024) at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1023) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1153) at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1152) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1196) at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1261) at scala.tools.nsc.Global$Run.init(Global.scala:1290) at xsbt.CachedCompiler0$$anon$2.init(CompilerInterface.scala:113) at xsbt.CachedCompiler0.run(CompilerInterface.scala:113) at xsbt.CachedCompiler0.run(CompilerInterface.scala:99) at xsbt.CompilerInterface.run(CompilerInterface.scala:27) at
[SQL] Simple DataFrame questions
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package.

1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not a setter on DF.

2. I am trying to use the syntax used in various blog posts but can't figure out how to reference a column by name:

scala> df.filter("customer_id" != "")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              df.filter("customer_id" != "")

3. what would be the recommended way to drop a row containing a null value -- is it possible to do this:

scala> df.filter("customer_id IS NOT NULL")
RE: Cannot run the example in the Spark 1.3.0 following the document
The import command was already run. Forgot to mention: the rest of the examples related to df all work, just this one caused a problem. Thanks Yong Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the Spark 1.3.0 following the document Hi, there you may need to add: import sqlContext.implicits._ Best, Sun fightf...@163.com From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I wanted to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64

scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
              df.select("name", df("age") + 1).show()
                 ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
Re: Generating a schema in Spark 1.3 failed while using DataTypes.
yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.init(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.init(StdNames.scala:203) at scala.reflect.internal.StdNames$TermNames.init(StdNames.scala:288) at scala.reflect.internal.StdNames$nme$.init(StdNames.scala:1045) at scala.reflect.internal.SymbolTable.nme$lzycompute(SymbolTable.scala:16) at scala.reflect.internal.SymbolTable.nme(SymbolTable.scala:16) at scala.reflect.internal.StdNames$class.$init$(StdNames.scala:1041) at scala.reflect.internal.SymbolTable.init(SymbolTable.scala:16) at scala.reflect.runtime.JavaUniverse.init(JavaUniverse.scala:16) at scala.reflect.runtime.package$.universe$lzycompute(package.scala:17) at scala.reflect.runtime.package$.universe(package.scala:17) at org.apache.spark.sql.types.NativeType.init(dataTypes.scala:337) at org.apache.spark.sql.types.StringType.init(dataTypes.scala:351) at org.apache.spark.sql.types.StringType$.init(dataTypes.scala:367) at org.apache.spark.sql.types.StringType$.clinit(dataTypes.scala) at org.apache.spark.sql.types.DataTypes.clinit(DataTypes.java:30) at com.quixey.dataengine.dataprocess.parser.ToTableRecord.generateTableSchemaForSchemaRDD(ToTableRecord.java:282) at com.quixey.dataengine.dataprocess.parser.ToUDMTest.generateTableSchemaTest(ToUDMTest.java:132) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85) at org.testng.internal.Invoker.invokeMethod(Invoker.java:696) at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:882) at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1189) at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:124) at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108) at org.testng.TestRunner.privateRun(TestRunner.java:767) at org.testng.TestRunner.run(TestRunner.java:617) at org.testng.SuiteRunner.runTest(SuiteRunner.java:348) at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:343) at org.testng.SuiteRunner.privateRun(SuiteRunner.java:305) at org.testng.SuiteRunner.run(SuiteRunner.java:254) at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52) at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86) at org.testng.TestNG.runSuitesSequentially(TestNG.java:1224) at org.testng.TestNG.runSuitesLocally(TestNG.java:1149) at org.testng.TestNG.run(TestNG.java:1057) at org.gradle.api.internal.tasks.testing.testng.TestNGTestClassProcessor.stop(TestNGTestClassProcessor.java:115) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.stop(SuiteTestClassProcessor.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at 
org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at com.sun.proxy.$Proxy2.stop(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.stop(TestWorker.java:115) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355) at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64)
Re: Spark SQL does not read from cached table if table is renamed
I'll add we just back ported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust mich...@databricks.com wrote: This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hi all, Noticed a bug in my current version of Spark 1.2.1. After a table is cached with “cache table table” command, query will not read from memory if SQL query renames the table. This query reads from in memory table i.e. select hivesampletable.country from default.hivesampletable group by hivesampletable.country This query with renamed table reads from hive i.e. select table.country from default.hivesampletable table group by table.country Is this a known bug? Most BI tools rename tables to avoid table name collision. Thanks, Judy
Re: Spark-events does not exist error, while it does with all the req. rights
FYI I wrote a small test to try to reproduce this, and filed SPARK-6688 to track the fix. On Tue, Mar 31, 2015 at 1:15 PM, Marcelo Vanzin van...@cloudera.com wrote: Hmmm... could you try to set the log dir to file:/home/hduser/spark/spark-events? I checked the code and it might be the case that the behaviour changed between 1.2 and 1.3... On Mon, Mar 30, 2015 at 6:44 PM, Tom Hubregtsen thubregt...@gmail.com wrote: The stack trace for the first scenario and your suggested improvement is similar, with as only difference the first line (Sorry for not including this): Log directory /home/hduser/spark/spark-events does not exist. To verify your premises, I cd'ed into the directory by copy pasting the path listed in the error message (i, ii), created a text file, closed it an viewed it, and deleted it (iii). My findings were reconfirmed by my colleague. Any other ideas? Thanks, Tom On 30 March 2015 at 19:19, Marcelo Vanzin van...@cloudera.com wrote: So, the error below is still showing the invalid configuration. You mentioned in the other e-mails that you also changed the configuration, and that the directory really, really exists. Given the exception below, the only ways you'd get the error with a valid configuration would be if (i) the directory didn't exist, (ii) it existed but the user could not navigate to it or (iii) it existed but was not actually a directory. So please double-check all that. On Mon, Mar 30, 2015 at 5:11 PM, Tom Hubregtsen thubregt...@gmail.com wrote: Stack trace: 15/03/30 17:37:30 INFO storage.BlockManagerMaster: Registered BlockManager Exception in thread main java.lang.IllegalArgumentException: Log directory ~/spark/spark-events does not exist. -- Marcelo -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
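For anyone reproducing this, a minimal sketch of the relevant settings (standard spark.eventLog.* keys; the path is the one from this thread, with an explicit file: scheme since Spark does not expand shell-style ~ paths):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("EventLogTest")
  .set("spark.eventLog.enabled", "true")
  // Explicit scheme: avoids ~ expansion problems and any ambiguity
  // about which filesystem the directory is resolved against.
  .set("spark.eventLog.dir", "file:/home/hduser/spark/spark-events")
val sc = new SparkContext(conf)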
RE: Reading a large file (binary) into RDD
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a specific structure. I outline it below. The input file is basically a representation of a graph.

INT
INT (A)
LONG (B)
A INTs (Degrees)
A SHORTINTs (Vertex_Attribute)
B INTs
B INTs
B SHORTINTs
B SHORTINTs

A - number of vertices
B - number of edges (note that the INTs/SHORTINTs associated with this are edge attributes)

After reading in the file, I need to create two RDDs (one with vertices and the other with edges). On Thu, Apr 2, 2015 at 4:46 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once read in that would be useful! - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than the actual computation time). Also, I am short of memory at the driver when it has to read the entire file. On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100,5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help? - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
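binaryRecords is also exposed on the Scala SparkContext; a sketch for the fixed-size portions of a file like this (the single-INT record layout below is a simplified assumption, not the exact graph format above - the variable-size sections are what push you toward a custom InputFormat, as Yong suggests):

import java.nio.{ByteBuffer, ByteOrder}

// Suppose each degree record is a single 4-byte big-endian INT.
val recordLength = 4
val records = sc.binaryRecords("hdfs:///data/degrees.bin", recordLength)

// Decode each fixed-size byte record into an Int.
val degrees = records.map { bytes =>
  ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt
}
degrees.take(5).foreach(println)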
Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available
Hi Young, Sorry for the duplicate post; I wanted to reply to all. I just downloaded the bits prebuilt from the apache spark download site. Started the spark shell and got the same error. I then started the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar --jars $(echo ~/Downloads/apache-hive-0.13.1-bin/lib/*.jar | tr ' ' ',')

This worked, or at least got rid of this:

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n]

Still getting the ClassNotFoundException, json_tuple, from this statement, same as in 1.2.1:

sql("""SELECT path, name, value, v1.peValue, v1.peName
       FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue""").collect.foreach(println(_))

15/04/02 20:50:14 INFO ParseDriver: Parsing command: SELECT path, name, value, v1.peValue, v1.peName FROM metric_table lateral view json_tuple(pathElements, 'name', 'value') v1 as peName, peValue
15/04/02 20:50:14 INFO ParseDriver: Parse Completed
java.lang.ClassNotFoundException: json_tuple at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83) at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

Any ideas on the json_tuple exception? Modified the syntax to take into account some minor changes in 1.3. The one posted this morning was from my 1.2.1 test.

import sqlContext.implicits._

case class MetricTable(path: String, pathElements: String, name: String, value: String)

val mt = new MetricTable("/DC1/HOST1/",
  """[{"node": "DataCenter","value": "DC1"},{"node": "host","value": "HOST1"}]""",
  "Memory Usage (%)",
  "29.590943279257175")
val rdd1 = sc.makeRDD(List(mt))
val df = rdd1.toDF
df.printSchema
df.show
df.registerTempTable("metric_table")
sql("""SELECT path, name, value, v1.peValue, v1.peName
       FROM metric_table
       lateral view json_tuple(pathElements, 'name', 'value') v1
         as peName, peValue""").collect.foreach(println(_))

On Thu, Apr 2, 2015 at 8:21 PM, java8964 java8...@hotmail.com wrote: Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark 1.2 branch https://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala You can see that on line 24: import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} is introduced on the 1.3 branch. The error basically means the com.google.common.cache package cannot be found in the runtime classpath. Either you and I made the same mistake when we built Spark 1.3.0, or there is something wrong with the Spark 1.3 pom.xml file.

Here is how I built 1.3.0:
1) Download the spark 1.3.0 source
2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests

Is this only because I built against Hadoop 1.x? Yong -- Date: Thu, 2 Apr 2015 13:56:33 -0400 Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available From: tsind...@gmail.com To: user@spark.apache.org I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@79c849c7

scala> import sqlContext._
import sqlContext._

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain
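A quick sketch (mine, not from the thread) to check from the spark-shell whether the guava cache classes that HiveMetastoreCatalog needs are actually visible at runtime:

// Throws ClassNotFoundException if the guava cache package is
// missing from the runtime classpath.
val c = Class.forName("com.google.common.cache.CacheBuilder")
// Shows which jar it was loaded from (may print null for boot-classpath jars).
println(c.getProtectionDomain.getCodeSource)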
Re: [SQL] Simple DataFrame questions
For cast, you can use the selectExpr method. For example:

df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2")

Or: df.select(df("colA").cast("int"), ...)

On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com wrote:

val df = Seq(("test", 1)).toDF("col1", "col2")

You can use SQL style expressions as a string:

df.filter("col1 IS NOT NULL").collect()
res1: Array[org.apache.spark.sql.Row] = Array([test,1])

Or you can also reference columns using df("colName") or $"colName" or col("colName"):

df.filter(df("col1") === "test").collect()
res2: Array[org.apache.spark.sql.Row] = Array([test,1])

On Thu, Apr 2, 2015 at 7:45 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not a setter on DF. 2. I am trying to use the syntax used in various blog posts but can't figure out how to reference a column by name:

scala> df.filter("customer_id" != "")
<console>:23: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (Boolean)
              df.filter("customer_id" != "")

3. what would be the recommended way to drop a row containing a null value -- is it possible to do this:

scala> df.filter("customer_id IS NOT NULL")
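Putting the two answers together, a sketch covering all three questions against the columns from the original post (the amount column is my assumption, for illustrating a second cast):

// 1. Cast string columns coming out of the csv package
val typed = df.selectExpr(
  "cast(customer_id as int) as customer_id",
  "cast(amount as double) as amount")

// 2. Reference a column by name inside filter (Column, not a bare String)
val nonEmpty = df.filter(df("customer_id") !== "")

// 3. Drop rows with NULLs via a SQL-style predicate string
val nonNull = df.filter("customer_id IS NOT NULL")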
Re: Spark Streaming Worker runs out of inodes
Are you saying that even with spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state mgmt is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it.

SPARK_WORKER_OPTS in spark-env.sh on every worker node:
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval
spark.worker.cleanup.appDataTtl

Also tried on the driver side:
spark.cleaner.ttl
spark.shuffle.consolidateFiles true

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Worker-runs-out-of-inodes-tp22355.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark 1.2 branch https://github.com/apache/spark/blob/branch-1.2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala You can see that on line 24: import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache} is introduced on the 1.3 branch. The error basically means the com.google.common.cache package cannot be found in the runtime classpath. Either you and I made the same mistake when we built Spark 1.3.0, or there is something wrong with the Spark 1.3 pom.xml file.

Here is how I built 1.3.0:
1) Download the spark 1.3.0 source
2) make-distribution --targz -Dhadoop.version=1.1.1 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests

Is this only because I built against Hadoop 1.x? Yong

Date: Thu, 2 Apr 2015 13:56:33 -0400 Subject: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available From: tsind...@gmail.com To: user@spark.apache.org I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error:

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val sqlContext = new HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@79c849c7

scala> import sqlContext._
import sqlContext._

scala> case class MetricTable(path: String, pathElements: String, name: String, value: String)
scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature in HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling HiveMetastoreCatalog.class. That entry seems to have slain the compiler. Shall I replay your session? I can re-run each line except the last one. [y/n]
Abandoning crashed session.

I entered the shell as follows:

./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar

hive-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>***</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
  </property>
</configuration>

I have downloaded a clean version of 1.3.0 and tried it again, but got the same error. Is this a known issue? Or a configuration issue on my part? TIA for the assistance. -Todd
ArrayBuffer within a DataFrame
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
Re: How to learn Spark ?
The best way of learning spark is to use spark. You may follow the instructions on the apache spark website: http://spark.apache.org/docs/latest/ Download and deploy it in standalone mode, run some examples, try cluster deploy mode, then try to develop your own app and deploy it in your spark cluster. And it's better to learn scala well if you wanna dive into spark. Also there are some books about spark. Thanks & Best regards! San.Luo ----- Original Message ----- From: Star Guo st...@ceph.me To: user@spark.apache.org Subject: How to learn Spark ? Date: April 2, 2015, 16:19 Hi, all I am new here. Could you give me some suggestions for learning Spark ? Thanks. Best Regards, Star Guo
Re: Connection pooling in spark jobs
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this. The database in question is Oracle Thanks, Sateesh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Error in SparkSQL/Scala IDE
It failed to find the class class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:39 AM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile(/tmp/emp.csv, 4) .map ( _.split(,) ) .map( row= Employee(row(0),row(1), row(2), row(3), row( 4))) empFile.registerTempTable(Employees) Thanks Sathish Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-library.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-reflect.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-actor.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-swing.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-compiler.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal( MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound( MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass( Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply( ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute( TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor( ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor( ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.svairavelu.examples.QueryCSV$.main(QueryCSV.scala:24) at com.svairavelu.examples.QueryCSV.main(QueryCSV.scala)
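If the Scala IDE project is sbt-based, a sketch of the dependency block that puts spark-sql (which contains ScalaReflection) on the classpath, matching the 1.2.1 version mentioned above:

// build.sbt
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-sql"  % "1.2.1"
)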
Re: there are about 50% all-zero vector in the als result
yes! thank you very much :-) On Apr 2, 2015, at 7:13 PM, Sean Owen so...@cloudera.com wrote: Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not the final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote: NO, I'm referring to the result. You mean there might be that many zero features in the als result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). I looked into ALS.scala; the user and product factors seem to be initialized by a gaussian distribution, so they should not be all-zero vectors, right? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Error in SparkSQL/Scala IDE
Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile(/tmp/emp.csv, 4) .map ( _.split(,) ) .map( row= Employee(row(0),row(1), row(2), row(3), row(4 ))) empFile.registerTempTable(Employees) Thanks Sathish Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-library.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-reflect.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-actor.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-swing.jar:/Applications/eclipse/plugins/org.scala-ide.scala210.jars_4.0.0.201412161056/target/jars/scala-compiler.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.7.0_65.jdk/Contents/Home/jre/classes] not found. at scala.reflect.internal.MissingRequirementError$.signal( MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound( MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass( Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass( Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply( ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute( TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor( ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor( ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor( ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor( ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.svairavelu.examples.QueryCSV$.main(QueryCSV.scala:24) at com.svairavelu.examples.QueryCSV.main(QueryCSV.scala)
Re: Connection pooling in spark jobs
Right, I am aware on how to use connection pooling with oracle, but the specific question is how to use it in the context of spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this. The database in question is Oracle Thanks, Sateesh
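One common pattern, sketched here under assumed names (ConnectionPool is a placeholder object, not an existing library, and rdd stands in for whatever RDD the job produces): initialize one pool lazily per executor JVM and borrow a connection per partition, rather than trying to ship a pool object from the driver.

import java.sql.Connection

// Placeholder holder object: the lazy val is evaluated once per executor
// JVM, so each executor gets exactly one pool.
object ConnectionPool {
  lazy val dataSource: javax.sql.DataSource = buildPool()
  private def buildPool(): javax.sql.DataSource =
    ??? // construct your pooled Oracle DataSource (e.g. a UCP or HikariCP setup) here
}

rdd.foreachPartition { rows =>
  val conn: Connection = ConnectionPool.dataSource.getConnection // borrow
  try {
    rows.foreach { row =>
      // ... use conn for this row ...
    }
  } finally {
    conn.close() // returns the connection to the pool
  }
}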
[SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));
Hi, I want to rename an aggregation field using the DataFrame API. The aggregation is done on a nested field, but I got the exception below. Do you see the same issue, and is there any workaround? Thank you very much!

==
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));
 at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
 at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:162)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
 at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
 at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
 at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:244)
 at org.apache.spark.sql.DataFrame$$anonfun$3.apply(DataFrame.scala:243)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at org.apache.spark.sql.DataFrame.toDF(DataFrame.scala:243)
==

And this code can be used to reproduce the issue:

case class ChildClass(q: Long)
case class ParentClass(k: String, p: ChildClass)

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("DFTest").setMaster("local[*]")
  val ctx = new SparkContext(conf)
  val sqlCtx = new HiveContext(ctx)
  import sqlCtx.implicits._
  val source = ctx.makeRDD(Seq(ParentClass("c1", ChildClass(100)))).toDF
  import org.apache.spark.sql.functions._
  val target = source.groupBy('k).agg('k, sum("p.q"))
  // This line prints the correct contents
  // k  SUM('p.q)
  // c1 100
  target.show
  // But this line triggers the exception
  target.toDF("key", "total")
}
==
Re: Error reading smallint in hive table with parquet format
No, my company is using the Cloudera distribution, and 1.2.0 is the latest version of Spark available there. Thanks On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust mich...@databricks.com wrote: Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote: Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:

CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1

field1 is SMALLINT. If table1 is in text format everything is OK, but if table1 is in parquet format, Spark returns the following error:

15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in stage 1.0 failed 1 times, most recent failure: Lost task 26.0 in stage 1.0 (TID 28, localhost): java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Short at org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Thanks! -- Regards. Miguel Ángel -- Saludos. Miguel Ángel
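One workaround that may help while staying on Spark 1.2 is to widen the column at write time so the Parquet SerDe never goes through the Short object inspector. This is only a sketch of that idea, assuming downstream consumers of testTable can tolerate field1 becoming INT; it has not been verified against this exact code path:

// run through a HiveContext, as in the original report
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(
  "CREATE TABLE testTable STORED AS PARQUET AS " +
  "SELECT CAST(field1 AS INT) AS field1 FROM table1")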
Matrix Transpose
Hello! I have a CSV file with the following content: C1;C2;C3 11;22;33 12;23;34 13;24;35 What is the best approach in Spark (API, MLlib) for computing its transpose? C1 11 12 13 C2 22 23 24 C3 33 34 35 I look forward to your solutions and suggestions (some Scala code would be really helpful). Thanks. Florin P.S. In reality my matrix has more than 1000 columns and more than 1 million rows.
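One common approach at this scale is to key every cell by its column index and regroup. A minimal Scala sketch, assuming the ;-separated layout shown above (the header row transposes along with the data, which matches the desired output) and that a single column's values, one per row, fit in an executor's memory:

import org.apache.spark.SparkContext._ // pair-RDD operations (needed explicitly before Spark 1.3)

val rows = sc.textFile("matrix.csv").map(_.split(";"))
// tag each cell with (columnIndex, (rowIndex, value))
val cells = rows.zipWithIndex.flatMap { case (fields, rowIdx) =>
  fields.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
}
// each group, ordered by original row index, is one row of the transpose
val transposed = cells.groupByKey()
  .map { case (colIdx, col) => (colIdx, col.toSeq.sortBy(_._1).map(_._2).mkString(" ")) }
  .sortByKey()
  .map(_._2)
transposed.saveAsTextFile("transposed")

For a million rows, each transposed row holds a million cells, so groupByKey is the limiting factor; if a full column does not fit on one executor, a block-partitioned scheme (transposing fixed-size sub-matrices and reassembling them) scales further.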
Re: there are about 50% all-zero vectors in the ALS result
No, I'm referring to the result. You mean there might be many zero features in the ALS result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring to the initialization, not the result, right? It's possible that the resulting weight vectors are sparse, although this looks surprising to me. But it is not related to the initial state, right? On Thu, Apr 2, 2015 at 10:43 AM, lisendong lisend...@163.com wrote: I found that there are about 50% all-zero vectors in the ALS result (both in userFeatures and productFeatures). The result looks like this: PastedGraphic-1.tiff (attachment) I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
Re: there are about 50% all-zero vectors in the ALS result
Right, I asked because in your original message you were looking at the initialization to a random vector. But that is the initial state, not the final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote: No, I'm referring to the result. You mean there might be many zero features in the ALS result? I think it is not related to the initial state, but I do not know why the percentage of zero vectors is so high (around 50%). I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
Mllib kmeans #iteration
Hello, I am running the KMeans algorithm from MLlib in cluster mode, and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks
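If the goal is simply to bound the work, the maxIterations argument of KMeans.train does this: Lloyd iterations stop at that bound, or earlier if the centers converge first. A minimal sketch with a hypothetical data path and cluster count:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// assumes a text file of space-separated numeric features, one point per line
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

val k = 5               // hypothetical number of clusters
val maxIterations = 20  // hard upper bound on the number of Lloyd iterations
val model = KMeans.train(points, k, maxIterations)
println(model.clusterCenters.mkString("\n"))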
Re: How to learn Spark ?
You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Connection pooling in spark jobs
Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Re: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished
Hey Christopher, I'm working with Teng on this issue. Thank you for the explanation. I tried both workarounds: just leaving hive.metastore.warehouse.dir empty does not do anything; the tmp data is still written to S3 and the job attempts to rename/copy+delete from S3 to S3. But anyway, since the desired effect of this setting was not working before, we will discard it, so it will be empty in the future. I tried your approach of copying the Hive jars to the Spark classpath. The query then did not execute at all, breaking with this error message: errorMessage:java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.exec.Utilities.deserializeObjectByKryo(com.esotericsoftware.kryo.Kryo, java.io.InputStream, java.lang.Class)) I used a super simple query to reproduce this: SELECT CONCAT('some_string', some_string_col) FROM some_table; We suspect that this comes from too old a Hive version having been used to compile the Spark build in your GitHub project, or from other version mismatches. We suggest recompiling your Spark version against the AWS Hive version, which already implements the Hive adaptations you mentioned. Or what do you think? Cheers Fabian 2015-04-02 10:19 GMT+02:00 Teng Qiu teng...@gmail.com: -- Forwarded message -- From: Bozeman, Christopher bozem...@amazon.com Date: 2015-04-01 22:43 GMT+02:00 Subject: RE: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished To: chutium teng@gmail.com, user@spark.apache.org Teng, There is no need to alter hive.metastore.warehouse.dir. Leave it as is and just create external tables with location pointing to S3. What I suspect you are seeing is that spark-sql is writing to a temp directory within S3 and then issuing a rename to the final location, as would be done with HDFS. But in S3 there is no rename operation, so there is a performance hit as S3 performs a copy then delete. I tested 1TB from/to S3 external tables and it worked; there is just the additional delay for the rename (copy). EMR has modified Hive to avoid the expensive rename, and you can take advantage of this too with Spark SQL, by copying the EMR Hive jars into the Spark classpath, like: /bin/ls /home/hadoop/.versions/hive-*/lib/*.jar | xargs -n 1 -I %% cp %% ~/spark/classpath/emr Please note that since EMR Hive is 0.13 at this time, this does break some other features already supported by spark-sql with the built-in Hive library (for example, Avro support). So if you use this workaround for better-performing writes to S3, be sure to test your use case. Thanks Christopher -Original Message- From: chutium [mailto:teng@gmail.com] Sent: Wednesday, April 01, 2015 9:34 AM To: user@spark.apache.org Subject: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished Hi, we always get issues inserting or creating tables with the Amazon EMR Spark version: when inserting about a 1GB result set, the Spark SQL query never finishes; inserting a small result set (like 500MB) works fine. spark.sql.shuffle.partitions at the default 200, or set spark.sql.shuffle.partitions=1, does not help. The log stops at: 15/04/01 15:48:13 INFO s3n.S3NativeFileSystem: rename s3://hive-db/tmp/hive-hadoop/hive_2015-04-01_15-47-43_036_1196347178448825102-15/-ext-1 s3://hive-db/db_xxx/some_huge_table/ then only metrics.MetricsSaver logs.
we set <property><name>hive.metastore.warehouse.dir</name><value>s3://hive-db</value></property> but hive.exec.scratchdir is not set; I have no idea why the tmp files were created in s3://hive-db/tmp/hive-hadoop/. We just tried the newest Spark 1.3.0 on AMI 3.5.x and AMI 3.6 ( https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md ), and it still does not work. Has anyone hit the same issue? Any idea how to fix it? I believe Amazon EMR's Spark version uses com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem to access S3, not the original Hadoop s3n implementation, right? /home/hadoop/spark/classpath/emr/* and /home/hadoop/spark/classpath/emrfs/* are in the classpath, btw. Is there any plan to use the new Hadoop s3a implementation instead of s3n? Thanks for any help. Teng -- Fabian Wollert Business Intelligence POSTAL ADDRESS Zalando SE 11501 Berlin
Re: there are about 50% all-zero vectors in the ALS result
Oh, I found the reason. According to the ALS optimization formula: if all of a user's ratings are zero, that is, R(i, Ii) is a zero matrix, then the final feature vector for this user will be an all-zero vector… On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring to the initialization, not the result, right? It's possible that the resulting weight vectors are sparse, although this looks surprising to me. But it is not related to the initial state, right? On Thu, Apr 2, 2015 at 10:43 AM, lisendong lisend...@163.com wrote: I found that there are about 50% all-zero vectors in the ALS result (both in userFeatures and productFeatures). The result looks like this: PastedGraphic-1.tiff (attachment) I looked into ALS.scala; the user and product factors seem to be initialized from a Gaussian distribution, so they should not be all-zero vectors, right?
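For completeness, the algebra behind this observation, sketched from the standard explicit-feedback ALS normal equations (not from the exact code path in ALS.scala): each half-step solves a ridge regression for user $u$ against the item-factor matrix $Y$,

$$x_u = (Y^\top Y + \lambda I)^{-1} Y^\top r_u,$$

so a user whose rating vector $r_u$ is all zeros gets $x_u = 0$ exactly, regardless of the Gaussian initialization of the factors.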
Re: How to learn Spark ?
You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: How to learn Spark ?
I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: Connection pooling in spark jobs
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection? If you need a solution that limits the number of open connections [resource starvation on the DB server] I think you'd have to fake it with a centralized counter of active connections, and logic within each executor that blocks when the counter is at a given threshold. If the counter is not at threshold, then an active connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources. On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com wrote: But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Re: ArrayBuffer within a DataFrame
Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com wrote: A lateral view explode using HiveQL. I'm hoping to add explode shorthand directly to the df API in 1.4. On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee denny.g@gmail.com wrote: Quick question - the output of a DataFrame is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!
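For anyone landing on this thread later, a minimal sketch of the HiveQL Michael describes; the table and column names are hypothetical, and a HiveContext is assumed since LATERAL VIEW is Hive syntax:

// df has columns "month" (string) and "items" (array of strings)
df.registerTempTable("monthly_items")
val flattened = hiveContext.sql(
  "SELECT month, item FROM monthly_items LATERAL VIEW explode(items) t AS item")
flattened.show()
// 2015-04 A
// 2015-04 B
// ... one output row per array element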
RE: Cannot run the example in the Spark 1.3.0 following the document
Looks like a typo; try: df.select(df("name"), df("age") + 1) or df.select("name", "age") PRs to fix docs are always appreciated :) On Apr 2, 2015 7:44 PM, java8964 java8...@hotmail.com wrote: The import command was already run. Forgot to mention, the rest of the examples related to df all work; just this one caused a problem. Thanks Yong -- Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the Spark 1.3.0 following the document Hi, there you may need to add: import sqlContext.implicits._ Best, Sun -- fightf...@163.com From: java8964 java8...@hotmail.com Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this:

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

But what I got on my Spark 1.3.0 is the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_43)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1c845f64
scala> val df = sqlContext.jsonFile("/user/yzhang/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
scala> df.select("name", df("age") + 1).show()
<console>:30: error: overloaded method value select with alternatives:
  (col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
 cannot be applied to (String, org.apache.spark.sql.Column)
       df.select("name", df("age") + 1).show()
          ^

Is this a bug in Spark 1.3.0, or is my build having some problem? Thanks
Re: Connection pooling in spark jobs
But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
Fwd:
Actually they may not be sequentially generated, and the list (RDD) could also come from a different component. For example, from this RDD: (105,918) (105,757) (502,516) (105,137) (516,816) (350,502) I would like to separate it into two RDDs: 1) (105,918) (502,516) 2) (105,757) (105,137) (516,816) (350,502) Right now I am using a mutable Set variable to track the elements already selected. After coalescing the RDD to a single partition I am doing something like:

val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations.filter(p => {
  if (!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
    evalCombinations += p._1; evalCombinations += p._2; true
  } else false
})

This approach is limited by the memory of the executor it runs on. I'd appreciate any better, more scalable solution. Thanks On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: You're generating all possible pairs? In that case, why not just generate the sequential pairs you want from the start? On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary himan...@gmail.com wrote: It will only give (A,B). I am generating the pairs from combinations of the strings A, B, C and D, so the pairs (ignoring order) would be (A,B),(A,C),(A,D),(B,C),(B,D),(C,D) On successful filtering using the original condition it will transform to (A,B) and (C,D) On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F) I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used once. So something like: (A,B) (C,D) (E,F) (B,C) is out because B has already been used in (A,B), (A,D) is out because A (and D) has been used, etc. I was thinking of an option of using a shared variable to keep track of what has already been used, but that may only work for a single partition and would not scale for larger datasets. Is there any other efficient way to accomplish this? -- Thanks & Regards Himanish
Matei Zaharia: Reddit Ask Me Anything
Ask Me Anything about Apache Spark & big data: Reddit AMA with Matei Zaharia, Friday, April 3 at 9AM PT / 12PM ET. Details can be found here: http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama
Re: Connection pooling in spark jobs
Each executor runs for about 5 seconds, during which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling surely has its advantages: performance, and not hitting the db server with an open/close for every job. The database in question is not just used by the Spark jobs but is shared by other systems, so the Spark jobs have to be better at managing their resources. I am not really looking for a db connection counter (I will let the db handle that part), but rather to have a pool of connections on the Spark end so that the connections can be reused across jobs On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke charles.fed...@gmail.com wrote: How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connection? If you need a solution that limits the number of open connections [resource starvation on the DB server] I think you'd have to fake it with a centralized counter of active connections, and logic within each executor that blocks when the counter is at a given threshold. If the counter is not at threshold, then an active connection can be created (after incrementing the shared counter). You could use something like ZooKeeper to store the counter value. This would have the overall effect of decreasing performance if your required number of connections outstrips the database's resources. On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri sateesh.kav...@gmail.com wrote: But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we would like to achieve is for each of the jobs to obtain a connection from a db pool. Any directions on how this can be achieved? -- Sateesh On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger c...@koeninger.org wrote: Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... } } See Design Patterns for using foreachRDD in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best-known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
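One way to get reuse across jobs is to scope the pool to the executor JVM rather than to any single job: a Scala object is initialized lazily on each executor and lives as long as that executor does, so successive jobs submitted to the same long-running SparkContext (as with spark-jobserver) share its connections. A rough sketch, with the JDBC URL, credentials, and the RDD of SQL statements all placeholders; a real pooling library would add validation and eviction on top of this:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.LinkedBlockingQueue

// One instance per executor JVM; reused by every task and every job scheduled there.
object ExecutorConnectionPool {
  private val url = "jdbc:oracle:thin:@//dbhost:1521/service" // placeholder
  private val user = "user"                                   // placeholder
  private val pass = "pass"                                   // placeholder
  private val idle = new LinkedBlockingQueue[Connection](4)   // cap on idle connections

  def borrow(): Connection =
    Option(idle.poll()).getOrElse(DriverManager.getConnection(url, user, pass))

  // hand the connection back, closing it if the idle queue is already full
  def release(c: Connection): Unit = if (!idle.offer(c)) c.close()
}

// rdd: RDD[String] of SQL statements, purely for illustration
rdd.foreachPartition { partition =>
  val conn = ExecutorConnectionPool.borrow()
  val stmt = conn.createStatement()
  try partition.foreach(sql => stmt.executeUpdate(sql))
  finally { stmt.close(); ExecutorConnectionPool.release(conn) }
}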
Re: Reading a large file (binary) into RDD
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than the actual computation time). Also, I am short of memory at the driver when it has to read the entire file. On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: If it's a flat binary file and each record is the same length (in bytes), you can use Spark's binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here's an example in Python to show how it works:

# write data from an array
from numpy import random
dat = random.randn(100, 5)
f = open('test.bin', 'w')
f.write(dat)
f.close()

# load the data back in
from numpy import frombuffer
nrecords = 5
bytesize = 8
recordsize = nrecords * bytesize
data = sc.binaryRecords('test.bin', recordsize)
parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))

# these should be equal
parsed.first()
dat[0,:]

Does that help? - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
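binaryRecords can still apply with mixed field widths as long as every record has the same total byte length: the executors read the file splits in parallel (nothing goes through the driver), and each record arrives as a byte array to decode however the writer laid it out. A sketch with a hypothetical 10-byte layout of one 2-byte short followed by one 8-byte long; the real offsets and byte order must match whatever produced the file:

import java.nio.{ByteBuffer, ByteOrder}

val recordSize = 10 // hypothetical: 2-byte short + 8-byte long per record
val raw = sc.binaryRecords("hdfs:///path/to/data.bin", recordSize)
val parsed = raw.map { bytes =>
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN) // match the writer's endianness
  (buf.getShort(), buf.getLong())
}
parsed.take(1) // decode a single record as a sanity check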
Re: workers no route to host
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT, which expects to find the hadoop command: /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found If you don't want to use Hadoop, download one of the pre-built Spark releases from spark.apache.org. Even the Hadoop builds there will work okay, as they don't actually attempt to run Hadoop commands. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Tue, Mar 31, 2015 at 3:12 AM, ZhuGe t...@outlook.com wrote: Hi, I set up a standalone cluster of 5 machines (tmaster, tslave1,2,3,4) with spark-1.3.0-cdh5.4.0-snapshot. When I execute sbin/start-all.sh, the master is OK, but I can't see the web UI. Moreover, the worker logs show something like this: Spark assembly has been built with Hive, including Datanucleus jars on classpath /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found Spark Command: java -cp :/data/PlatformDep/cdh5/dist/sbin/../conf:/data/PlatformDep/cdh5/dist/lib/spark-assembly-1.3.0-cdh5.4.0-SNAPSHOT-hadoop2.6.0-cdh5.4.0-SNAPSHOT.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-rdbms-3.2.1.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-api-jdo-3.2.1.jar:/data/PlatformDep/cdh5/dist/lib/datanucleus-core-3.2.2.jar: -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://192.168.128.16:7071 --webui-port 8081 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/03/31 06:47:22 INFO Worker: Registered signal handlers for [TERM, HUP, INT] 15/03/31 06:47:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 15/03/31 06:47:23 INFO SecurityManager: Changing view acls to: dcadmin 15/03/31 06:47:23 INFO SecurityManager: Changing modify acls to: dcadmin 15/03/31 06:47:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dcadmin); users with modify permissions: Set(dcadmin) 15/03/31 06:47:23 INFO Slf4jLogger: Slf4jLogger started 15/03/31 06:47:23 INFO Remoting: Starting remoting 15/03/31 06:47:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@tslave2:60815] 15/03/31 06:47:24 INFO Utils: Successfully started service 'sparkWorker' on port 60815. 15/03/31 06:47:24 INFO Worker: Starting Spark worker tslave2:60815 with 2 cores, 3.0 GB RAM 15/03/31 06:47:24 INFO Worker: Running Spark version 1.3.0 15/03/31 06:47:24 INFO Worker: Spark home: /data/PlatformDep/cdh5/dist 15/03/31 06:47:24 INFO Server: jetty-8.y.z-SNAPSHOT 15/03/31 06:47:24 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:8081 15/03/31 06:47:24 INFO Utils: Successfully started service 'WorkerUI' on port 8081. 15/03/31 06:47:24 INFO WorkerWebUI: Started WorkerWebUI at http://tslave2:8081 15/03/31 06:47:24 INFO Worker: Connecting to master akka.tcp://sparkMaster@192.168.128.16:7071/user/Master...
15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: No route to host ] 15/03/31 06:47:24 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@tslave2:60815] -> [akka.tcp://sparkMaster@192.168.128.16:7071]: Error [Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@192.168.128.16:7071] the worker machines
conversion from java collection type to scala JavaRDD<Object>
Hi All, Is there a way to make a JavaRDD<Object> from an existing Java collection of type List<Object>? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra
Re: Spark, snappy and HDFS
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

On Thu, Apr 2, 2015 at 12:51 AM, Sean Owen so...@cloudera.com wrote: Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro, or you can do it manually like this. This is not Spark-specific. The second question also isn't Spark-specific; you do not have a SequenceFile of byte[] / String, but of byte[] / byte[]. Review what you are writing, since it is not BytesWritable / Text. On Thu, Apr 2, 2015 at 3:40 AM, Nick Travers n.e.trav...@gmail.com wrote: I'm actually running this in a separate environment to our HDFS cluster. I think I've been able to sort out the issue by copying /opt/cloudera/parcels/CDH/lib to the machine I'm running this on (I'm just using a one-worker setup at present) and adding the following to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/nickt/lib/hadoop/lib/snappy-java-1.0.4.1.jar

I can get past the previous error. The issue now seems to be with what is being returned.

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

returns the following error: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.io.Text On Wed, Apr 1, 2015 at 7:34 PM, Xianjin YE advance...@gmail.com wrote: Do you have the same hadoop config for all nodes in your cluster (you run it in a cluster, right?)? Check the node (usually the executor) which gives the java.lang.UnsatisfiedLinkError to see whether libsnappy.so is in the hadoop native lib path. On Thursday, April 2, 2015 at 10:22 AM, Nick Travers wrote: Thanks for the super quick response! I can read the file just fine in Hadoop; it's just when I point Spark at this file that it can't seem to read it, due to the missing Snappy jars / .so's. I'm playing around with adding some things to the spark-env.sh file, but still nothing. On Wed, Apr 1, 2015 at 7:19 PM, Xianjin YE advance...@gmail.com wrote: Can you read the snappy compressed file in HDFS? Looks like libsnappy.so is not in the hadoop native lib path. On Thursday, April 2, 2015 at 10:13 AM, Nick Travers wrote: Has anyone else encountered the following error when trying to read a snappy compressed sequence file from HDFS? java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z The following works for me when the file is uncompressed:

import org.apache.hadoop.io._
val hdfsPath = "hdfs://nost.name/path/to/folder"
val file = sc.sequenceFile[BytesWritable,String](hdfsPath)
file.count()

but fails when the encoding is Snappy.
I've seen some stuff floating around on the web about having to explicitly enable support for Snappy in Spark, but it doesn't seem to work for me: http://www.ericlin.me/enabling-snappy-support-for-sharkspark
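Following Sean's diagnosis that both the key and the value are byte arrays, the fix on the Spark side would be to request BytesWritable on both sides of the pair, copying the bytes out because Hadoop reuses Writable instances across records. A sketch against the path from the thread:

import org.apache.hadoop.io.BytesWritable

val hdfsPath = "hdfs://nost.name/path/to/folder"
val pairs = sc.sequenceFile[BytesWritable, BytesWritable](hdfsPath)
  .map { case (k, v) => (k.copyBytes(), v.copyBytes()) } // copy out of the reused Writables
pairs.count()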
Spark SQL. Memory consumption
Hi. I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
SELECT
 field1, field2, field3, field4, field5,
 COUNT(1) AS field6,
 MAX(field7),
 MIN(field8),
 SUM(field9 / 100),
 COUNT(field10),
 SUM(IF(field11 -500, 1, 0)),
 MAX(field12),
 SUM(IF(field13 = 1, 1, 0)),
 SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0)),
 SUM(IF(field13 = 2012, 1, 0)),
 SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
FROM table1 CL
JOIN table2 netw ON CL.field15 = netw.id
WHERE AND field3 IS NOT NULL
 AND field4 IS NOT NULL
 AND field5 IS NOT NULL
GROUP BY field1, field2, field3, field4, netw.field5

spark-submit --master spark://master:7077 --driver-memory 20g --executor-memory 60g --class GMain project_2.10-1.0.jar --driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*' --driver-java-options '-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*' 2> ./error

Input data is 8GB in parquet format. It crashes many times with GC overhead errors. I've fixed spark.shuffle.partitions to 1024, but my worker nodes (with 128GB RAM per node) collapse. Is this query too difficult for Spark SQL? Would it be better to do it in Spark? Am I doing something wrong? Thanks -- Regards. Miguel Ángel
Re: How to learn Spark ?
Thank you! I'll begin with it. Best Regards, Star Guo I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 8:33 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Spark Streaming Error in block pushing thread
I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Error-in-block-pushing-thread-tp22356.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming Error in block pushing thread
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote: Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- Bill Young Threat Stack | Senior Infrastructure Engineer http://www.threatstack.com
Re: Spark Streaming Error in block pushing thread
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young bill.yo...@threatstack.com wrote: Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote: Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at 
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- Bill Young Threat Stack | Senior Infrastructure Engineer http://www.threatstack.com
Re: From DataFrame to LabeledPoint
Hi, try the following code:

val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map {
  case (Row(feature1, feature2, ...), Row(label)) =>
    LabeledPoint(label, Vectors.dense(feature1, feature2, ...))
}

Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello! I have had a question for days. I am working with DataFrames, and with Spark SQL I imported a JSON file: val df = sqlContext.jsonFile("file.json") In this JSON I have the label and the features. I selected them: val features = df.select("feature1", "feature2", "feature3", ...); val labels = df.select("cassification") But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!
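A more concrete variant of Peter's sketch that avoids the zip entirely by selecting the label and the features in one pass. The column names are the (hypothetical) ones from the original post, and the fields are assumed to have parsed as doubles; with jsonFile, integers come back as longs, in which case row.getLong(i).toDouble applies instead:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeled = df.select("cassification", "feature1", "feature2", "feature3")
  .map { row =>
    LabeledPoint(row.getDouble(0),
      Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
  }
labeled.first() // ready for RandomForest.trainClassifier / trainRegressor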
Re How to learn Spark ?
So cool!! Thanks. Best Regards, Star Guo = You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Re: conversion from java collection type to scala JavaRDD<Object>
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:33 AM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, Is there a way to make a JavaRDD<Object> from an existing Java collection of type List<Object>? I know this can be done using Scala, but I am looking for how to do this using Java. Regards Jeetendra
Re: Spark Streaming Error in block pushing thread
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 11:45 AM, byoung bill.yo...@threatstack.com wrote: I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp:// sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]} WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org $apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at 
org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before?
Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
To clarify one thing, is count() the first action (http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As defined in the programming guide, an action forces evaluation of the pipeline of RDDs; it's only then that reading the data actually occurs. So count() might not be the issue, but some upstream step that attempted to read the file. As a sanity check, if you just read the text file and don't convert the strings, then call count(), does that work? If so, it might be something about your JavaBean BERecord after all. Can you post its definition? Calling take(1) to grab the first element should also work, even if the RDD is empty. (It will return an empty result in that case, not throw an exception.) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 10:16 AM, Ashley Rose ashley.r...@telarix.com wrote: [quoted message and stack trace elided; they appear in full in "RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException" below]
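A minimal version of that sanity check, sketched in Scala for brevity (the Java version with JavaSparkContext is analogous; the file name is taken from the quoted code):

// count the raw lines before converting them to beans; if this fails too,
// the problem is reading the file itself, not the convertToBER step
val lines = sc.textFile("file.txt")
println("raw line count: " + lines.count())
// take(1) on an empty RDD returns an empty array rather than throwing
lines.take(1).foreach(println)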
Spark Streaming Worker runs out of inodes
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files, and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receiver and a two-node cluster (2 x r3.large AWS instances). The batch interval is 10 minutes, after which we process data and write results to a DB. No windowing or state management is used. I've pored over the documentation and tried setting the following properties, but they have not helped. As a workaround we're using a cron script that periodically cleans up old files, but this has a bad smell to it. In SPARK_WORKER_OPTS in spark-env.sh on every worker node: spark.worker.cleanup.enabled true, spark.worker.cleanup.interval, spark.worker.cleanup.appDataTtl. Also tried on the driver side: spark.cleaner.ttl, spark.shuffle.consolidateFiles true.
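For reference, the spark.worker.cleanup.* settings are read as JVM system properties, so in spark-env.sh they take a form like the line below (a sketch; the interval and TTL values shown are just the documented defaults, in seconds, as an assumption about what was tried):

SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"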
Re: Re: How to learn Spark ?
Thanks a lot. I'll follow your suggestion. Best Regards, Star Guo = The best way of learning Spark is to use Spark. You may follow the instructions on the Apache Spark website: http://spark.apache.org/docs/latest/ Download it, deploy it in standalone mode, run some examples, try cluster deploy mode, and then try to develop your own app and deploy it in your Spark cluster. It's also better to learn Scala well if you want to dive into Spark. There are also some books about Spark. Thanks & Best regards! San.Luo - Original message - From: Star Guo st...@ceph.me To: user@spark.apache.org Subject: How to learn Spark ? Date: 2015-04-02 16:19 Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
A stream of json objects using Java
I'm reading a stream of string lines that are in JSON format. I'm using Java with Spark. Is there a way to get this from a transformation, so that I end up with a stream of JSON objects? I would also welcome any feedback about this approach or alternative approaches. thanks jk
Spark streaming error in block pushing thread
I am running a standalone Spark streaming cluster, connected to multiple RabbitMQ endpoints. The application will run for 20-30 minutes before raising the following error:
WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 23 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 20 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,951 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 19 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 18 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 17 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:53,952 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 16 - Ask timed out on [Actor[akka.tcp://sparkExecutor@10.1.242.221:43018/user/BlockManagerActor1#-1913092216]] after [3 ms]}
WARN 2015-04-01 21:00:54,151 org.apache.spark.streaming.scheduler.ReceiverTracker.logWarning.71: Error reported by receiver for stream 0: Error in block pushing thread - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:166) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:127) at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl$$anon$2.onPushBlock(ReceiverSupervisorImpl.scala:112) at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:182) at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87)
Has anyone run into this before? -- Bill Young Threat Stack | Infrastructure Engineer http://www.threatstack.com
RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
That’s precisely what I was trying to check. It should have 42577 records in it, because that’s how many there were in the text file I read in.
// Load a text file and convert each line to a JavaBean.
JavaRDD<String> lines = sc.textFile("file.txt");
JavaRDD<BERecord> tbBER = lines.map(s -> convertToBER(s));
// Apply a schema to an RDD of JavaBeans and register it as a table.
schemaBERecords = sqlContext.createDataFrame(tbBER, BERecord.class);
schemaBERecords.registerTempTable("tbBER");
The BERecord class is a standard JavaBean that implements Serializable, so that shouldn’t be the issue. As you said, count() shouldn’t fail like this even if the table was empty. I was able to print out the schema of the DataFrame just fine with df.printSchema(), and I just wanted to see if the data was populated correctly.
From: Dean Wampler [mailto:deanwamp...@gmail.com] Sent: Wednesday, April 01, 2015 6:05 PM To: Ashley Rose Cc: user@spark.apache.org Subject: Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException
Is it possible tbBER is empty? If so, it shouldn't fail like this, of course. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com
On Wed, Apr 1, 2015 at 5:57 PM, ARose ashley.r...@telarix.com wrote: Note: I am running Spark on Windows 7 in standalone mode. In my app, I run the following:
DataFrame df = sqlContext.sql("SELECT * FROM tbBER");
System.out.println("Count: " + df.count());
tbBER is registered as a temp table in my SQLContext. When I try to print the number of rows in the DataFrame, the job fails and I get the following error message: java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747) at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033) at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63) at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101) at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216) at org.apache.hadoop.io.UTF8.readString(UTF8.java:208) at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87) at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237) at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66) at org.apache.spark.SerializableWritable$$anonfun$readObject$1.apply$mcV$sp(SerializableWritable.scala:43) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1137) at org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) This only happens when I try to call df.count(). The rest runs fine. Is the count() function not supported in standalone mode? The stack trace makes it appear to be Hadoop functionality...
From DataFrame to LabeledPoint
Hello! I have had a question for some days now. I am working with DataFrames and Spark SQL. I imported a jsonFile:
val df = sqlContext.jsonFile("file.json")
In this JSON I have the label and the features. I selected them:
val features = df.select("feature1", "feature2", "feature3", ...)
val labels = df.select("classification")
But now I don't know how to create a LabeledPoint for RandomForest. I tried some solutions without success. Can you help me? Thanks for all!
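One way to build the LabeledPoints in Spark 1.3, as a minimal sketch assuming the label column is named "classification" and the selected columns are all doubles (column names taken from the question):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// select the label first so the column positions below are predictable;
// if jsonFile inferred a column as long, use row.getLong(i).toDouble instead
val data = df.select("classification", "feature1", "feature2", "feature3").map { row =>
  LabeledPoint(row.getDouble(0),
    Vectors.dense(row.getDouble(1), row.getDouble(2), row.getDouble(3)))
}
// data is an RDD[LabeledPoint], which RandomForest.trainClassifier accepts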
Re: A stream of json objects using Java
This just reduces to finding a library that can translate a String of JSON into a POJO, Map, or other representation of the JSON. There are loads of these, like Gson or Jackson. Sure, you can easily use these in a function that you apply to each JSON string in each line of the file. It's no different when this is run in Spark. On Thu, Apr 2, 2015 at 2:22 PM, James King jakwebin...@gmail.com wrote: [quoted message elided; it appears in full as "A stream of json objects using Java" above]
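As one concrete option, a Scala sketch using Jackson (Gson would look much the same, and the Java version is analogous); "lines" stands for the DStream of JSON strings from the question:

import com.fasterxml.jackson.databind.ObjectMapper

val parsed = lines.mapPartitions { iter =>
  // ObjectMapper is not serializable, so build one per partition
  // instead of capturing it in the task closure
  val mapper = new ObjectMapper()
  iter.map(line => mapper.readValue(line, classOf[java.util.Map[String, Object]]))
}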
Re: Setup Spark jobserver for Spark SQL
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika matha.har...@gmail.com wrote: Hi, I am trying to use the Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) for running Spark SQL jobs. I was able to start the server, but when I run my application (my Scala class, which extends SparkSqlJob), I get the following response: { "status": "ERROR", "result": "Invalid job type for this context" } Can anyone suggest what is going wrong, or provide a detailed procedure for setting up the jobserver for Spark SQL?
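For concreteness, a skeleton of a SQL job as the jobserver API looked around that time; the trait, package, and method signatures here are from memory, so treat them as assumptions to check against your jobserver version:

import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object ProbeSqlJob extends SparkSqlJob {
  // accept any input for this sketch
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid
  // a SQL job receives a SQLContext, so the context it is submitted to
  // must have been created as a SQL-capable context
  def runJob(sql: SQLContext, config: Config): Any = sql.sql("SELECT 1").collect()
}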
Re: Connection pooling in spark jobs
Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition:
rdd.mapPartitions { part =>
  val pool = setupPool
  part.map { ... }
}
See "Design Patterns for using foreachRDD" in http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Right, I am aware of how to use connection pooling with Oracle, but the specific question is how to use it in the context of Spark job execution. On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw. On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
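A slightly fuller sketch of that pattern, with Commons DBCP standing in as the pool (the JDBC URL, credentials, and the rdd variable are placeholders, not from the thread). Holding the pool in an object means each executor JVM builds it at most once, lazily:

import java.sql.Connection
import org.apache.commons.dbcp.BasicDataSource

object ConnectionPool {
  // created lazily, once per executor JVM
  lazy val dataSource: BasicDataSource = {
    val ds = new BasicDataSource()
    ds.setUrl("jdbc:oracle:thin:@//dbhost:1521/SERVICE") // placeholder URL
    ds.setUsername("user")                               // placeholder
    ds.setPassword("secret")                             // placeholder
    ds.setMaxActive(10)                                  // cap concurrent connections per executor
    ds
  }
}

rdd.foreachPartition { partition =>
  val conn: Connection = ConnectionPool.dataSource.getConnection()
  try {
    partition.foreach { row =>
      // read or write the row using conn
    }
  } finally {
    conn.close() // returns the connection to the pool, not the socket
  }
}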
A problem with Spark 1.3 artifacts
A very simple example which works well with Spark 1.2 but fails to compile with Spark 1.3:
build.sbt:
name := "untitled"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
Test.scala:
package org.apache.spark.metrics
import org.apache.spark.SparkEnv
class Test {
  SparkEnv.get.metricsSystem.report()
}
Produces:
Error:scalac: bad symbolic reference. A signature in MetricsSystem.class refers to term eclipse in package org which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling MetricsSystem.class.
Error:scalac: bad symbolic reference. A signature in MetricsSystem.class refers to term jetty in value org.eclipse which is not available. It may be completely missing from the current classpath, or the version on the classpath might be incompatible with the version used when compiling MetricsSystem.class.
This looks like something wrong with shading Jetty. MetricsSystem references MetricsServlet, which references some classes from Jetty in the original package instead of the shaded one. I'm not sure, but adding the following dependencies solves the problem:
libraryDependencies += "org.eclipse.jetty" % "jetty-server" % "8.1.14.v20131031"
libraryDependencies += "org.eclipse.jetty" % "jetty-servlet" % "8.1.14.v20131031"
Is this intended, or is it a bug? Thanks! Jacek
Re: How to learn Spark ?
Yes, I just searched for it! Best Regards, Star Guo == You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all. I am new here. Could you give me some suggestions for learning Spark? Thanks. Best Regards, Star Guo
Spark + Kinesis
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt?
---
import AssemblyKeys._

name := "Kinesis Consumer"
version := "1.0"
organization := "com.myconsumer"
scalaVersion := "2.11.5"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.3.0" % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % "1.3.0",
  "org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.3.0")

assemblySettings
jarName in assembly := "consumer-assembly.jar"
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala=false)

In project/assembly.sbt I have only the following line: addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0") I am using sbt 0.13.7. I adapted Example 7.7 in the Learning Spark book. Thanks, Vadim