Graphx CompactBuffer help

2015-08-28 Thread smagadi
After running the code below, val coonn=graph.connectedComponents.vertices.map(_.swap).groupByKey.map(_._2).collect(); coonn.foreach(println), I get an array of CompactBuffers: CompactBuffer(13) CompactBuffer(14) CompactBuffer(4, 11, 1, 6, 3, 7, 9, 8, 10, 5, 2) CompactBuffer(15, 12). Now I want to
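
A hedged sketch of the same grouping step, with the filter suggested later in this thread (Robineast's reply) appended; `graph` is the poster's existing GraphX graph and the vertex ids 1 and 15 are taken from that reply:

```scala
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

val components: RDD[Iterable[VertexId]] = graph.connectedComponents()
  .vertices        // (vertexId, componentId)
  .map(_.swap)     // (componentId, vertexId)
  .groupByKey()    // componentId -> all member vertex ids
  .values

// e.g. keep only the component(s) containing vertex 1 or 15
val wanted = components.filter(_.exists(Seq(1L, 15L).contains))
```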

Re: How to increase the Json parsing speed

2015-08-28 Thread Sabarish Sasidharan
How many executors are you using when using Spark SQL? On Fri, Aug 28, 2015 at 12:12 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: I see that you are not reusing the same mapper instance in the Scala snippet. Regards Sab On Fri, Aug 28, 2015 at 9:38 AM, Gavin Yue

Re: How to increase the Json parsing speed

2015-08-28 Thread Ewan Higgs
Hi Gavin, You can increase the speed by choosing a better encoding. A little bit of ETL goes a long way. For example, as you're working with Spark SQL you probably have a tabular format, so you could use CSV so you don't need to parse the field names on each entry (and it will also reduce the file

Re: How to increase the Json parsing speed

2015-08-28 Thread Gavin Yue
500 each with 8GB memory. I did the test again on the cluster. I have 6000 files which generate 6000 tasks. Each task takes 1.5 min to finish based on the stats, so theoretically it should take roughly 15 mins. With some additional overhead, it takes 18 mins in total. Based on the local file

Re: RDD from partitions

2015-08-28 Thread Rishitesh Mishra
Hi Jem, A simple way to get this is to use MapPartitionsRDD. Please see the code below. For this you need to know the partition numbers of your parent RDD that you want to exclude. One drawback here is that the new RDD will also invoke a similar number of tasks as the parent RDD, as both RDDs have the same
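
The code from this reply was truncated here; below is a minimal sketch of the same idea using mapPartitionsWithIndex, where `parentRdd` and the `excluded` partition indices are assumptions:

```scala
// Partitions whose indices are in `excluded` are emptied out; note a task is
// still scheduled for every parent partition, which is the drawback noted above.
val excluded = Set(0, 3)
val filtered = parentRdd.mapPartitionsWithIndex { (idx, iter) =>
  if (excluded.contains(idx)) Iterator.empty else iter
}
```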

Re: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread our...@cnsuning.com
Terry: Unfortunately, the error remains when I use your advice, but it has changed; now it is java.lang.ArrayIndexOutOfBoundsException: 71. The error log is as follows: 15/08/28 19:13:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool

How to determine a good set of parameters for a ML grid search task?

2015-08-28 Thread Adamantios Corais
I have a sparse dataset of size 775946 x 845372. I would like to perform a grid search in order to tune the parameters of my LogisticRegressionWithSGD model. I have noticed that the building of each model takes about 300 to 400 seconds. That means that in order to try all possible combinations of
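
Since the question is about tuning LogisticRegressionWithSGD, here is a minimal sketch of a manual grid search. The parameter values, the `training`/`validation` RDD[LabeledPoint] splits, and the accuracy metric are illustrative assumptions, not the poster's setup:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def gridSearch(training: RDD[LabeledPoint],
               validation: RDD[LabeledPoint]): ((Int, Double), Double) = {
  val results = for {
    numIterations <- Seq(50, 100)    // example values only
    stepSize      <- Seq(0.1, 1.0)
  } yield {
    val model = LogisticRegressionWithSGD.train(training, numIterations, stepSize, 1.0)
    val accuracy = validation
      .map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0)
      .mean()
    ((numIterations, stepSize), accuracy)
  }
  results.maxBy(_._2)   // best (parameters, accuracy) pair
}
```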

Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread our...@cnsuning.com
Hi all, when using Spark SQL, a problem is bothering me. The code is as follows: val schemaString =

Re: Graphx CompactBuffer help

2015-08-28 Thread Robineast
My previous reply got mangled. This should work: coon.filter(x => x.exists(el => Seq(1,15).contains(el))) CompactBuffer is a specialised form of a Scala Iterator --- Robin East Spark GraphX in Action Michael Malak and

Alternative to Large Broadcast Variables

2015-08-28 Thread Hemminger Jeff
Hi, I am working on a Spark application that is using a large (~3GB) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner, so this large variable is broadcast many times during the lifetime of the application process. From what I

Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread Terry Hole
Ricky, You may need to use map instead of flatMap in your case: val rowRDD = sc.textFile("/user/spark/short_model").map(_.split("\\t")).map(p => Row(...)) Thanks! -Terry On Fri, Aug 28, 2015 at 5:08 PM, our...@cnsuning.com our...@cnsuning.com wrote: hi all, when using spark sql ,A problem

Re: Spark driver locality

2015-08-28 Thread Rishitesh Mishra
Hi Swapnil, 1. All the task scheduling/retry happens from the driver, so you are right that a lot of communication happens from the driver to the cluster. It all depends on how you want to go about your Spark application, whether your application has direct access to the Spark cluster or it's routed through a

RE: How to increase the Json parsing speed

2015-08-28 Thread Ewan Leith
Can you post roughly what you’re running as your Spark code? One issue I’ve seen before is that passing a directory full of files as a path “/path/to/files/” can be slow, while “/path/to/files/*” runs fast. Also, if you’ve not seen it, have a look at the binaryFiles call

Spark Version upgrade issue: Exception in thread main java.lang.NoSuchMethodError

2015-08-28 Thread Manohar753
Hi Team, I upgraded from an older Spark version to 1.4.1. After the Maven build I tried to run my simple application, but it failed, giving the stack trace below. Exception in thread main java.lang.NoSuchMethodError:

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Asher Krim
Yes, absolutely. Take a look at: https://spark.apache.org/docs/1.4.1/mllib-statistics.html#summary-statistics On Fri, Aug 28, 2015 at 8:39 AM, ashensw as...@wso2.com wrote: Hi all, I have a dataset which consist of large number of features(columns). It is in csv format. So I loaded it into a
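
For reference, a minimal sketch of the summary-statistics API linked above; `rows: RDD[Array[Double]]` (the already-parsed CSV rows) is an assumption:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val vectors = rows.map(arr => Vectors.dense(arr))
val summary = Statistics.colStats(vectors)
println(summary.min)   // column-wise minimums
println(summary.max)   // column-wise maximums
println(summary.mean)  // column-wise means
```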

RE: Feedback: Feature request

2015-08-28 Thread Murphy, James
This is great and much appreciated. Thank you. - Jim From: Manish Amde [mailto:manish...@gmail.com] Sent: Friday, August 28, 2015 9:20 AM To: Cody Koeninger Cc: Murphy, James; user@spark.apache.org; d...@spark.apache.org Subject: Re: Feedback: Feature request Sounds good. It's a request I have

Re: Feedback: Feature request

2015-08-28 Thread Cody Koeninger
I wrote some code for this a while back, pretty sure it didn't need access to anything private in the decision tree / random forest model. If people want it added to the api I can put together a PR. I think it's important to have separately parseable operators / operands though. E.g

Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past and have needed it personally. May be Joseph Bradley has something to add. I think a JIRA to capture this will be great. We can move this discussion to the JIRA then. On Friday, August 28, 2015, Cody Koeninger c...@koeninger.org

Re: Spark driver locality

2015-08-28 Thread Swapnil Shinde
Thanks.. On Aug 28, 2015 4:55 AM, Rishitesh Mishra rishi80.mis...@gmail.com wrote: Hi Swapnil, 1. All the task scheduling/retry happens from Driver. So you are right that a lot of communication happens from driver to cluster. It all depends on the how you want to go about your Spark

Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread ashensw
Hi all, I have a dataset which consists of a large number of features (columns). It is in CSV format, so I loaded it into a Spark DataFrame. Then I converted it into a JavaRDD<Row>. Then using a Spark transformation I converted that into a JavaRDD<String[]>, and then again into a JavaRDD<double[]>.

Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Jaonary Rabarisoa
Hi there, The current API of ml.Transformer uses only DataFrame as input. I have a use case where I need to transform a single element, for example transforming an element from spark-streaming. Is there any reason for this, or will ml.Transformer support transforming a single element later?

How to compute the probability of each class in Naive Bayes

2015-08-28 Thread Adamantios Corais
Hi, I am trying to change the following code so as to get the probabilities of the input Vector on each class (instead of the class itself with the highest probability). I know that this is already available as part of the most recent release of Spark but I have to use Spark 1.1.0. Any help is
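
A hedged sketch of one workaround for Spark 1.1.0, assuming the trained NaiveBayesModel's labels, pi (log class priors) and theta (log conditional probabilities) fields are accessible in that version; it computes the per-class posterior for a single input vector:

```scala
import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

def classProbabilities(model: NaiveBayesModel, x: Vector): Array[(Double, Double)] = {
  // log-score per class: log prior + sum_j x_j * log P(feature_j | class)
  val logScores = model.labels.indices.map { i =>
    model.pi(i) + x.toArray.zip(model.theta(i)).map { case (xj, tj) => xj * tj }.sum
  }
  // normalize in log space to avoid underflow
  val maxLog = logScores.max
  val exps   = logScores.map(s => math.exp(s - maxLog))
  val total  = exps.sum
  model.labels.zip(exps.map(_ / total))   // (label, probability) pairs
}
```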

Re: Why transformer from ml.Pipeline transform only a DataFrame ?

2015-08-28 Thread Xiangrui Meng
Yes, we will open up APIs in the next release. There was some discussion about the APIs. One approach is to have multiple methods for different outputs, like predicted class and probabilities. -Xiangrui On Aug 28, 2015 6:39 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi there, The actual API of

Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Darksu
Hello, this is my first post, so I would like to congratulate the Spark team on the great work! In short, I have been studying Spark for the past week in order to create a feasibility project. The main goal of the project is to process text documents (word count will not be over 200 words) in

RE: correct use of DStream foreachRDD

2015-08-28 Thread Ewan Leith
I think what you’ll want is to carry out the .map functions before the foreachRDD, something like: val lines = ssc.textFileStream("/stream").map(Sensor.parseSensor).map(Sensor.convertToPut) lines.foreachRDD { rdd => // parse the line of data into sensor object

Re: correct use of DStream foreachRDD

2015-08-28 Thread Sean Owen
Yes, for example val sensorRDD = rdd.map(Sensor.parseSensor) is a line of code executed on the driver; it's part of the function you supplied to foreachRDD. However that line defines an operation on an RDD, and the map function you supplied (parseSensor) will ultimately be carried out on the cluster.

correct use of DStream foreachRDD

2015-08-28 Thread Carol McDonald
I would like to make sure that I am using the DStream foreachRDD operation correctly. I would like to read from a DStream, transform the input and write to HBase. The code below works, but I became confused when I read "Note that the function *func* is executed in the driver process"? val

Kmeans issues and hierarchical clustering

2015-08-28 Thread Robust_spark
Dear All, I am trying to cluster 350k English text phrases (each with 4-20 words) into 50k clusters with KMeans on a standalone system (8 cores, 16 GB). I am using the Kryo serializer with MEMORY_AND_DISK_SER set. Although I get clustering results with a lower number of features in HashingTF, the
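
A minimal sketch of the pipeline being described (HashingTF features fed into MLlib KMeans); `phrases: RDD[Seq[String]]`, the feature count and the iteration count are assumptions, not the poster's values:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.storage.StorageLevel

val tf = new HashingTF(1 << 18)                 // fewer features = smaller vectors
val vectors = phrases.map(p => tf.transform(p)) // tokenized phrase -> sparse TF vector
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
val model = KMeans.train(vectors, 50000, 20)    // k = 50k clusters, 20 iterations
```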

Re: RDD from partitions

2015-08-28 Thread Jem Tucker
Hey Rishitesh, That's perfect, thanks so much! Don't know why I didn't think of using mapPartitions like this. Thanks, Jem On Fri, Aug 28, 2015 at 10:35 AM Rishitesh Mishra rishi80.mis...@gmail.com wrote: Hi Jem, A simple way to get this is to use MapPartitionedRDD. Please see the below code.

how to register CompactBuffer in Kryo

2015-08-28 Thread donhoff_h
Hi all, I wrote a Spark program which uses Kryo serialization. When I count an RDD whose type is RDD[(String,String)], it reported an exception like the following: * Class is not registered: org.apache.spark.util.collection.CompactBuffer[] * Note: To register this class use:
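
A hedged sketch of one way to register the class by name: CompactBuffer is private[spark], so Class.forName is used, and registerKryoClasses assumes Spark 1.2 or later.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.util.collection.CompactBuffer"),
    // the exception mentions CompactBuffer[], i.e. the array class
    Class.forName("[Lorg.apache.spark.util.collection.CompactBuffer;")
  ))
```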

Re: How to send RDD result to REST API?

2015-08-28 Thread Ted Yu
What format does your REST server expect ? You may have seen this: https://www.paypal-engineering.com/2014/02/13/hello-newman-a-rest-client-for-scala/ On Fri, Aug 28, 2015 at 9:35 PM, Cassa L lcas...@gmail.com wrote: Hi, If I have RDD that counts something e.g.: JavaPairDStreamString,

How to send RDD result to REST API?

2015-08-28 Thread Cassa L
Hi, If I have an RDD that counts something, e.g.: JavaPairDStream<String, Integer> successMsgCounts = successMsgs .flatMap(buffer -> Arrays.asList(buffer.getType())) .mapToPair(txnType -> new Tuple2<String, Integer>("Success" + txnType, 1))

Re: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread ai he
Hi Ricky, In your first try, you are using flatMap. It will give you a flat list of strings. Then you are trying to map each string to a Row, which definitely throws an exception. Following Terry's idea, you are mapping the input to a list of arrays, each of which contains some strings. Then you

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Burak Yavuz
Or you can just call describe() on the dataframe? In addition to min-max, you'll also get the mean, and count of non-null and non-NA elements as well. Burak On Fri, Aug 28, 2015 at 10:09 AM, java8964 java8...@hotmail.com wrote: Or RDD.max() and RDD.min() won't work for you? Yong

Re: how to register CompactBuffer in Kryo

2015-08-28 Thread Ted Yu
For the exception w.r.t. ManifestFactory , there is SPARK-6497 which is Open. FYI On Fri, Aug 28, 2015 at 8:25 AM, donhoff_h 165612...@qq.com wrote: Hi, all I wrote a spark program which uses the Kryo serialization. When I count a rdd which type is RDD[(String,String)], it reported an

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Jesse F Chen
If you already loaded csv data into a dataframe, why not register it as a table, and use Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name) FROM dftable_name ... seems natural.
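
A minimal sketch of that route, assuming `df` is the DataFrame loaded from the CSV and `sqlContext` is the usual SQLContext; the table and column names are placeholders:

```scala
df.registerTempTable("dftable_name")
val minMax = sqlContext.sql(
  "SELECT MIN(column_name) AS min_val, MAX(column_name) AS max_val FROM dftable_name")
minMax.show()
```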

Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Ritesh Kumar Singh
Load the textFile as an RDD. Something like this: val file = sc.textFile("/path/to/file") After this you can manipulate this RDD to filter texts the way you want them: val a1 = file.filter(line => line.contains("[ERROR]")) val a2 = file.filter(line => line.contains("[WARN]")) val a3 =

Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Jörn Franke
I think there is already an example for this shipped with Spark. However, you do not really benefit from any Spark functionality for this scenario. If you want to do something more advanced you should look at Elasticsearch or Solr. On Fri, Aug 28, 2015 at 16:15, Darksu nick_tou...@hotmail.com

Re: correct use of DStream foreachRDD

2015-08-28 Thread Carol McDonald
Thanks, this looks better: // parse the lines of data into sensor objects val sensorDStream = ssc.textFileStream("/stream").map(Sensor.parseSensor) sensorDStream.foreachRDD { rdd => // filter sensor data for low psi val alertRDD = rdd.filter(sensor => sensor.psi < 5.0)
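
One follow-up pattern for the write side (not code from this thread): do per-partition setup inside foreachPartition so connections are created on the executors, once per partition. `HBaseWriter` and its methods are hypothetical placeholders; Sensor.convertToPut is the helper mentioned earlier in the thread.

```scala
sensorDStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val writer = new HBaseWriter()   // hypothetical: opens a connection on the executor
    records.foreach(sensor => writer.put(Sensor.convertToPut(sensor)))
    writer.close()
  }
}
```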

RE: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread java8964
Or RDD.max() and RDD.min() won't work for you? Yong Subject: Re: Calculating Min and Max Values using Spark Transformations? To: as...@wso2.com CC: user@spark.apache.org From: jfc...@us.ibm.com Date: Fri, 28 Aug 2015 09:28:43 -0700 If you already loaded csv data into a dataframe, why not

Dynamic lookup table

2015-08-28 Thread N B
Hi all, I have the following use case that I wanted to get some insight on how to go about doing in Spark Streaming. Every batch is processed through the pipeline and at the end, it has to update some statistics information. This updated info should be reusable in the next batch of this DStream

Re: Any quick method to sample rdd based on one field?

2015-08-28 Thread Sonal Goyal
Filter into a true RDD and a false RDD. Union the true RDD with a sample of the false RDD. On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote: Hey, I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and randomly keep some Boolean: false rows. And hope in the final result,
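
A minimal sketch of that suggestion for an `rdd: RDD[(String, Boolean)]`; the 10% sampling fraction is an arbitrary example:

```scala
val trueRows  = rdd.filter(_._2)
val falseRows = rdd.filter(!_._2)
val result    = trueRows.union(falseRows.sample(withReplacement = false, fraction = 0.1))
```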

RE: How to avoid shuffle errors for a large join ?

2015-08-28 Thread java8964
There are several possibilities here. 1) Keep in mind that 7GB of data will need far more than 7GB of heap, as deserialized Java objects need much more space than the raw data. A rule of thumb is to multiply by 6 to 8, so 7GB of data needs about 50GB of heap space. 2) You should monitor the Spark UI to check how many

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
I'm curious where the factor of 6-8 comes from ? Is this assuming snappy (or lzf) compression ? The sizes I mentioned are what the Spark UI reports, not sure if those are before or after compression (for the shuffle read/write). On Fri, Aug 28, 2015 at 4:41 PM, java8964 java8...@hotmail.com

Re: Dynamic lookup table

2015-08-28 Thread Jason
Hi Nikunj, Depending on what kind of stats you want to accumulate, you may want to look into the Accumulator/Accumulable API, or if you need more control, you can store these things in an external key-value store (HBase, redis, etc..) and do careful updates there. Though be careful and make sure

Re: Getting number of physical machines in Spark

2015-08-28 Thread Alexey Grishchenko
There's no canonical way to do this as far as I understand. For instance, when running under YARN, you have no idea where your containers will be started. Moreover, if one of the containers fails, it might be restarted on another machine, so the machine count might change at runtime. To
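
For what it's worth, one heuristic (not canonical, as the reply says, and the set can change at runtime) is to look at the hosts of the currently registered block managers:

```scala
// getExecutorMemoryStatus is keyed by "host:port" and also includes the driver
val hosts = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSet
println(s"Machines currently hosting executors/driver: ${hosts.size}")
```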

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
Ahh yes, thanks for mentioning data skew, I've run into that before as well. The best way there is to get statistics on the distribution of your join key. If there are a few key values with a drastically larger number of records, then a reducer task will always be swamped no matter how many reducer-side

Re: Dynamic lookup table

2015-08-28 Thread N B
Hi Jason, Thanks for the response. I believe I can look into a Redis based solution for storing this state externally. However, would it be possible to refresh this from the store with every batch i.e. what code can be written inside the pipeline to fetch this info from the external store? Also,

Re: Help Explain Tasks in WebUI:4040

2015-08-28 Thread Alexey Grishchenko
It really depends on the code. I would say that the easiest way is to restart the problematic action, find the straggler task and analyze what's happening with it with jstack, or make a heap dump and analyze it locally. For example, there might be a case where your tasks are connecting to some external

Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Alexey Grishchenko
In my opinion aggregate + flatMap would work faster, as it would make fewer passes through the data. It would work like this:

import random

def agg(x, y):
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

# Source data
rdd = sc.parallelize(xrange(10), 5)
rdd2 =

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Thomas Dudziak
Yeah, I tried with 10k and 30k and these still failed, will try with more then. Though that is a little disappointing, it only writes ~7TB of shuffle data which shouldn't in theory require more than 1000 reducers on my 10TB memory cluster (~7GB of spill per reducer). I'm now wondering if my

Re: query avro hive table in spark sql

2015-08-28 Thread Giri P
Any idea what's causing this error? 15/08/28 21:03:03 WARN scheduler.TaskSetManager: Lost task 34.0 in stage 9.0 (TID 20, dtord01hdw0228p.dc.dotomi.net): java.lang.RuntimeException: cannot find field message_campaign_id from [0:error_error_error_error_error_error_error, 1:cannot_determine_schema,

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Alexey Grishchenko
If the data is already in an RDD, the easiest way to calculate min/max for each column would be the aggregate() function. It takes two functions as arguments: the first is used to aggregate RDD values into your accumulator, the second is used to merge two accumulators. This way both min and max for all the
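
A minimal sketch of that aggregate() approach, assuming the data is already an RDD[Array[Double]] (`rows`) where every row has `numCols` columns:

```scala
val numCols = 10   // assumed column count
val zero = (Array.fill(numCols)(Double.PositiveInfinity),
            Array.fill(numCols)(Double.NegativeInfinity))

val (mins, maxs) = rows.aggregate(zero)(
  // fold one row into the (mins, maxs) accumulator
  (acc, row) => {
    for (i <- row.indices) {
      acc._1(i) = math.min(acc._1(i), row(i))
      acc._2(i) = math.max(acc._2(i), row(i))
    }
    acc
  },
  // merge two accumulators from different partitions
  (a, b) => {
    for (i <- a._1.indices) {
      a._1(i) = math.min(a._1(i), b._1(i))
      a._2(i) = math.max(a._2(i), b._2(i))
    }
    a
  })
```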

Support for Hive Storage Handle in Spark SQL and core Spark

2015-08-28 Thread Sourav Mazumder
Hi, I have data written to HDFS using a custom storage handler of Hive. Can I access that data in Spark using Spark SQL? For example, can I write Spark SQL to access the data from a Hive table in HDFS which was created as: CREATE TABLE custom_table_1(key int, value string) STORED BY

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Just to confirm, is this what you are referring to? Is there any example of how to set it? I believe it is for the 0.8.3 version? https://cwiki.apache.org/confluence/display/KAFKA/Multiple+Listeners+for+Kafka+Brokers On Fri, Aug 28, 2015 at 12:52 PM, Sriharsha Chintalapani ka...@harsha.io

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody! Thanks for participating in the spark-ec2 survey. The full results are publicly viewable here: https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics The gist of the results is as follows: Most people found spark-ec2 useful as an easy way to

SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Hi, I was going through the SSL setup of Kafka: https://cwiki.apache.org/confluence/display/KAFKA/Deploying+SSL+for+Kafka However, I am also using Spark-Kafka streaming to read data from Kafka. Is there a way to activate SSL for the Spark Streaming API, or is it not possible at all? Thanks, LCassa

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cassa L
Hi, I am using the below Spark jar with the Direct Stream API: spark-streaming-kafka_2.10. When I look at its pom.xml, the Kafka library it is pulling in is <groupId>org.apache.kafka</groupId> <artifactId>kafka_${scala.binary.version}</artifactId> <version>0.8.2.1</version> I believe this

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Cody Koeninger
Yeah, the direct api uses the simple consumer On Fri, Aug 28, 2015 at 1:32 PM, Cassa L lcas...@gmail.com wrote: Hi I am using below Spark jars with Direct Stream API. spark-streaming-kafka_2.10 When I look at its pom.xml, Kafka libraries that its pulling in is

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Ted Yu
+1 on Jason's suggestion. bq. this large variable is broadcast many times during the lifetime Please consider making this large variable more granular. Meaning, reduce the amount of data transferred between the key value store and your app during update. Cheers On Fri, Aug 28, 2015 at 12:44

Re: Getting number of physical machines in Spark

2015-08-28 Thread Jason
I've wanted similar functionality too: when network IO bound (for me I was trying to pull things from S3 to HDFS) I wish there was a `.mapMachines` API where I wouldn't have to guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3

Re: SSL between Kafka and Spark Streaming API

2015-08-28 Thread Sourabh Chandak
Can we use the existing kafka spark streaming jar to connect to a kafka server running in SSL mode? We are fine with non SSL consumer as our kafka cluster and spark cluster are in the same network Thanks, Sourabh On Fri, Aug 28, 2015 at 12:03 PM, Gwen Shapira g...@confluent.io wrote: I can't

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Jason
You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead)? I haven't done this myself though, so I'm just throwing
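
A hedged sketch of that pattern; `records`, `KVClient`, `kvHost` and `lookup` are hypothetical placeholders for whatever key-value store client is used:

```scala
val enriched = records.mapPartitions { iter =>
  val client = new KVClient(kvHost)     // one connection per partition, created on the executor
  iter.map { rec =>
    val value = client.lookup(rec.key)  // remote lookup instead of a huge broadcast
    (rec, value)
  }
  // note: if the client must be closed, do it only after the iterator is fully consumed
}
```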

Re: How to avoid shuffle errors for a large join ?

2015-08-28 Thread Jason
I had similar problems to this (reduce-side failures for large joins (25bn rows with 9bn)), and found the answer was to raise spark.sql.shuffle.partitions further than 1000. In my case, 16k partitions worked for me, but your tables look a little denser, so you may want to go even higher. On Thu, Aug
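
For reference, a minimal way to set that (the value is just an example):

```scala
sqlContext.setConf("spark.sql.shuffle.partitions", "16384")
// or at submit time: --conf spark.sql.shuffle.partitions=16384
```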

Re: TimeoutException on start-slave spark 1.4.0

2015-08-28 Thread Alexander Pivovarov
I have a workaround for the issue. As you can see from the log, it is about 15 seconds between worker start and shutdown. The workaround might be to sleep 30 seconds, check if the worker is running, and if not try start-slave again. Part of the EMR Spark bootstrap Python script: spark_master = spark://...:7077 ...