Re: Why are there different "parts" in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, every batch will create a new directory and put the data in it. If you don't want multiple part- files inside the directory, you can do a repartition before the saveAs* call. messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","cs
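
A minimal Scala sketch of that suggestion, assuming messages is a DStream[(String, String)] as in the thread; each batch still gets its own prefix-TIMESTAMP.csv directory, but with a single part-00000 file inside:

    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits (pre-1.3)

    messages
      .repartition(1)  // one partition per batch => one part- file per output directory
      .saveAsHadoopFiles[TextOutputFormat[String, String]]("hdfs://user/ec2-user/", "csv")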

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Yes, you can try it. For example, if you have a cluster of 10 executors and 60 Kafka partitions, you can try choosing 10 receivers * 2 consumer threads, so each thread will consume 3 partitions ideally; if you increase the threads to 6, each thread will consume 1 partition ideally. What I think imp

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Saisai, If I understand correctly, you are suggesting controlling parallelism by keeping the number of consumers/executors at least 1:1 with the number of kafka partitions. For example, if I have 50 partitions for a kafka topic then either have: - 25 or more executors, 25 receivers, each receiver set to

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think maybe you can try this way: create a Receiver per executor and specify more than one thread per topic, and the total number of consumer threads will be: total consumers = (receiver number) * (thread number); make sure this total is less than or equal to the Kafka partitio
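
A sketch of that sizing, using the 10-executor / 60-partition example from earlier in the thread; ssc, zkQuorum, groupId and topic are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10        // e.g. one receiver per executor
    val threadsPerReceiver = 2   // consumer threads inside each receiver
    // total consumers = 10 * 2 = 20, kept <= 60 Kafka partitions
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> threadsPerReceiver))
    }
    val unified = ssc.union(streams)  // combine into one DStream before processing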

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
I replaced the writeToKafka statements with an rdd.count() and sure enough, I have a stable app with total delay well within the batch window (20 seconds). Here are the total delay lines from the driver log: 15/02/13 06:14:26 INFO JobScheduler: Total delay: 6.521 s for time 142380806 ms (execution

How to sum up the values in the columns of a dataset in Scala?

2015-02-12 Thread Carter
I am new to Scala. I have a dataset with many columns, each column has a column name. Given several column names (these column names are not fixed, they are generated dynamically), I need to sum up the values of these columns. Is there an efficient way of doing this? I worked out a way by using f
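
One minimal sketch of such a dynamic sum, assuming the data can be read as an RDD of numeric arrays plus a known header (names and values below are illustrative):

    val header = Array("a", "b", "c", "d")
    val wanted = Seq("b", "d")               // column names generated at runtime
    val idx = wanted.map(header.indexOf)     // positions of the wanted columns

    val data = sc.parallelize(Seq(
      Array(1.0, 2.0, 3.0, 4.0),
      Array(5.0, 6.0, 7.0, 8.0)))

    // Per row keep only the wanted positions, then sum element-wise in one pass.
    val sums = data
      .map(row => idx.map(i => row(i)).toArray)
      .reduce((x, y) => x.zip(y).map { case (a, b) => a + b })
    // sums == Array(8.0, 12.0), the totals for columns "b" and "d"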

Re: Master dies after program finishes normally

2015-02-12 Thread Akhil Das
Increasing your driver memory might help. Thanks Best Regards On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar wrote: > Hi, > I have a Hidden Markov Model running with 200MB data. > Once the program finishes (i.e. all stages/jobs are done) the program > hangs for 20 minutes or so before killing ma

Why are there different "parts" in my CSV?

2015-02-12 Thread Su She
Hello Everyone, I am writing simple word counts to hdfs using messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class, String.class, (Class) TextOutputFormat.class); 1) However, every 2 seconds I get a new *directory* that is titled as a csv. So I'll have test.csv, which will be

An interesting and serious problem I encountered

2015-02-12 Thread Landmark
Hi folks, My Spark cluster has 8 machines, each of which has 377GB physical memory, and thus the total maximum memory that can be used for Spark is more than 2400GB. In my program, I have to deal with 1 billion (key, value) pairs, where the key is an integer and the value is an integer array with 43

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
> > I haven't been paying close attention to the JIRA tickets for > PrunedFilteredScan but I noticed some weird behavior around the filters > being applied when OR expressions were used in the WHERE clause. From what > I was seeing, it looks like it could be possible that the "start" and "end" > ra

Re: 8080 port password protection

2015-02-12 Thread Akhil Das
Just to add to what Arush said, you can go through these links: http://stackoverflow.com/questions/1162375/apache-port-proxy http://serverfault.com/questions/153229/password-protect-and-serve-apache-site-by-port-number Thanks Best Regards On Thu, Feb 12, 2015 at 10:43 PM, Arush Kharbanda < ar..

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Su She
Thanks Kevin for the link. I have had issues trying to install zeppelin, as I believe it is not yet supported for CDH 5.3 and Spark 1.2. Please correct me if I am mistaken. On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim wrote: > Apache Zeppelin also has a scheduler and then you can reload

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thanks for the reply. I solved the problem by importing org.apache.spark.SparkContext._ per Imran Rashid's suggestion. For the sake of interest, is JavaPairRDD intended for use from Java? What is the purpose of this class? Is my rdd implicitly converted to it in some circumstances? 2015-02-12 19:42 GM

Re: Streaming scheduling delay

2015-02-12 Thread Saisai Shao
Hi Tim, I think this code will still introduce shuffle even when you call repartition on each input stream. Actually this style of implementation will generate more jobs (one job per input stream) than unioning into one stream with DStream.union(), and union normally has no special overhead as

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Felix C
Your earlier call stack clearly states that it fails because the Derby metastore has already been started by another instance, so I think that is explained by your attempt to run this concurrently. Are you running Spark standalone? Do you have a cluster? You should be able to run spark in yarn-

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Thank you. That worked. 2015-02-12 20:03 GMT+04:00 Imran Rashid : > You need to import the implicit conversions to PairRDDFunctions with > > import org.apache.spark.SparkContext._ > > (note that this requirement will go away in 1.3: > https://issues.apache.org/jira/browse/SPARK-4397) > > On Thu,

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Ok. I just verified that this is the case with a little test: WHERE (a = 'v1' and b = 'v2') -> PrunedFilteredScan passes down 2 filters; WHERE (a = 'v1' and b = 'v2') or (a = 'v3') -> PrunedFilteredScan passes down 0 filters. On Fri, Feb 13, 2015 at 12:28 AM, Corey Nolet wrote: > Michael, > > I haven

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Michael, I haven't been paying close attention to the JIRA tickets for PrunedFilteredScan but I noticed some weird behavior around the filters being applied when OR expressions were used in the WHERE clause. From what I was seeing, it looks like it could be possible that the "start" and "end" rang

Re: Streaming scheduling delay

2015-02-12 Thread Cody Koeninger
outdata.foreachRDD( rdd => rdd.foreachPartition(rec => { val writer = new KafkaOutputService(otherConf("kafkaProducerTopic").toString, propsMap); writer.output(rec) }) ) So this is creating a new kafka producer for every n
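
A sketch of one common fix (not the poster's actual code): hold the producer in a singleton so it is created once per executor JVM and reused across partitions and batches. KafkaOutputService, otherConf and propsMap are the thread's own objects, assumed reachable when the singleton initializes on the executors:

    object WriterHolder {
      // initialized lazily, once per JVM, then shared by all partitions and batches
      lazy val writer =
        new KafkaOutputService(otherConf("kafkaProducerTopic").toString, propsMap)
    }

    outdata.foreachRDD { rdd =>
      rdd.foreachPartition { rec =>        // rec is the partition's record iterator
        WriterHolder.writer.output(rec)    // no per-partition producer construction
      }
    }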

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
1. Can you try count()? Take often does not force the entire computation. 2. Can you give the full log? From the log it seems that the blocks are added to two nodes but the tasks seem to be launched to different nodes. I don't see any message removing the blocks, so I need the whole log to debug this.

Re: Using Spark SQL for temporal data

2015-02-12 Thread Michael Armbrust
Hi Corey, I would not recommend using CatalystScan for this. It's lower level, and not stable across releases. You should be able to do what you want with PrunedFilteredScan
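
A skeleton of that shape under the 1.2-era data sources API; the Accumulo-specific scan logic from the thread is elided as ???:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    class TemporalRelation(val sqlContext: SQLContext) extends PrunedFilteredScan {

      override def schema = ???   // StructType describing the temporal table

      // Supported predicates such as GreaterThan("start", ...) arrive pre-parsed
      // in `filters`; anything not handled here is re-evaluated by Spark. Per
      // Corey's test above, OR-ed predicates may arrive as zero filters.
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = ???
    }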

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
1) Yes, if I disable writing out to kafka and replace it with some very lightweight action like rdd.take(1), the app is stable. 2) The partitions I spoke of in the previous mail are the number of partitions I create from each dStream. But yes, since I do processing and writing out, per partition, e

Re: Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Ted Yu
dev@spark is active. e.g. see: http://search-hadoop.com/m/JW1q5zQ1Xw/renaming+SchemaRDD+-%253E+DataFrame&subj=renaming+SchemaRDD+gt+DataFrame Cheers On Thu, Feb 12, 2015 at 8:09 PM, Manoj Samel wrote: > d...@spark.apache.org > mentio

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current implementation

Re: streaming joining multiple streams

2015-02-12 Thread Tathagata Das
Sorry for the late response. With the amount of data you are planning to join, any system would take time. However, between Hive's MapReduce joins, Spark's basic shuffle, and Spark SQL's join, the latter wins hands down. Furthermore, with the APIs of Spark and Spark Streaming, you will have to do s

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-12 Thread Tathagata Das
Can you give me the whole logs? TD On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg wrote: > OK that worked and getting close here ... the job ran successfully for a > bit and I got output for the first couple buckets before getting a > "java.lang.Exception: Could not compute split, block input-0-14

Re: HiveContext in SparkSQL - concurrency issues

2015-02-12 Thread Harika
Hi, I've been reading about Spark SQL and people suggest that using HiveContext is better. So can anyone please suggest a solution to the above problem. This is stopping me from moving forward with HiveContext. Thanks Harika -- View this message in context: http://apache-spark-user-list.10015

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
Hey Tim, Let me get the key points. 1. If you are not writing back to Kafka, the delay is stable? That is, instead of "foreachRDD { // write to kafka }" if you do "dstream.count", then the delay is stable. Right? 2. If so, then Kafka is the bottleneck. Is the number of partitions, that you spoke

Re: Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-12 Thread Tathagata Das
Could you come up with a minimal example through which I can reproduce the problem? On Tue, Feb 10, 2015 at 12:30 PM, conor wrote: > I am getting the following error when I kill the spark driver and restart > the job: > > 15/02/10 17:31:05 INFO CheckpointReader: Attempting to load checkpoint fro

Is there a separate mailing list for Spark Developers ?

2015-02-12 Thread Manoj Samel
d...@spark.apache.org mentioned on http://spark.apache.org/community.html seems to be bouncing. Is there another one?

Re: Extract hour from Timestamp in Spark SQL

2015-02-12 Thread Michael Armbrust
This looks like your executors aren't running a version of spark with hive support compiled in. On Feb 12, 2015 7:31 PM, "Wush Wu" wrote: > Dear Michael, > > After use the org.apache.spark.sql.hive.HiveContext, the Exception: > "java.util. > NoSuchElementException: key not found: hour" is gone du

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Kevin (Sangwoo) Kim
Apache Zeppelin also has a scheduler, so you can reload your chart periodically. Check it out: http://zeppelin.incubator.apache.org/docs/tutorial/tutorial.html On Fri Feb 13 2015 at 7:29:00 AM Silvio Fiorito < silvio.fior...@granturing.com> wrote: > One method I’ve used is to publish ea

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
Hi Gerard, Great write-up and really good guidance in there. I have to be honest, I don't know why but setting # of partitions for each dStream to a low number (5-10) just causes the app to choke/crash. Setting it to 20 gets the app going but with not so great delays. Bump it up to 30 and I start

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread lihu
Thanks very much, you're right. I called sc.stop() before the thread pool was shut down. On Fri, Feb 13, 2015 at 7:04 AM, Michael Armbrust wrote: > It looks to me like perhaps your SparkContext has shut down due to too > many failures. I'd look in the logs of your executors for more informatio
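
A sketch of the corrected ordering, assuming a plain java.util.concurrent pool is used to submit the concurrent SQL queries:

    import java.util.concurrent.{Executors, TimeUnit}

    val pool = Executors.newFixedThreadPool(8)
    // ... submit the Spark SQL query jobs to `pool` ...
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)  // drain in-flight queries first
    sc.stop()                                    // only then stop the SparkContext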

Re: exception with json4s render

2015-02-12 Thread Mohnish Kodnani
Any ideas on how to figure out what is going on when using json4s 3.2.11? I have a need to use 3.2.11, and just to see if things work I downgraded to 3.2.10 and things started working. On Wed, Feb 11, 2015 at 11:45 AM, Charles Feduke wrote: > I was having a similar problem to this trying to

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
Thanks Michael. I will give it a try. On Thu, Feb 12, 2015 at 6:00 PM, Michael Armbrust wrote: > You can start a JDBC server with an existing context. See my answer here: > http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html > > On Thu, Feb

RE: spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
OK. I think I have to use "None" instead of null; then it works. Still switching from Java. I can also just use the field name as I assumed. Great experience. From: java8...@hotmail.com To: user@spark.apache.org Subject: spark left outer join with java.lang.UnsupportedOperationException: empty

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
The more I'm thinking about this- I may try this instead: val myChunkedRDD: RDD[List[Event]] = inputRDD.mapPartitions(_.grouped(300).toList) I wonder if this would work. I'll try it when I get back to work tomorrow. Yuyhao, I tried your approach too but it seems to be somehow moving all the da
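
A sketch of a lazy variant (note mapPartitions expects an Iterator back, so the .toList above would not typecheck as written); inputRDD and myFunction: List[Event] => TraversableOnce[...] are the poster's:

    // chunk each partition into lists of up to 300 events, lazily
    val myChunkedRDD = inputRDD.mapPartitions(_.grouped(300).map(_.toList))

    // or chunk and process in one pass
    val processed = inputRDD.mapPartitions { itr =>
      itr.grouped(300).map(_.toList).flatMap(myFunction)
    }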

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Many thanks! On Thu, Feb 12, 2015 at 3:31 PM, Sean Owen wrote: > This all describes how the implementation operates, logically. The > matrix P is never formed, for sure, certainly not by the caller. > > The implementation actually extends to handle negative values in R too > but it's all taken c

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
This all describes how the implementation operates, logically. The matrix P is never formed, for sure, certainly not by the caller. The implementation actually extends to handle negative values in R too but it's all taken care of by the implementation. On Thu, Feb 12, 2015 at 11:29 PM, Crystal Xi

Re: Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi Sean, I am reading the implicit training paper, Collaborative Filtering for Implicit Feedback Datasets. It mentions "To this end, let us introduce a set of binary variables p_ui, which indicates the preference of user u to item i. Th

RE: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Ganelin, Ilya
Hi all - I've spent a while playing with this. Two significant sources of speed-up that I've achieved are: 1) manually multiplying the feature vectors and caching either the user or product vector; 2) by doing so, if one of the RDDs is global it becomes possible to parallelize this step by run

Re: Question about mllib als's implicit training

2015-02-12 Thread Sean Owen
Where there is no user-item interaction, you provide no interaction, not an interaction with strength 0. Otherwise your input is fully dense. On Thu, Feb 12, 2015 at 11:09 PM, Crystal Xing wrote: > Hi, > > I have some implicit rating data, such as the purchasing data. I read the > paper about th
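
A minimal sketch of that input shape with MLlib's implicit-feedback trainer; the toy values and the rank/iterations/lambda/alpha settings are arbitrary:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Only observed interactions appear; absent user-product pairs are simply
    // not in the RDD rather than being encoded as 0-strength ratings.
    val purchases = sc.parallelize(Seq(
      Rating(1, 101, 3.0),    // "rating" is interaction strength, e.g. purchase count
      Rating(2, 102, 1.0)))

    val model = ALS.trainImplicit(purchases, 10, 10, 0.01, 1.0)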

Question about mllib als's implicit training

2015-02-12 Thread Crystal Xing
Hi, I have some implicit rating data, such as purchasing data. I read the paper about the implicit training algorithm used in spark and it mentioned that for user-product pairs which do not have implicit rating data, such as no purchase, we need to provide the value as 0. This is different fro

spark left outer join with java.lang.UnsupportedOperationException: empty collection

2015-02-12 Thread java8964
Hi, I am using Spark 1.2.0 with Hadoop 2.2. I have 2 csv files; both have 8 fields. I know that the first field in both files is an ID. I want to find all the IDs that exist in the first file but NOT in the 2nd file. I came up with the following code in spark-shell. case class origAsLeft
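
Two sketches of that set difference on (id, row) pairs; the sample data is illustrative, and the second form avoids the Option handling entirely:

    val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val right = sc.parallelize(Seq(("b", 9)))

    // leftOuterJoin yields (id, (leftVal, Option[rightVal])); keep the misses.
    // The right side arrives as None, not null -- hence the fix above.
    val onlyInLeft = left.leftOuterJoin(right)
      .filter { case (_, (_, rightOpt)) => rightOpt.isEmpty }
      .keys

    // Equivalent and simpler for this particular question:
    val onlyInLeft2 = left.subtractByKey(right).keys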

Predicting Class Probability with Gradient Boosting/Random Forest

2015-02-12 Thread nilesh
We are using Gradient Boosting/Random Forests, which I have found provide the best results for our recommendations. My issue is that I need the probability of the 0/1 label, and not the predicted label. In the spark scala api, I see that the predict method also has an option to provide the probabilit
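
For the random forest case, one common workaround in this era of MLlib (not an official API) is to average the individual trees' votes; a sketch, assuming a binary-classification RandomForestModel:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // Each tree votes 0.0 or 1.0; the vote fraction approximates P(label = 1).
    def probabilityOfOne(model: RandomForestModel, features: Vector): Double = {
      val votes = model.trees.map(_.predict(features))
      votes.sum / votes.length
    }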

Re: Task not serializable problem in the multi-thread SQL query

2015-02-12 Thread Michael Armbrust
It looks to me like perhaps your SparkContext has shut down due to too many failures. I'd look in the logs of your executors for more information. On Thu, Feb 12, 2015 at 2:34 AM, lihu wrote: > I try to use the multi-thread to use the Spark SQL query. > some sample code just like this: > > val

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Thanks, Sean! Glad to know it will be in the future release. On Thu, Feb 12, 2015 at 2:45 PM, Sean Owen wrote: > Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 > > As an aside, it's quite expensive to make recommendations for all > users. IMHO this is not something to do, if y

Re: Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Michael Armbrust
You can start a JDBC server with an existing context. See my answer here: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-td20197.html On Thu, Feb 12, 2015 at 7:24 AM, Todd Nist wrote: > I have a question with regards to accessing SchemaRDD’s and Spark
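
A sketch of that pattern via HiveThriftServer2.startWithContext, as described in the linked answer; the JSON path and table name are placeholders:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val schemaRDD = hiveContext.jsonFile("hdfs:///some/path.json")  // any SchemaRDD
    schemaRDD.registerTempTable("mytable")           // temp table lives in this context...
    HiveThriftServer2.startWithContext(hiveContext)  // ...which JDBC/ODBC clients now share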

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Michael Armbrust
In Spark 1.3, parquet tables that are created through the datasources API will automatically calculate the sizeInBytes, which is used to broadcast. On Thu, Feb 12, 2015 at 12:46 PM, Dima Zhiyanov wrote: > Hello > > Has Spark implemented computing statistics for Parquet files? Or is there > any o

Re: Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Sean Owen
Not now, but see https://issues.apache.org/jira/browse/SPARK-3066 As an aside, it's quite expensive to make recommendations for all users. IMHO this is not something to do, if you can avoid it architecturally. For example, consider precomputing recommendations only for users whose probability of n

Re: Concurrent batch processing

2015-02-12 Thread Tathagata Das
So you have come across spark.streaming.concurrentJobs already :) Yeah, that is an undocumented feature that does allow multiple output operations to be submitted in parallel. However, this is not made public for the exact reasons that you realized - the semantics in case of stateful operations is not
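
For reference, a sketch of where that knob goes (the default is 1, so batches run strictly one after another; the app name is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.concurrentJobs", "2")  // undocumented; see caveats above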

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Silvio Fiorito
One method I’ve used is to publish each batch to a message bus or queue with a custom UI listening on the other end, displaying the results in d3.js or some other app. As far as I’m aware there isn’t a tool that will directly take a DStream. Spark Notebook seems to have some support for updatin

Is there a fast way to do fast top N product recommendations for all users

2015-02-12 Thread Crystal Xing
Hi, I wonder if there is a way to do fast top N product recommendations for all users in training using mllib's ALS algorithm. I am currently calling public Rating[] recommendProducts(int user,

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
The nm logs only seems to contain similar to the following. Nothing else in the same time range. Any help? 2015-02-12 20:47:31,245 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Event EventType: KILL_CONTAINER sent to absent container container_1422406067005_

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Felix C
You would probably write to hdfs or check out https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html You might be able to retrofit it to you use case. --- Original Message --- From: "Su She" Sent: February 11, 2015 10:55 PM To: "Felix C" Cc: "Kelvin Chu" <2dot7

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
Looks like my clock is in sync: -bash-4.1$ date && curl -v s3.amazonaws.com Thu Feb 12 21:40:18 UTC 2015 * About to connect() to s3.amazonaws.com port 80 (#0) * Trying 54.231.12.24... connected * Connected to s3.amazonaws.com (54.231.12.24) port 80 (#0) > GET / HTTP/1.1 > User-Agent: curl/7.19.7

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
So I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(items => { myFunction(items) }) }) and I tried this: .mapPartitions(itr => { itr.grouped(300).flatMap(myFunction) }) I tried making myFunction a method, a function val, and even moving it into a singleton obj

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
It seems unlikely to me that it would be a 2.2 issue, though not entirely impossible. Are you able to find any of the container logs? Is the NodeManager launching containers and reporting some exit code? -Sandy On Thu, Feb 12, 2015 at 1:21 PM, Anders Arpteg wrote: > No, not submitting from wi

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Anders Arpteg
No, not submitting from windows, from a debian distribution. Had a quick look at the rm logs, and it seems some containers are allocated but then released again for some reason. Not easy to make sense of the logs, but here is a snippet from the logs (from a test in our small test cluster) if you'd

RE: PySpark 1.2 Hadoop version mismatch

2015-02-12 Thread Michael Nazario
I looked at the environment which I ran the spark-submit command in, and it looks like there is nothing that could be messing with the classpath. Just to be sure, I checked the web UI which says the classpath contains: - The two jars I added: /path/to/avro-mapred-1.7.4-hadoop2.jar and lib/spark-

Re: Master dies after program finishes normally

2015-02-12 Thread Imran Rashid
The important thing here is the master's memory; that's where you're getting the GC overhead limit. The master is updating its UI to include your finished app when your app finishes, which would cause a spike in memory usage. I wouldn't expect the master to need a ton of memory just to serve the

Re: How to do broadcast join in SparkSQL

2015-02-12 Thread Dima Zhiyanov
Hello, Has Spark implemented computing statistics for Parquet files? Or is there any other way I can enable broadcast joins between parquet file RDDs in Spark Sql? Thanks Dima -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-do-broadcast-join-in-Sp

Re: spark, reading from s3

2015-02-12 Thread Franc Carter
Check that your timezone is correct as well, an incorrect timezone can make it look like your time is correct when it is skewed. cheers On Fri, Feb 13, 2015 at 5:51 AM, Kane Kim wrote: > The thing is that my time is perfectly valid... > > On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das > wrote: >

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Sandy Ryza
What version of Java are you using? Core NLP dropped support for Java 7 in its 3.5.0 release. Also, the correct command line option is --jars, not --addJars. On Thu, Feb 12, 2015 at 12:03 PM, Deborah Siegel wrote: > Hi Abe, > I'm new to Spark as well, so someone else could answer better. A few

Re: Concurrent batch processing

2015-02-12 Thread Matus Faro
I've been experimenting with my configuration for couple of days and gained quite a bit of power through small optimizations, but it may very well be something I'm doing crazy that is causing this problem. To give a little bit of a background, I am in the early stages of a project that consumes a

Re: Why can't Spark find the classes in this Jar?

2015-02-12 Thread Deborah Siegel
Hi Abe, I'm new to Spark as well, so someone else could answer better. A few thoughts which may or may not be the right line of thinking.. 1) Spark properties can be set on the SparkConf, and with flags in spark-submit, but settings on SparkConf take precedence. I think your jars flag for spark-su

correct way to broadcast a variable

2015-02-12 Thread freedafeng
Suppose I have an object to broadcast and then use it in a mapper function, something like the following (Python code): obj2share = sc.broadcast("Some object here") someRdd.map(createMapper(obj2share)).collect() The createMapper function will create a mapper function using the shared object's value. Anothe
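
The same pattern in Scala, as a minimal sketch: broadcast once from the driver, and read .value inside the closure on the executors:

    val obj2share = sc.broadcast("Some object here")
    val result = someRdd.map(x => (x, obj2share.value)).collect()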

Re: Concurrent batch processing

2015-02-12 Thread Arush Kharbanda
It could depend on the nature of your application, but spark streaming uses spark internally, so concurrency should be there. What is your use case? Are you sure that your configuration is good? On Fri, Feb 13, 2015 at 1:17 AM, Matus Faro wrote: > Hi, > > Please correct me if I'm wrong, in

Concurrent batch processing

2015-02-12 Thread Matus Faro
Hi, Please correct me if I'm wrong, in Spark Streaming, next batch will not start processing until the previous batch has completed. Is there any way to be able to start processing the next batch if the previous batch is taking longer to process than the batch interval? The problem I am facing is

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
I have 5 workers, each with executor-memory 8GB. My driver memory is 8GB as well. They are all 8-core machines. To answer Imran's question, my configurations are these: executor_total_max_heapsize = 18GB. This problem happens at the end of my program. I don't have to run a lot of jobs to see

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
How many nodes do you have in your cluster, how many cores, what is the size of the memory? On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar wrote: > Hi Arush, > Mine is a CDH5.3 with Spark 1.2. > The only change to my spark programs are > -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000.

Re: Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi Arush, Mine is a CDH5.3 with Spark 1.2. The only change to my spark programs are -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000. ..Manas On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda wrote: > What is your cluster configuration? Did you try looking at the Web UI? > There are

Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
What is your cluster configuration? Did you try looking at the Web UI? There are many tips here http://spark.apache.org/docs/1.2.0/tuning.html Did you try these? On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar wrote: > Hi, > I have a Hidden Markov Model running with 200MB data. > Once the progra

Re: spark, reading from s3

2015-02-12 Thread Kane Kim
The thing is that my time is perfectly valid... On Tue, Feb 10, 2015 at 10:50 PM, Akhil Das wrote: > Its with the timezone actually, you can either use an NTP to maintain > accurate system clock or you can adjust your system time to match with the > AWS one. You can do it as: > > telnet s3.amazo

Master dies after program finishes normally

2015-02-12 Thread Manas Kar
Hi, I have a Hidden Markov Model running with 200MB data. Once the program finishes (i.e. all stages/jobs are done) the program hangs for 20 minutes or so before killing master. In the spark master the following log appears. 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fat

Re: can we insert and update with spark sql

2015-02-12 Thread Debasish Das
I thought more on it... can we provide access to the IndexedRDD through the thriftserver API and let the mapPartitions query the API? I am not sure if ThriftServer is as performant as opening up an API using other akka based frameworks (like play or spray)... Any pointers will be really helpful... Ne

Re: Shuffle on joining two RDDs

2015-02-12 Thread Davies Liu
The feature works as expected in Scala/Java, but not implemented in Python. On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid wrote: > I wonder if the issue is that these lines just need to add > preservesPartitioning = true > ? > > https://github.com/apache/spark/blob/master/python/pyspark/join.py#L
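
A Scala sketch of the behavior described here: with both sides pre-partitioned by the same partitioner and materialized, the join has a narrow dependency and adds no shuffle (toy data; 8 partitions chosen arbitrarily):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(8)
    val rddA = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p).cache()
    val rddB = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p).cache()
    rddA.count(); rddB.count()       // force materialization of both sides
    val joined = rddA.join(rddB)     // co-partitioned: no shuffle for the join itself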

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the "spark.kryo.registrator" property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet wrote: > I'm trying to register a custom class that extends Kryo's Serializer > interface. I can't tell exactly what Class the registerKryo
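
A sketch of that wiring, with placeholder types standing in for the real class and serializer (registerKryoClasses() only accepts plain classes, hence the registrator route):

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass(val x: Int)                          // placeholder payload type
    class MySerializer extends Serializer[MyClass] {
      override def write(kryo: Kryo, out: Output, t: MyClass): Unit = out.writeInt(t.x)
      override def read(kryo: Kryo, in: Input, cls: Class[MyClass]): MyClass =
        new MyClass(in.readInt())
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[MyClass], new MySerializer)
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)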

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Sandy Ryza
I ran against 2.6, not 2.2. For that yarn-client run, do you have the application master log? On Thu, Feb 12, 2015 at 6:11 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > This is tricky to debug. Check logs of node and resource manager of YARN > to see if you can trace the error. In

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
I wonder if the issue is that these lines just need to add preservesPartitioning = true ? https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38 I am getting the feeling this is an issue w/ pyspark On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid wrote: > ah, sorry I am not too

Re: 8080 port password protection

2015-02-12 Thread Arush Kharbanda
You could apply a password using a filter on the server, though it doesn't look like the right group for the question. It can be done for Spark too, for the Spark UI. On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) < master.z...@gmail.com> wrote: > Hi everyone, > > Im creating a development

spark mllib error when predict on linear regression model

2015-02-12 Thread Donbeo
Hi, I have a model and I am trying to predict regPoints. Here is the code that I have used. A more detailed question is available at http://stackoverflow.com/questions/28482476/spark-mllib-predict-error-with-map scala> model res26: org.apache.spark.mllib.regression.LinearRegressionModel = (weig

8080 port password protection

2015-02-12 Thread MASTER_ZION (Jairo Linux)
Hi everyone, I'm creating a development machine in AWS and I would like to protect port 8080 with a password. Is it possible? Best Regards *Jairo Moreno*

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Ah, sorry, I am not too familiar w/ pyspark, so I missed that part. It could be that pyspark doesn't properly support narrow dependencies, or maybe you need to be more explicit about the partitioner. I am looking into the pyspark api but you might have some better guesses here than I do.

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi, I believe that partitionBy will use the same (default) partitioner on both RDDs. On 2015-02-12 17:12, Sean Owen wrote: Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote: Hi Karlson, I think your assumptions are correct

Re: obtain cluster assignment in K-means

2015-02-12 Thread Shi Yu
Thanks Robin, got it. On Thu, Feb 12, 2015 at 2:21 AM, Robin East wrote: > KMeans.train actually returns a KMeansModel so you can use predict() > method of the model > > e.g. clusters.predict(pointToPredict) > or > > clusters.predict(pointsToPredict) > > first is a single Vector, 2nd is RDD[Vect

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid wrote: > Hi Karlson, > > I think your assumptions are correct -- that join alone shouldn't require > any shuffling. But its possible you are getting tripped up by lazy > evaluation of RDD

Re: Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi Imran, thanks for your quick reply. Actually I am doing this: rddA = rddA.partitionBy(n).cache() rddB = rddB.partitionBy(n).cache() followed by rddA.count() rddB.count() then joinedRDD = rddA.join(rddB) I thought that the count() would force the evaluation, so any subsequ

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Imran Rashid
You need to import the implicit conversions to PairRDDFunctions with import org.apache.spark.SparkContext._ (note that this requirement will go away in 1.3: https://issues.apache.org/jira/browse/SPARK-4397) On Thu, Feb 12, 2015 at 9:36 AM, Vladimir Protsenko wrote: > Hi. I am stuck with how to
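
With that import in scope, the calls from the original post resolve; a short sketch (MyObject and MyOutputFormat are the poster's classes):

    import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (pre-1.3)

    rddres.saveAsHadoopFile("hdfs://localhost/output",
      classOf[String], classOf[MyObject], classOf[MyOutputFormat])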

Re: [hive context] Unable to query array once saved as parquet

2015-02-12 Thread Ayoub
Hi, as I was trying to find a workaround until this bug is fixed, I discovered another bug, posted here: https://issues.apache.org/jira/browse/SPARK-5775 For those who might have the same issue, one could use the "LOAD" sql command in a hive context to load the parquet file into the table as

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But its possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized & cached somewhere? eg., if you just did

failing GraphX application ('GC overhead limit exceeded', 'Lost executor', 'Connection refused', etc.)

2015-02-12 Thread Matthew Cornell
Hi Folks, I'm running a five-step path-following algorithm on a movie graph with 120K vertices and 400K edges. The graph has vertices for actors, directors, movies, users, and user ratings, and my Scala code is walking the path "rating > movie > rating > user > rating". There are 75K rating no

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Ted Yu
You can use JavaPairRDD which has: override def wrapRDD(rdd: RDD[(K, V)]): JavaPairRDD[K, V] = JavaPairRDD.fromRDD(rdd) Cheers On Thu, Feb 12, 2015 at 7:36 AM, Vladimir Protsenko wrote: > Hi. I am stuck with how to save file to hdfs from spark. > > I have written MyOutputFormat extends FileO

saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Vladimir Protsenko
Hi. I am stuck with how to save file to hdfs from spark. I have written MyOutputFormat extends FileOutputFormat, then in spark calling this: rddres.saveAsHadoopFile[MyOutputFormat]("hdfs://localhost/output") or rddres.saveAsHadoopFile("hdfs://localhost/output", classOf[String], classOf[MyObj

Shuffle on joining two RDDs

2015-02-12 Thread Karlson
Hi All, using Pyspark, I create two RDDs (one with about 2M records (~200MB), the other with about 8M records (~2GB)) of the format (key, value). I've done a partitionBy(num_partitions) on both RDDs and verified that both RDDs have the same number of partitions and that equal keys reside on

Is it possible to expose SchemaRDD’s from thrift server?

2015-02-12 Thread Todd Nist
I have a question with regards to accessing SchemaRDD’s and Spark SQL temp tables via the thrift server. It appears that a SchemaRDD when created is only available in the local namespace / context and are unavailable to external services accessing Spark through thrift server via ODBC; is this corr

Use of nscala-time within spark-shell

2015-02-12 Thread Hammam
Hi All, Thanks in advance for your help. I have a timestamp which I need to convert to a datetime using scala. A folder contains the three needed jar files: "joda-convert-1.5.jar joda-time-2.4.jar nscala-time_2.11-1.8.0.jar" Using the scala REPL and adding the jars: scala -classpath "*.jar" I can use ns
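
A minimal sketch of the conversion itself, using an assumed epoch-millis value; inside Spark the three jars also need to reach the executors, e.g. via spark-shell --jars:

    import com.github.nscala_time.time.Imports._   // re-exports joda-time's DateTime

    val ts = 1423785600000L                        // assumed epoch-millis timestamp
    val dt = new DateTime(ts)
    println(dt.toString("yyyy-MM-dd HH:mm:ss"))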
