Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-09 Thread drarse
Why? I tried this solution and it works fine. On Tuesday, June 9, 2015, codingforfun [via Apache Spark User List] ml-node+s1001560n23218...@n3.nabble.com wrote: Hi drarse, thanks for replying, the way you said to use a singleton object does not work. On 2015-06-09 16:24:25, drarse [via
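
A minimal sketch of the lazily-initialized singleton being discussed (the object and variable names are illustrative, not from the thread): the SQLContext is created once per JVM and reused across batches, so it lives for the whole streaming application.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Lazily instantiated singleton: created once, reused by every batch.
    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      def getInstance(sc: SparkContext): SQLContext = synchronized {
        if (instance == null) instance = new SQLContext(sc)
        instance
      }
    }

    // Inside foreachRDD, fetch the singleton instead of creating a new context:
    // val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)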

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Shao, Saisai
The shuffle data can be deleted through weak reference mechanism, you could check the code of ContextCleaner, also you could trigger a full gc manually with JVisualVM or some other tools to see if shuffle files are deleted. Thanks Jerry From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent:

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread amit tewari
Thanks Akhil, Mark for your valuable comments. Problem resolved. AT On Tue, Jun 9, 2015 at 2:17 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I think yes; as the documentation says, it "Creates tuples of the elements in this RDD by applying f." Thanks Best Regards On Tue, Jun 9, 2015 at

Re: Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Hi 罗辉, I think you are interpreting the logs wrong. Your program actually runs from this point (the rest is just startup and connection work): 15/06/08 16:14:22 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0 15/06/08 16:14:23 INFO storage.MemoryStore:

Reply: Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread luohui20001
Hi Akhil, not exactly: the task took 54s to finish, starting at 16:14:02 and ending at 16:14:56. Within those 54s, it needed 19s to store the value in memory, from 16:14:23 to 16:14:42. I think this is the most time-consuming part of this task, and also unreasonable. You may check

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Haopu Wang
Jerry, I agree with you. However, in my case, I kept monitoring the blockmanager folder. I do sometimes see the number of files decrease, but the folder's size kept increasing. Below is a screenshot of the folder. You can see some old files are somehow not deleted.

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Daniel Darabos
It would be even faster to load the data on the driver and sort it there without using Spark :). Using reduce() is cheating, because it only works as long as the data fits on one machine. That is not the targeted use case of a distributed computation system. You can repeat your test with more data

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Possibly in the future, if and when the Spark architecture allows workers to launch Spark jobs (the functions passed to the transformation or action APIs of an RDD), it will be possible to have an RDD of RDDs. On Tue, Jun 9, 2015 at 1:47 PM, kiran lonikar loni...@gmail.com wrote: A similar question was asked

Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Jeroen Vlek
Hi, I posted a question with regards to Phoenix and Spark Streaming on StackOverflow [1]. Please find a copy of the question to this email below the first stack trace. I also already contacted the Phoenix mailing list and tried the suggestion of setting spark.driver.userClassPathFirst.

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
I think yes; as the documentation says, it "Creates tuples of the elements in this RDD by applying f." Thanks Best Regards On Tue, Jun 9, 2015 at 1:54 PM, amit tewari amittewar...@gmail.com wrote: Actually the question was: will keyBy() accept multiple fields (e.g. x(0), x(1)) as the key? On

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
A similar question was asked before: http://apache-spark-user-list.1001560.n3.nabble.com/Rdd-of-Rdds-td17025.html Here is one of the reasons why I think RDD[RDD[T]] is not possible: - RDD is only a handle to the actual data partitions. It has a reference/pointer to the SparkContext object

Re: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Benjamin Fradet
Hi, are you restarting your Spark streaming context through getOrCreate? On 9 Jun 2015 09:30, Haopu Wang hw...@qilinsoft.com wrote: When I ran a Spark streaming application for longer, I noticed the local directory's size kept increasing. I set spark.cleaner.ttl to 1800 seconds in order
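
A minimal sketch of restarting a streaming application through getOrCreate, assuming a hypothetical checkpoint directory and the 10-second batch interval mentioned in the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app" // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("my-streaming-app") // hypothetical name
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the DStream operations here ...
      ssc
    }

    // Recovers the context from the checkpoint if one exists, otherwise builds a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()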

Re: Rdd of Rdds

2015-06-09 Thread lonikar
Replicating my answer to another question asked today: Here is one of the reasons why I think RDD[RDD[T]] is not possible: * RDD is only a handle to the actual data partitions. It has a reference/pointer to the SparkContext object (sc) and a list of partitions. * The SparkContext is an

RE: [SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Shao, Saisai
From the stack trace, I think this problem may be due to the deletion of the broadcast variable: since you set spark.cleaner.ttl, the old broadcast variable is deleted after that timeout, and you will hit this exception when you want to use it again after the time limit. Basically I think

回复:Re: How to decrease the time of storing block in memory

2015-06-09 Thread luohui20001
Only 1 minor GC, 0.07s. Thanks & best regards! San.Luo - Original Message - From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How to decrease the time of storing block in memory Date: 2015-06-09 15:02

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
You can put a Thread.sleep(10) in the code to have the UI available for quite some time (put it just before starting any of your transformations). Or you can enable the Spark history server https://spark.apache.org/docs/latest/monitoring.html too. I believe --jars
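
A hedged sketch of enabling event logging so the history server linked above can render the UI after the job exits; the application name and log directory are assumptions, and the directory must already exist:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("history-enabled-app")                     // hypothetical name
      .set("spark.eventLog.enabled", "true")                 // write event logs for the history server
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs") // hypothetical, pre-created directory
    val sc = new SparkContext(conf)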

Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Is it that task taking 19s? It won't simply be taking 19s to store 2KB of data into memory; there could be other operations happening too (the transformations that you are doing). It would be good if you could paste the code snippet you are running, for a better understanding. Thanks Best

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Thanks Akhil: The driver fails too fast for me to get a look at 4040. Is there any other way to see the download and ship process of the files? Is the driver supposed to download these jars from HDFS to some location and then ship them to executors? I can see from the log that the driver downloaded the

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread amit tewari
Actually the question was: will keyBy() accept multiple fields (e.g. x(0), x(1)) as the key? On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com wrote: Thanks Akhil, as you suggested, I have to go with keyBy(route) as I need the columns intact. But will keyBy() accept multiple
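
A minimal sketch answering the question, assuming the /test7 and /test8 inputs from this thread: keyBy accepts any key-producing function, so a tuple of fields works, and join then operates on the resulting pair RDD (key, value) rather than a 3-tuple.

    val input1 = sc.textFile("/test7").map(_.split(",").map(_.trim))
    val input2 = sc.textFile("/test8").map(_.split(",").map(_.trim))

    // Composite key (x(0), x(1)); the remaining columns stay intact as the value.
    val keyed1 = input1.keyBy(x => (x(0), x(1)))   // RDD[((String, String), Array[String])]
    val keyed2 = input2.keyBy(x => (x(0), x(1)))

    // join is only defined on RDD[(K, V)] pairs, which keyBy produces.
    val joined = keyed1.join(keyed2)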

Re: Cassandra Submit

2015-06-09 Thread Yasemin Kaya
I couldn't find any solution. I can write but I can't read from Cassandra. 2015-06-09 8:52 GMT+03:00 Yasemin Kaya godo...@gmail.com: Thanks a lot Mohammed, Gerard and Yana. I can write to the table, but it returns an exception. It says: Exception in thread main java.io.IOException: Failed to open

Re: FileOutputCommitter deadlock 1.3.1

2015-06-09 Thread Steve Loughran
On 8 Jun 2015, at 15:55, Richard Marscher rmarsc...@localytics.com wrote: Hi, we've been seeing occasional issues in production with the FileOutputCommitter reaching a deadlock situation. We are writing our data to S3 and currently have speculation enabled. What

RE: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-09 Thread Cheng, Hao
Is it a large result set returned from the Thrift Server? And can you paste the SQL and the physical plan? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, June 9, 2015 12:01 PM To: Sourav Mazumder Cc: user Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
Try this way: scala> val input1 = sc.textFile("/test7").map(line => line.split(",").map(_.trim)); scala> val input2 = sc.textFile("/test8").map(line => line.split(",").map(_.trim)); scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3))); scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3))); scala

Re: Saving compressed textFiles from a DStream in Scala

2015-06-09 Thread Akhil Das
like this? myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec)) Thanks Best Regards On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote: It looks like saveAsTextFiles doesn't support the compression parameter of RDD.saveAsTextFile. Is there a way to add the
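
A sketch of the suggestion, assuming Gzip output and an illustrative output prefix; myDStream is taken from the thread, and saveAsTextFile has an overload that accepts a compression codec class:

    import org.apache.hadoop.io.compress.GzipCodec

    myDStream.foreachRDD { (rdd, time) =>
      // One compressed output directory per batch, named by the batch time.
      rdd.saveAsTextFile(s"/sigmoid/batch-${time.milliseconds}", classOf[GzipCodec])
    }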

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread Haopu Wang
Cheng, yes, it works, I set the property in SparkConf before initiating the SparkContext. The property name is spark.hadoop.dfs.replication Thanks for the help! -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, June 08, 2015 6:41 PM To: Haopu Wang; user
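
A minimal sketch of the setting that worked here; any spark.hadoop.* key is forwarded into the Hadoop Configuration, so this one becomes dfs.replication for files Spark writes (the application name and replication value are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-writer")              // hypothetical name
      .set("spark.hadoop.dfs.replication", "2")  // forwarded to Hadoop as dfs.replication
    val sc = new SparkContext(conf)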

Re: Error in using saveAsParquetFile

2015-06-09 Thread Bipin Nag
Cheng you were right. It works when I remove the field from either one. I should have checked the types beforehand. What confused me is that Spark attempted to join it and midway threw the error. It isn't quite there yet. Thanks for the help. On Mon, Jun 8, 2015 at 8:29 PM Cheng Lian

Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Maybe you should check in your driver UI and see if there's any GC time involved etc. Thanks Best Regards On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote: hi there I am trying to decrease my app's running time in the worker node. I checked the log and found the most

[SparkStreaming 1.3.0] Broadcast failure after setting spark.cleaner.ttl

2015-06-09 Thread Haopu Wang
When I ran a Spark streaming application for longer, I noticed the local directory's size kept increasing. I set spark.cleaner.ttl to 1800 seconds in order to clean the metadata. The Spark streaming batch duration is 10 seconds and the checkpoint duration is 10 minutes. The setting took effect but

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread amit tewari
Thanks Akhil, as you suggested, I have to go with keyBy(route) as I need the columns intact. But will keyBy() accept multiple fields (e.g. x(0), x(1))? Thanks Amit On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try this way: scala> val input1 =

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
Once you submit the application, you can check the driver UI (running on port 4040), Environment tab, to see whether the jars you added got shipped or not. If they are shipped and you are still getting NoClassDef exceptions, it means you have a jar conflict, which you can resolve

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread ayan guha
Hi, I am a little confused here. If I am writing to HDFS, shouldn't the HDFS replication factor automatically kick in? In other words, how is the Spark writer different from an hdfs -put command (from the perspective of HDFS, of course)? Best Ayan On Tue, Jun 9, 2015 at 5:17 PM, Haopu Wang

Re: Error in using saveAsParquetFile

2015-06-09 Thread Cheng Lian
Yeah, this does look confusing. We are trying to improve the error reporting by catching similar issues at the end of the analysis phase and give more descriptive error messages. Part of the work can be found here:

Different Sorting RDD methods in Apache Spark

2015-06-09 Thread raggy
For a research project, I tried sorting the elements in an RDD. I did this using two different approaches. In the first method, I applied a mapPartitions() function on the RDD, so that it would sort the contents of the RDD and provide a result RDD that contains the sorted list as the only record in

Re: Cassandra Submit

2015-06-09 Thread Yasemin Kaya
Yes, I think my Cassandra is listening on 9160. Actually I know it from the yaml file. The file includes: rpc_address: localhost # port for Thrift to listen for clients on rpc_port: 9160 I checked the port with nc -z localhost 9160; echo $? and it returns 0. I think it is closed; should I open this port?

Re: BigDecimal problem in parquet file

2015-06-09 Thread Cheng Lian
Would you please provide a snippet that reproduce this issue? What version of Spark were you using? Cheng On 6/9/15 8:18 PM, bipin wrote: Hi, When I try to save my data frame as a parquet file I get the following error: java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to

traverse a graph based on edge properties whilst counting matching vertex attributes

2015-06-09 Thread MA2
Hi All, I was hoping somebody might be able to help out. I currently have a network built using GraphX which looks like the following (only with a much larger number of vertices and edges). Vertices (ID, Attribute1, Attribute2):
1001 2 0
1002 1 0
1003 2 1
1004 3 2
1006 4 0
1007 5 1

Re: Cassandra Submit

2015-06-09 Thread Yasemin Kaya
Sorry, to answer: I ran lsof -i:9160 in the terminal; the result is: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 7597 inosens 101u IPv4 85754 0t0 TCP localhost:9160 (LISTEN) So is port 9160 available or not? 2015-06-09 17:16 GMT+03:00 Yasemin Kaya

Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
Hi all, I'm manually building Spark from source against 1.4 branch and submitting the job against Yarn. I am seeing very strange behavior. The first 2 or 3 times I submit the job, it runs fine, computes Pi, and exits. The next time I run it, it gets stuck in the ACCEPTED state. I'm kicking off a

Join between DStream and Periodically-Changing-RDD

2015-06-09 Thread Ilove Data
Hi, I'm trying to join a DStream with an interval of, let's say, 20s with an RDD loaded from an HDFS folder which is changing periodically, let's say a new file arrives in the folder every 10 minutes. How should it be done, considering the HDFS files in the folder are periodically changing/adding new

Costs of transformations

2015-06-09 Thread Vijayasarathy Kannan
Is it possible to bound the costs of operations such as flatMap() and collect() based on the size of the RDDs?

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-09 Thread Josh Mahonin
This may or may not be helpful for your classpath issues, but I wanted to verify that basic functionality worked, so I made a sample app here: https://github.com/jmahonin/spark-streaming-phoenix This consumes events off a Kafka topic using spark streaming, and writes out event counts to Phoenix

Re: Spark 1.3.1 SparkSQL metastore exceptions

2015-06-09 Thread Cheng Lian
Seems that you're using a DB2 Hive metastore? I'm not sure whether Hive 0.12.0 officially supports DB2, but probably not? (Since I didn't find DB2 scripts under the metastore/scripts/upgrade folder in Hive source tree.) Cheng On 6/9/15 8:28 PM, Needham, Guy wrote: Hi, I’m using Spark 1.3.1

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Marcelo Vanzin
If your application is stuck in that state, it generally means your cluster doesn't have enough resources to start it. In the RM logs you can see how many vcores / memory the application is asking for, and then you can check your RM configuration to see if that's currently available on any single

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Raghav Shankar
Thank you for your responses! You mention that it only works as long as the data fits on a single machine. What I am trying to do is receive the sorted contents of my dataset. For this to be possible, the entire dataset should be able to fit on a single machine. Are you saying that sorting the

Implementing top() using treeReduce()

2015-06-09 Thread raggy
I am trying to implement top-k in scala within apache spark. I am aware that spark has a top action. But, top() uses reduce(). Instead, I would like to use treeReduce(). I am trying to compare the performance of reduce() and treeReduce(). The main issue I have is that I cannot use these 2 lines
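
A minimal sketch of top-k built from mapPartitions plus treeReduce (each partition contributes its local top-k, then the partial results are merged in a tree); the RDD[Int] element type and the usage values are illustrative, not from the thread:

    import org.apache.spark.rdd.RDD

    def treeTop(rdd: RDD[Int], k: Int): Array[Int] = {
      val desc = Ordering[Int].reverse
      rdd
        .mapPartitions(iter => Iterator(iter.toArray.sorted(desc).take(k))) // local top-k per partition
        .treeReduce((a, b) => (a ++ b).sorted(desc).take(k))                // tree-merge the partials
    }

    // Usage: treeTop(sc.parallelize(1 to 1000000), 10)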

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
hm. Yeah, your port is good...have you seen this thread: http://stackoverflow.com/questions/27288380/fail-to-use-spark-cassandra-connector ? It seems that you might be running into version mis-match issues? What versions of Spark/Cassandra-connector are you trying to use? On Tue, Jun 9, 2015 at

Re: RDD of RDDs

2015-06-09 Thread Mark Hamstra
That would constitute a major change in Spark's architecture. It's not happening anytime soon. On Tue, Jun 9, 2015 at 1:34 AM, kiran lonikar loni...@gmail.com wrote: Possibly in future, if and when spark architecture allows workers to launch spark jobs (the functions passed to transformation

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Are you saying that sorting the entire data and collecting it on the driver node is not a typical use case? It most definitely is not. Spark is designed and intended to be used with very large datasets. Far from being typical, collecting hundreds of gigabytes, terabytes or petabytes to the

Re: Different Sorting RDD methods in Apache Spark

2015-06-09 Thread Mark Hamstra
Correct. Trading away scalability for increased performance is not an option for the standard Spark API. On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: It would be even faster to load the data on the driver and sort it there without using Spark :).

Re: Problem getting program to run on 15TB input

2015-06-09 Thread Arun Luthra
I found that the problem was due to garbage collection in filter(). Using Hive to do the filter solved the problem. A lot of other problems went away when I upgraded to Spark 1.2.0, which compresses various task overhead data (HighlyCompressedMapStatus etc.). It has been running very very

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
Hi Marcelo, Thanks. I think something more subtle is happening. I'm running a single-node cluster, so there's only 1 NM. When I executed the exact same job the 4th time, the cluster was idle, and there was nothing else being executed. RM currently reports that I have 6.5GB of memory and 4 cpus

Re: Cassandra Submit

2015-06-09 Thread Yasemin Kaya
My jar files are: cassandra-driver-core-2.1.5.jar cassandra-thrift-2.1.3.jar guava-18.jar jsr166e-1.1.0.jar spark-assembly-1.3.0.jar spark-cassandra-connector_2.10-1.3.0-M1.jar spark-cassandra-connector-java_2.10-1.3.0-M1.jar spark-core_2.10-1.3.1.jar spark-streaming_2.10-1.3.1.jar And my code

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
It is strange that writes work but reads do not. If it were a Cassandra connectivity issue, then neither writes nor reads would work. Perhaps the problem is somewhere else. Can you send the complete exception trace? Also, just to make sure that there is no DNS issue, try this:

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
Thanks Ayan, I used beeline in Spark to connect to HiveServer2 that I started from my Hive. So as you said, it is really interacting with Hive as a typical 3rd-party application, and it is NOT using the Spark execution engine. I was thinking that it gets metastore info from Hive, but uses Spark to

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
Hm, the jars look ok, although it's a bit of a mess -- you have spark-assembly 1.3.0 but then core and streaming 1.3.1... It's generally a bad idea to mix versions. spark-assembly bundles all Spark packages, so either include them separately or use spark-assembly, but don't mix them like you've shown. As to the
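
A hypothetical build.sbt fragment illustrating the advice: keep the Spark artifacts on a single version and mark them provided (spark-submit supplies them at runtime), then add the matching connector; the versions shown are only an example consistent with this thread.

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-core"                %  "1.3.1" % "provided",
      "org.apache.spark"   %% "spark-streaming"           %  "1.3.1" % "provided",
      "com.datastax.spark" %% "spark-cassandra-connector" %  "1.3.0-M1"
    )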

Re: Running SparkSql against Hive tables

2015-06-09 Thread James Pirz
I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on Hadoop 2.6. I looked the ThriftServer2 logs, and I realized that the server was not starting properly, because of failure in creating a server socket. In fact, I had passed the URI to my Hiveserver2 service, launched from Hive,

Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread karma243
Hello, While trying to link kafka to spark, I'm not able to get data from kafka. This is the error that I'm getting from spark logs: ERROR EndpointWriter: dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://sparkMaster@localhost:7077/]] arriving at

Re: spark eventLog and history server

2015-06-09 Thread Richard Marscher
Hi, I don't have a complete answer to your questions but: Removing the suffix does not solve the problem - unfortunately this is true, the master web UI only tries to build out a Spark UI from the event logs once, at the time the context is closed. If the event logs are in-progress at this time,

Re: Cassandra Submit

2015-06-09 Thread Yasemin Kaya
I removed core and streaming jar. And the exception still same. I tried what you said then results: ~/cassandra/apache-cassandra-2.1.5$ bin/cassandra-cli -h localhost -p 9160 Connected to: Test Cluster on localhost/9160 Welcome to Cassandra CLI version 2.1.5 The CLI is deprecated and will be

Re: [Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-09 Thread Snehal Nagmote
Hi Dibyendu, Thank you for your reply. I am using the Kafka consumer https://github.com/dibbhatt/kafka-spark-consumer which uses spark-core and spark-streaming 1.2.2. The Spark cluster on which I am running the application is 1.3.1. I will test it with the latest changes. Yes, the underlying BlockManager gives the error

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
Yes! If I either specify a different queue or don't specify a queue at all, it works. On Tue, Jun 9, 2015 at 4:25 PM, Marcelo Vanzin van...@cloudera.com wrote: Does it work if you don't specify a queue? On Tue, Jun 9, 2015 at 1:21 PM, Matt Kapilevich matve...@gmail.com wrote: Hi Marcelo,

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
From the RM scheduler, I see 3 applications currently stuck in the root.thequeue queue. Used Resources: memory:0, vCores:0; Num Active Applications: 0; Num Pending Applications: 3; Min Resources: memory:0, vCores:0; Max Resources: memory:6655, vCores:4; Steady Fair Share: memory:1664, vCores:0

[SPARK-6330] 1.4.0/1.5.0 Bug to access S3 -- AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or

2015-06-09 Thread Shuai Zheng
Hi All, I have some code to access s3 from Spark. The code is as simple as: JavaSparkContext ctx = new JavaSparkContext(sparkConf); Configuration hadoopConf = ctx.hadoopConfiguration(); // aws.secretKey=Zqhjim3GB69hMBvfjh+7NX84p8sMF39BHfXwO3Hs
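
A hedged Scala sketch of the usual workaround while that issue is open: set the credentials directly on the Hadoop configuration (the s3n key names are the standard ones; the bucket and the environment-variable names are assumptions, and the thread's JavaSparkContext exposes the same hadoopConfiguration()):

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Read with the matching scheme once the keys are set.
    val data = sc.textFile("s3n://my-bucket/some/path/") // hypothetical bucket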

Linear Regression with SGD

2015-06-09 Thread Stephen Carman
Hi User group, We are using spark Linear Regression with SGD as the optimization technique and we are achieving very sub-optimal results. Can anyone shed some light on why this implementation seems to produce such poor results vs our own implementation? We are using a very small dataset, but

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Marcelo Vanzin
Apologies, I see you already posted everything from the RM logs that mention your stuck app. Have you tried restarting the YARN cluster to see if that changes anything? Does it go back to the first few tries work behaviour? I run 1.4 on top of CDH 5.4 pretty often and haven't seen anything like

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Marcelo Vanzin
Does it work if you don't specify a queue? On Tue, Jun 9, 2015 at 1:21 PM, Matt Kapilevich matve...@gmail.com wrote: Hi Marcelo, Yes, restarting YARN fixes this behavior and it again works the first few times. The only thing that's consistent is that once Spark job submissions stop working,

Re: which database for gene alignment data ?

2015-06-09 Thread roni
Hi Frank, Thanks for the reply. I downloaded ADAM and built it but it does not seem to list this function for command line options. Are these exposed as public API and I can call it from code ? Also , I need to save all my intermediate data. Seems like ADAM stores data in Parquet on HDFS. I want

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
Hi Marcelo, Yes, restarting YARN fixes this behavior and it again works the first few times. The only thing that's consistent is that once Spark job submissions stop working, it's broken for good. On Tue, Jun 9, 2015 at 4:12 PM, Marcelo Vanzin van...@cloudera.com wrote: Apologies, I see you

spark on yarn

2015-06-09 Thread Neera
In my test data, I have a JavaRDD with a single String(size of this RDD is 1). On a 3 node Yarn cluster, mapToPair function on this RDD sends the same input String to 2 different nodes. Container logs on these nodes show the same string as input. Overriding default partition count by

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Marcelo Vanzin
On Tue, Jun 9, 2015 at 11:31 AM, Matt Kapilevich matve...@gmail.com wrote: Like I mentioned earlier, I'm able to execute Hadoop jobs fine even now - this problem is specific to Spark. That doesn't necessarily mean anything. Spark apps have different resource requirements than Hadoop apps.

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-09 Thread nsalian
1) Could you share your command? 2) Are the kafka brokers on the same host? 3) Could you run a --describe on the topic to see if the topic is setup correctly (just to be sure)? -- View this message in context:

Re: Determining number of executors within RDD

2015-06-09 Thread maxdml
You should try, from the SparkConf object, to issue a get. I don't have the exact name for the matching key, but from reading the code in SparkSubmit.scala, it should be something like: conf.get("spark.executor.instances") -- View this message in context:
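
A small sketch of two ways to get at this from the driver, assuming sc is the active SparkContext; the spark.executor.instances key applies when running on YARN with a fixed executor count:

    // Read the requested executor count back from the configuration (with a default).
    val requested = sc.getConf.getInt("spark.executor.instances", 1)

    // Or count the executors currently registered with the driver
    // (the returned map includes the driver itself, hence the minus one).
    val registered = sc.getExecutorMemoryStatus.size - 1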

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread nsalian
By writing PDF files, do you mean something equivalent to a hadoop fs -put /path? I'm not sure how PDFBox works though; have you tried writing individually without Spark? We can potentially look, once you have established that as a starting point, at how Spark can be interfaced to write to HDFS.

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread nsalian
I see the other jobs SUCCEEDED without issues. Could you snapshot the FairScheduler activity as well? My guess is that, with the single core, it is reaching a NodeManager that is still busy with other jobs and the job ends up in a waiting state. Does the job eventually complete? Could you

Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread Richard Catlin
I would like to write pdf files using pdfbox to HDFS from my Spark application. Can this be done? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-a-Spark-App-run-with-spark-submit-write-pdf-files-to-HDFS-tp23233.html Sent from the Apache Spark User

Re: Issue running Spark 1.4 on Yarn

2015-06-09 Thread Matt Kapilevich
I've tried running a Hadoop app pointing to the same queue. Same thing now, the job doesn't get accepted. I've cleared out the queue and killed all the pending jobs, the queue is still unusable. It seems like an issue with YARN, but it's specifically Spark that leaves the queue in this state.

Re: spark-submit working differently than pyspark when trying to find external jars

2015-06-09 Thread Walt Schlender
I figured it out; in case anyone else has this problem in the future: spark-submit --driver-class-path lib/postgresql-9.4-1201.jdbc4.jar --packages com.databricks:spark-csv_2.10:1.0.3 path/to/my/script.py What I found is that you MUST put the path to your script at the end of the spark-submit

RE: Cassandra Submit

2015-06-09 Thread Mohammed Guller
Looks like the real culprit is a library version mismatch: Caused by: java.lang.NoSuchMethodError: org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(Ljava/lang/String;I)Lorg/apache/thrift/transport/TTransport; at

Re: Linear Regression with SGD

2015-06-09 Thread Robin East
Hi Stephen, how many is a very large number of iterations? SGD is notorious for requiring 100s or 1000s of iterations; you may also need to spend some time tweaking the step size. In 1.4 there is an implementation of ElasticNet Linear Regression which is supposed to compare favourably with an
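
A minimal sketch of raising the iteration count and tuning the step size with the MLlib API; the toy data and parameter values are only for illustration and will need tuning per dataset:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val training = sc.parallelize(Seq(          // toy data; replace with the real dataset
      LabeledPoint(1.0, Vectors.dense(1.0)),
      LabeledPoint(2.0, Vectors.dense(2.0)),
      LabeledPoint(3.0, Vectors.dense(3.0))))

    val numIterations = 1000  // SGD commonly needs hundreds or thousands of iterations
    val stepSize = 0.01       // step size usually needs tuning per dataset
    val model = LinearRegressionWithSGD.train(training, numIterations, stepSize)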

BigDecimal problem in parquet file

2015-06-09 Thread bipin
Hi, When I try to save my data frame as a parquet file I get the following error: java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to org.apache.spark.sql.types.Decimal at org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:220)

Spark 1.3.1 SparkSQL metastore exceptions

2015-06-09 Thread Needham, Guy
Hi, I'm using Spark 1.3.1 to insert into a Hive 0.12 table from a SparkSQL query. The query is a very simple select from a dummy Hive table used for benchmarking. I'm using a create table as statement to do the insert. No matter if I do that or an insert overwrite, I get the same Hive

Re: Implementing top() using treeReduce()

2015-06-09 Thread DB Tsai
Having the following code in RDD.scala works for me. PS, in the following code, I merge the smaller queue into the larger one. I wonder if this will help performance. Let me know when you do the benchmark. def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope { if (num ==

how to clear state in Spark Streaming based on emitting

2015-06-09 Thread Robert Towne
With Spark Streaming, I am maintaining state (updateStateByKey every 30s) and emitting to file the parts of that state that have been closed every 5 minutes, but I only care about the last state collected. In 5m, there will be 10 updateStateByKey iterations called. For example: … val ssc = new
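
A minimal sketch of the usual idiom for clearing state after emitting: returning None from the update function drops the key from the state DStream. The "closed" predicate, the Long state type, and the pairs stream are all assumptions, not from the thread:

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical predicate deciding when a key's state is "closed" and can be dropped.
    def isClosed(count: Long): Boolean = count >= 100L

    def updateFunc(newValues: Seq[Long], state: Option[Long]): Option[Long] = {
      val updated = state.getOrElse(0L) + newValues.sum
      if (isClosed(updated)) None else Some(updated)  // None removes the key from the state
    }

    // pairs is assumed to be a DStream[(String, Long)] built earlier in the job.
    def withState(pairs: DStream[(String, Long)]): DStream[(String, Long)] =
      pairs.updateStateByKey(updateFunc _)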

Re: Can a Spark App run with spark-submit write pdf files to HDFS

2015-06-09 Thread William Briggs
I don't know anything about your use case, so take this with a grain of salt, but typically if you are operating at a scale that benefits from Spark, then you likely will not want to write your output records as individual files into HDFS. Spark has built-in support for the Hadoop SequenceFile
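
A tiny sketch of the SequenceFile route mentioned above, with hypothetical record contents and output path; pair RDDs of Writable-convertible types (such as String keys and values) can be written as a single SequenceFile dataset instead of many small files:

    import org.apache.spark.SparkContext._  // Writable conversions (needed on older Spark versions)

    val records = sc.parallelize(Seq(
      ("doc-1", "contents of document one"),
      ("doc-2", "contents of document two")))
    records.saveAsSequenceFile("hdfs:///output/docs-seq") // hypothetical path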

Re: flatMap output on disk / flatMap memory overhead

2015-06-09 Thread Imran Rashid
I agree with Richard. It looks like the issue here is shuffling, and shuffle data is always written to disk, so the issue is definitely not that all the output of flatMap has to be stored in memory. If at all possible, I'd first suggest upgrading to a new version of spark -- even in 1.2, there

spark-submit does not use hive-site.xml

2015-06-09 Thread James Pirz
I am using Spark (standalone) to run queries (from a remote client) against data in tables that are already defined/loaded in Hive. I have started metastore service in Hive successfully, and by putting hive-site.xml, with proper metastore.uri, in $SPARK_HOME/conf directory, I tried to share its
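
A minimal sketch of the intended setup: with hive-site.xml on the driver classpath ($SPARK_HOME/conf), a HiveContext talks to the shared metastore, so tables already defined in Hive become queryable from Spark SQL; the table name is hypothetical.

    import org.apache.spark.sql.hive.HiveContext

    val hiveCtx = new HiveContext(sc) // picks up hive-site.xml from the classpath
    hiveCtx.sql("SELECT COUNT(*) FROM some_hive_table").show() // hypothetical table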

Re: Linear Regression with SGD

2015-06-09 Thread DB Tsai
As Robin suggested, you may try the following new implementation. https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef Thanks. Sincerely, DB Tsai -- Blog: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
Is your Cassandra installation actually listening on 9160? lsof -i :9160: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 29232 ykadiysk 69u IPv4 42152497 0t0 TCP localhost:9160 (LISTEN) I am running an out-of-the-box Cassandra conf where rpc_address: localhost #

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
At which point would I call cache()? I just want the runtime to spill to disk when necessary without me having to know when the necessary is. On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org wrote: direct stream isn't a receiver, it isn't required to cache data anywhere
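
A short sketch of the point being asked about: cache()/persist() is called on the DStream right after it is created, and a MEMORY_AND_DISK level lets blocks spill to disk when memory is tight. The stream parameter is assumed to be the direct Kafka stream from the thread:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.dstream.DStream

    // Persist immediately after creating the stream; Spark spills to disk as needed.
    def persistWithSpill[T](stream: DStream[T]): DStream[T] =
      stream.persist(StorageLevel.MEMORY_AND_DISK)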

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-09 Thread Sourav Mazumder
From the log file I noticed that the ExecutorLostFailure happens after the memory used by the Executor becomes more than the Executor memory value. However, even if I increase the value of Executor Memory the Executor still fails - only it takes longer. I'm wondering that for joining 2 Hive tables,

Spark's Scala shell killing itself

2015-06-09 Thread Chandrashekhar Kotekar
Hi, I have configured Spark to run on YARN. Whenever I start spark shell using 'spark-shell' command, it automatically gets killed. Output looks like below: ubuntu@dev-cluster-gateway:~$ ls shekhar/ edx-spark ubuntu@dev-cluster-gateway:~$ spark-shell Welcome to __ /

append file on hdfs

2015-06-09 Thread Pa Rö
Hi community, I want to append results to one file. If I work locally my function builds everything correctly; if I run this on a YARN cluster, I lose some rows. Here is my function to write: points.foreach( new VoidFunction<Tuple2<Integer, GeoTimeDataTupel>>() { private static final long

Re: RDD of RDDs

2015-06-09 Thread kiran lonikar
Yes, true. That's why I said if and when. But hopefully I have given a correct explanation of why an RDD of RDDs is not possible. On 09-Jun-2015 10:22 pm, Mark Hamstra m...@clearstorydata.com wrote: That would constitute a major change in Spark's architecture. It's not happening anytime soon. On

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Thanks so much! I did put a sleep in my code to have the UI available. Now from the UI, I can see: in the "Spark Properties" section, the spark.jars and spark.files are set as what I want; in the "Classpath Entries" section, my jar and file paths are there (with an HDFS path)

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Jörn Franke
I am not sure they work with HDFS paths. You may want to look at the source code. Alternatively you can create a fat jar containing all jars (let your build tool set META-INF correctly). This always works. On Wed, 10 Jun 2015 at 6:22, Dong Lei dong...@microsoft.com wrote: Thanks so much!

RE: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Dong Lei
Hi Jörn, I started to check the code and sadly it seems it does not work with HDFS paths. In HTTPFileServer.scala: def addFileToDir: …. Files.copy …. It looks like it only copies files from local to