Spark SQL queries hive table, real time ?

2015-07-06 Thread spierki
Hello, I'm actually asking myself about the performance of using Spark SQL with Hive to do real-time analytics. I know that Hive has been created for batch processing, and Spark is used to do fast queries. But will using Spark SQL with Hive allow me to do real-time queries? Or will it just make

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Hafiz Mujadid
Hi! I am trying to load data from my MySQL database using the following code: val query = "select * from " + table val url = "jdbc:mysql://" + dataBaseHost + ":" + dataBasePort + "/" + dataBaseName + "?user=" + db_user + "&password=" + db_pass val sc = new SparkContext(new

Re: Split RDD into two in a single pass

2015-07-06 Thread Daniel Darabos
This comes up so often. I wonder if the documentation or the API could be changed to answer this question. The solution I found is from http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job. You basically write the items into two directories in a single

Re: Re: Application jar file not found exception when submitting application

2015-07-06 Thread bit1...@163.com
Thanks Shixiong for the reply. Yes, I confirm that the file exists there; I simply checked with ls -l /data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar bit1...@163.com From: Shixiong Zhu Date: 2015-07-06 18:41 To: bit1...@163.com CC: user Subject: Re:

Spark equivalent for Oracle's analytical functions

2015-07-06 Thread gireeshp
Is there any equivalent of Oracle's analytical functions in Spark SQL? For example, if I have the following data set (say table T):
EID|DEPT
101|COMP
102|COMP
103|COMP
104|MARK
In Oracle, I can do something like: select EID, DEPT, count(1) over (partition by DEPT) CNT from T; to get:

[SPARK-SQL] Re-use col alias in the select clause to avoid sub query

2015-07-06 Thread Hao Ren
Hi, I want to re-use a column alias in the select clause to avoid a sub-query. For example: select check(key) as b, abs(b) as abs, value1, value2, ..., value30 from test The query above does not work, because b is not defined in test's schema. Instead, I should change the query to the

Application jar file not found exception when submitting application

2015-07-06 Thread bit1...@163.com
Hi, I have the following shell script that will submit the application to the cluster. But whenever I start the application, I encounter a FileNotFoundException; after retrying several times, I can successfully submit it! SPARK=/data/software/spark-1.3.1-bin-2.4.0

[SparkR] Float type coercion with hiveContext

2015-07-06 Thread Evgeny Sinelnikov
Hello, I've got a problem with float type coercion in SparkR with hiveContext. result <- sql(hiveContext, "SELECT offset, percentage from data limit 100") show(result) DataFrame[offset:float, percentage:float] head(result) Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot

Re: Application jar file not found exception when submitting application

2015-07-06 Thread Shixiong Zhu
Before running your script, could you confirm that /data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar exists? You might have forgotten to build this jar. Best Regards, Shixiong Zhu 2015-07-06 18:14 GMT+08:00 bit1...@163.com bit1...@163.com: Hi, I have the following

Spark's equivalent for Analytical functions in Oracle

2015-07-06 Thread Gireesh Puthumana
Hi there, I would like to check with you whether there are any equivalent functions to Oracle's analytical functions in Spark SQL. For example, if I have the following data set (table T):
EID|DEPT
101|COMP
102|COMP
103|COMP
104|MARK
In Oracle, I can do something like: select EID, DEPT,

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Denny Lee
I went ahead and tested your file and the results from the tests can be seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94. Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default} the query ran without any issues. I was able to recreate the issue with {Java

writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
I have a requirement to write to a Kafka queue from a Spark Streaming application. I am using Spark 1.2 streaming. Since different executors in Spark are allocated at each run, instantiating a new Kafka producer at each run seems a costly operation. Is there a way to reuse objects in processing

Re: How to shut down spark web UI?

2015-07-06 Thread Shixiong Zhu
You can set spark.ui.enabled to false to disable the Web UI. Best Regards, Shixiong Zhu 2015-07-06 17:05 GMT+08:00 luohui20...@sina.com: Hello there, I heard that there is some way to shutdown Spark WEB UI, is there a configuration to support this? Thank you.
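
A minimal sketch of setting this programmatically, assuming you build your own SparkConf (the same key can also go into spark-defaults.conf or be passed with --conf):

    import org.apache.spark.{SparkConf, SparkContext}

    // the UI flag must be set before the SparkContext is created
    val conf = new SparkConf()
      .setAppName("no-ui-app")            // hypothetical application name
      .set("spark.ui.enabled", "false")
    val sc = new SparkContext(conf)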

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
Thanks a lot Akhil On Mon, Jul 6, 2015 at 12:57 PM, sandeep vura sandeepv...@gmail.com wrote: It Works !!! On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura sandeepv...@gmail.com wrote: OK, let me try. On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
It Works !!! On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura sandeepv...@gmail.com wrote: OK, let me try. On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's complaining about a JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Tathagata Das
I have already opened a JIRA about this. https://issues.apache.org/jira/browse/SPARK-8743 On Mon, Jul 6, 2015 at 1:02 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I haven't been able to reproduce the error reliably, I will open a JIRA as soon as I can Greetings,

Re: java.lang.IllegalArgumentException: A metric named ... already exists

2015-07-06 Thread Juan Rodríguez Hortalá
Hi, I haven't been able to reproduce the error reliably; I will open a JIRA as soon as I can. Greetings, Juan 2015-06-23 21:57 GMT+02:00 Tathagata Das t...@databricks.com: Aaah, this could potentially be a major issue, as it may prevent metrics from a restarted streaming context from being published.

Spark-CSV: Multiple delimiters and Null fields support

2015-07-06 Thread Anas Sherwani
Hi all, Apparently, we can only specify a character delimiter for tokenizing data using Spark-CSV. But what if we have a log file with multiple delimiters or even a multi-character delimiter? e.g. (field1,field2:field3) with delimiters [,:] and (field1::field2::field3) with a single multi-character

Split RDD into two in a single pass

2015-07-06 Thread Anand Nalya
Hi, I have an RDD which I want to split into two disjoint RDDs with a boolean function. I can do this with the following: val rdd1 = rdd.filter(f) val rdd2 = rdd.filter(fnot) I'm assuming that each of the above statements will traverse the RDD once, thus resulting in 2 passes. Is there a way of
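
A minimal sketch of the usual workaround, assuming the RDD fits in cache: the data is still filtered twice, but the second filter reads from memory instead of recomputing the lineage. (Daniel Darabos's reply in this thread points to a true single-pass approach that writes the two groups out by key.)

    // cache once, then derive the two disjoint RDDs from the cached data
    val cached = rdd.cache()
    val rdd1 = cached.filter(f)             // elements satisfying the predicate
    val rdd2 = cached.filter(x => !f(x))    // the complement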

How to shut down spark web UI?

2015-07-06 Thread luohui20001
Hello there, I heard that there is some way to shut down the Spark web UI; is there a configuration to support this? Thank you. Thanks & Best regards! San.Luo

Re: Spark's equivalent for Analytical functions in Oracle

2015-07-06 Thread ayan guha
It's available in Spark 1.4 under DataFrame window operations. Apparently the programming doc does not mention it; you need to look at the APIs. On Mon, Jul 6, 2015 at 8:50 PM, Gireesh Puthumana gireesh.puthum...@augmentiq.in wrote: Hi there, I would like to check with you whether there are any
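
A minimal sketch of the 1.4 DataFrame window API, assuming a DataFrame df with the EID and DEPT columns from the question:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count, lit}

    // count(1) over (partition by DEPT), expressed with a WindowSpec
    val byDept = Window.partitionBy("DEPT")
    val withCnt = df.select(col("EID"), col("DEPT"), count(lit(1)).over(byDept).as("CNT"))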

kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
In Spark Streaming 1.2, are the offsets of consumed Kafka messages updated in ZooKeeper only after writing to the WAL (if WAL and checkpointing are enabled), or does it depend on the kafkaParams used while initializing the kafkaDstream? Map<String, String> kafkaParams = new HashMap<String, String>();

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
If you’re using WAL with Kafka, Spark Streaming will ignore this configuration(autocommit.enable) and explicitly call commitOffset to update offset to Kafka AFTER WAL is done. No matter what you’re setting with autocommit.enable, internally Spark Streaming will set it to false to turn off

Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL utilizing the Hive context is primarily about very fast queries. If you want to use real-time queries, I would utilize Spark Streaming. A couple of great resources on this topic include Guest Lecture on Spark Streaming in Stanford CME 323:

Re: [SparkR] Float type coercion with hiveContext

2015-07-06 Thread Evgeny Sinelnikov
I used the Spark 1.4.0 binaries from the official site: http://spark.apache.org/downloads.html and am running it on: * Hortonworks HDP 2.2.0.0-2041 * with Hive 0.14 * with disabled hooks for the Application Timeline Server (ATSHook) in hive-site.xml (commented hive.exec.failure.hooks, hive.exec.post.hooks,

How Will Spark Execute below Code - Driver and Executors

2015-07-06 Thread Ashish Soni
Hi All, If someone can help me understand which portion of the below code gets executed on the driver and which portion will be executed on the executors, it would be a great help. I have to load data from 10 tables and then use that data in various manipulations, and I am using Spark SQL

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
Please see the inline comments. From: Shushant Arora [mailto:shushantaror...@gmail.com] Sent: Monday, July 6, 2015 8:51 PM To: Shao, Saisai Cc: user Subject: Re: kafka offset commit in spark streaming 1.2 So if WAL is disabled, how can a developer commit offsets explicitly in a Spark Streaming app

RE: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shao, Saisai
If you disable WAL, Spark Streaming itself will not manage any offset-related things. If auto commit is enabled, Kafka itself will update offsets in a time-based way; if auto commit is disabled, nothing will call commitOffset and you need to call this API yourself. Also Kafka’s offset

Re: kafka offset commit in spark streaming 1.2

2015-07-06 Thread Shushant Arora
So if WAL is disabled, how can a developer commit offsets explicitly in a Spark Streaming app, since we don't write code which will be executed in the receiver? Plus, since offset commitment is asynchronous, is it possible that the last offset is not committed yet and the next stream batch has started on

Re: writing to kafka using spark streaming

2015-07-06 Thread Cody Koeninger
Use foreachPartition, and allocate whatever the costly resource is once per partition. On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora shushantaror...@gmail.com wrote: I have a requirement to write to a Kafka queue from a Spark Streaming application. I am using Spark 1.2 streaming. Since
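
A minimal sketch of that pattern, assuming a DStream[String] named dstream, the org.apache.kafka.clients producer, and hypothetical broker/topic names:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // one producer per partition instead of one per record
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")   // hypothetical broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(msg => producer.send(new ProducerRecord[String, String]("out-topic", msg)))
        producer.close()
      }
    }

A lazily initialized, per-JVM producer (or a small connection pool) can reduce even the per-partition setup cost further.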

Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Hafiz Mujadid
Hi all! what is the most efficient way to convert jdbcRDD to DataFrame. any example? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Converting-spark-JDBCRDD-to-DataFrame-tp23647.html Sent from the Apache Spark User List mailing list archive at

Re: Restarting Spark Streaming Application with new code

2015-07-06 Thread Cody Koeninger
You shouldn't rely on being able to restart from a checkpoint after changing code, regardless of whether the change was explicitly related to serialization. If you are relying on checkpoints to hold state, specifically which offsets have been processed, that state will be lost if you can't

Re: How Will Spark Execute below Code - Driver and Executors

2015-07-06 Thread ayan guha
Join happens on executor. Else spark would not be much of a distributed computing engine :) Reads happen on executor too. Your options are passed to executors and conn objects are created in executors. On 6 Jul 2015 22:58, Ashish Soni asoni.le...@gmail.com wrote: Hi All , If some one can help

Re: DESCRIBE FORMATTED doesn't work in Hive Thrift Server?

2015-07-06 Thread Ted Yu
What version of Hive and Spark are you using? Cheers On Sun, Jul 5, 2015 at 10:53 PM, Rex Xiong bycha...@gmail.com wrote: Hi, I try to use DESCRIBE FORMATTED for one table created in Spark, but it seems the results are all empty. I want to get metadata for the table; what are the other options? Thanks

User Defined Functions - Execution on Clusters

2015-07-06 Thread Eskilson,Aleksander
Hi there, I’m trying to get a feel for how User Defined Functions from SparkSQL (as written in Python and registered using the udf function from pyspark.sql.functions) are run behind the scenes. Trying to grok the source it seems that the native Python function is serialized for distribution

Re: How to recover in case user errors in streaming

2015-07-06 Thread Tathagata Das
1. onBatchError is not a bad idea. 2. It works for the Kafka Direct API and files as well; they all have batches. However, you will not get the number of records for the file stream. 3. Mind giving an example of the exception you would like to see caught? TD On Wed, Jul 1, 2015 at 10:35 AM,

Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread MorEru
I have a Spark standalone cluster with 2 workers - Master and one slave thread run on a single machine -- Machine 1 Another slave running on a separate machine -- Machine 2 I am running a spark shell in the 2nd machine that reads a file from HDFS, does some calculations on it, and stores the

Re: How to create empty RDD

2015-07-06 Thread Richard Marscher
This should work val output: RDD[(DetailInputRecord, VISummary)] = sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)]) On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I need to return an empty RDD of type val output: RDD[(DetailInputRecord, VISummary)] This

Spark application with a RESTful API

2015-07-06 Thread Sagi r
Hi, I've been researching Spark for a couple of months now, and I strongly believe it can solve our problem. We are developing an application that allows the user to analyze various sources of information. We are dealing with non-technical users, so simply giving them an interactive shell won't

How to create empty RDD

2015-07-06 Thread ๏̯͡๏
I need to return an empty RDD of type val output: RDD[(DetailInputRecord, VISummary)] This does not work: val output: RDD[(DetailInputRecord, VISummary)] = new RDD() as RDD is an abstract class. How do I create an empty RDD? -- Deepak

How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Sourav Mazumder
Hi, I have a DataFrame which I want to use for creating a RandomForest model using MLlib. The RandomForest model needs an RDD of LabeledPoints. Wondering how I convert the DataFrame to an RDD of LabeledPoint. Regards, Sourav
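
A minimal sketch of the direct conversion, assuming a hypothetical schema with a numeric "label" column and feature columns f1, f2, f3:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // map each Row to a LabeledPoint; the column names here are placeholders
    val labeledPoints = df.map { row =>
      LabeledPoint(
        row.getAs[Double]("label"),
        Vectors.dense(row.getAs[Double]("f1"), row.getAs[Double]("f2"), row.getAs[Double]("f3")))
    }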

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
You meant SPARK_REPL_OPTS? I did a quick search. Looks like it has been removed since 1.0. I think it did not affect the behavior of the shell. On Mon, Jul 6, 2015 at 9:04 AM, Simeon Simeonov s...@swoop.com wrote: Yin, that did the trick. I'm curious what was the effect of the environment

How do we control output part files created by Spark job?

2015-07-06 Thread kachau
Hi, I have a couple of Spark jobs which process thousands of files every day. File size may vary from MBs to GBs. After finishing a job I usually save using the following code: finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file

How to call hiveContext.sql() on all the Hive partitions in parallel?

2015-07-06 Thread kachau
Hi, I have to fire a few insert-into queries which use Hive partitions. I have two Hive partition columns named server and date. Now I execute the insert-into queries using hiveContext as shown below, and the query works fine: hiveContext.sql("insert into summary1 partition(server='a1',date='2015-05-22') select from

Re: Converting spark JDBCRDD to DataFrame

2015-07-06 Thread Michael Armbrust
Use the built in JDBC data source: https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Mon, Jul 6, 2015 at 6:42 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi all! what is the most efficient way to convert jdbcRDD to DataFrame. any example?
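
A minimal sketch of that data source, with hypothetical connection details:

    // returns a DataFrame backed by the JDBC table, no manual JdbcRDD needed
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb?user=db_user&password=db_pass")  // hypothetical URL
      .option("dbtable", "mytable")                                                  // a table name or "(subquery) alias"
      .option("driver", "com.mysql.jdbc.Driver")
      .load()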

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Yeah, creating a new producer at the granularity of partitions may not be that costly. On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger c...@koeninger.org wrote: Use foreachPartition, and allocate whatever the costly resource is once per partition. On Mon, Jul 6, 2015 at 6:11 AM, Shushant

Cluster sizing for recommendations

2015-07-06 Thread Danny Yates
Hi, I'm having trouble building a recommender and would appreciate a few pointers. I have 350,000,000 events which are stored in roughly 500,000 S3 files and are formatted as semi-structured JSON. These events are not all relevant to making recommendations. My code is (roughly): case class

Re: How do we control output part files created by Spark job?

2015-07-06 Thread Sathish Kumaran Vairavelu
Try the coalesce function to limit the number of part files. On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote: Hi, I have a couple of Spark jobs which process thousands of files every day. File size may vary from MBs to GBs. After finishing a job I usually save using the following code
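
A minimal sketch of that, assuming the final DataFrame is named df and 16 output files is the target:

    // coalesce shrinks the number of partitions (and hence part files) without a full shuffle
    df.coalesce(16).write.format("orc").save("/path/in/hdfs")

repartition(16) achieves the same file count with a full shuffle, which keeps the write work evenly balanced at the cost of moving data.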

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
What's the difference between foreachPartition and mapPartitions for a DStream? Both work at partition granularity. One is a transformation and the other is an action, but if I call an action afterwards on mapPartitions as well, which one is more efficient and recommended? On Tue, Jul 7, 2015 at 12:21 AM,

Re: writing to kafka using spark streaming

2015-07-06 Thread Tathagata Das
Both have same efficiency. The primary difference is that one is a transformation (hence is lazy, and requires another action to actually execute), and the other is an action. But it may be a slightly better design in general to have transformations be purely functional (that is, no external side

Master doesn't start, no logs

2015-07-06 Thread maxdml
Hi, I've been compiling Spark 1.4.0 with SBT, from the source tarball available on the official website. I cannot run Spark's master, even though I have built and run several other instances of Spark on the same machine (Spark 1.3, master branch, pre-built 1.4, ...) starting

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
Yes, RDD of batch t+1 will be processed only after RDD of batch t has been processed. Unless there are errors where the batch completely fails to get processed, in which case the point is moot. Just reinforcing the concept further. Additional information: This is true in the default configuration.

Job consistently failing after leftOuterJoin() - oddly sized / non-uniform partitions

2015-07-06 Thread Mohammed Omer
Afternoon all, Really loving this project and the community behind it. Thank you all for your hard work. This past week, though, I've been having a hard time getting my first deployed job to run without failing at the same point every time: Right after a leftOuterJoin, most partitions (600

Re: Random Forest in MLLib

2015-07-06 Thread Feynman Liang
Not yet, though work on this feature has begun (SPARK-5133 https://issues.apache.org/jira/browse/SPARK-5133) On Mon, Jul 6, 2015 at 4:46 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, Is there a way to get variable importance for RandomForest model created using MLLib ? This way

Re: User Defined Functions - Execution on Clusters

2015-07-06 Thread Davies Liu
Currently, Python UDFs run in Python instances and are MUCH slower than Scala ones (from 10 to 100x). There is a JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632. After that, they will still be much slower than Scala ones (because Python is slower and the overhead for

How does executor cores change the spark job behavior ?

2015-07-06 Thread ๏̯͡๏
I have a simple job that reads data => union => filter => map and then count. 1 job started 2402 tasks and read 149G of input. I started the job with different numbers of executors: 1) 1 -- 8.3 mins 2) 2 -- 5.6 mins 3) 3 -- 3.1 mins 1) Why is increasing the cores speeding up this app? 2) I started

Random Forest in MLLib

2015-07-06 Thread Sourav Mazumder
Hi, Is there a way to get variable importance for RandomForest model created using MLLib ? This way one can know among multiple features which are the one contributing the most to the dependent variable. Regards, Sourav

JVM is not ready after 10 seconds

2015-07-06 Thread ashishdutt
Hi, I am trying to connect a worker to the master. The Spark master is on Cloudera Manager and I know the master IP address and port number. I downloaded the Spark binary for CDH4 on the worker machine, and then when I try to invoke the command sc = sparkR.init(master="ip address:port number") I

Re: Job consistently failing after leftOuterJoin() - oddly sized / non-uniform partitions

2015-07-06 Thread ayan guha
You can bump up the number of partitions via a parameter on the join operator. However, you have a data-skew problem which you need to resolve using a reasonable partitioning function. On 7 Jul 2015 08:57, Mohammed Omer beancinemat...@gmail.com wrote: Afternoon all, Really loving this project and the
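
A minimal sketch of the first suggestion, assuming two pair RDDs named left and right:

    // leftOuterJoin accepts an explicit partition count (or a custom Partitioner)
    val joined = left.leftOuterJoin(right, 2000)

Note that a single hot key still hashes to one partition, which is why the skew itself needs a different partitioning strategy, as the reply says.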

RE: How to create a LabeledPoint RDD from a Data Frame

2015-07-06 Thread Mohammed Guller
Have you looked at the new Spark ML library? You can use a DataFrame directly with the Spark ML API. https://spark.apache.org/docs/latest/ml-guide.html Mohammed From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com] Sent: Monday, July 6, 2015 10:29 AM To: user Subject: How to create a

RE: How do we control output part files created by Spark job?

2015-07-06 Thread Mohammed Guller
You could repartition the DataFrame before saving it. However, that would impact the parallelism of the next jobs that read these files from HDFS. Mohammed -Original Message- From: kachau [mailto:umesh.ka...@gmail.com] Sent: Monday, July 6, 2015 10:23 AM To: user@spark.apache.org

Reply: Re: How to shut down spark web UI?

2015-07-06 Thread luohui20001
Got it, thanks. Thanks & Best regards! San.Luo - Original Message - From: Shixiong Zhu zsxw...@gmail.com To: 罗辉 luohui20...@sina.com CC: user user@spark.apache.org Subject: Re: How to shut down spark web UI? Date: 2015-07-06 17:31 You can set spark.ui.enabled to false

Re: Spark standalone cluster - Output file stored in temporary directory in worker

2015-07-06 Thread maxdml
Can you share your Hadoop configuration files please? - etc/hadoop/core-site.xml - etc/hadoop/hdfs-site.xml - etc/hadoop/hadoop-env.sh AFAIK, the following properties should be configured: hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir Otherwise, an

JVM is not ready after 10 seconds.

2015-07-06 Thread Ashish Dutt
Hi, I am trying to connect a worker to the master. The Spark master is on Cloudera Manager and I know the master IP address and port number. I downloaded the Spark binary for CDH4 on the worker machine, and then when I try to invoke the command sc = sparkR.init(master="ip address:port number") I

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Ashish Dutt
Hello Shivaram, Thank you for your response. Being a novice at this stage, can you also tell me how to configure or set the execute permission for the spark-submit file? Thank you for your time. Sincerely, Ashish Dutt On Tue, Jul 7, 2015 at 9:21 AM, Shivaram Venkataraman

Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
Great! That's what I gathered from the thread titled "Serial batching with Spark Streaming", but thanks for confirming this again. On 6 July 2015 at 15:31, Tathagata Das t...@databricks.com wrote: Yes, RDD of batch t+1 will be processed only after RDD of batch t has been processed. Unless there

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-06 Thread Gylfi
Hi. Just a few quick comments on your question. If you drill in (click the link of the subtasks) you can get a more detailed view of the tasks. One of the things reported is the time for serialization. If that is your dominant factor it should be reflected there, right? Are you sure the

Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Dean Wampler
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler

RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian, It depends on a number of factors. How much data are you querying? Where is the data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)? In theory, it is possible to use Spark SQL for real-time queries, but cost increases as the data size grows. If you can store all

How to debug java.io.OptionalDataException issues

2015-07-06 Thread Yana Kadiyska
Hi folks, suffering from a pretty strange issue: Is there a way to tell what object is being successfully serialized/deserialized? I have a maven-installed jar that works well when fat jarred within another, but shows the following stack when marked as provided and copied to the runtime

Spark Unit tests - RDDBlockId not found

2015-07-06 Thread Malte
I am running unit tests on Spark 1.3.1 with sbt test and besides the unit tests being incredibly slow I keep running into java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId issues. Usually this means a dependency issue, but I wouldn't know from where... Any help is greatly

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Shivaram Venkataraman
When I've seen this error before it has been due to the spark-submit file (i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute permissions. You can try to set execute permission and see if it fixes things. Also we have a PR open to fix a related problem at

RE: Spark application with a RESTful API

2015-07-06 Thread Mohammed Guller
It is not a bad idea. Many people use this approach. Mohammed -Original Message- From: Sagi r [mailto:stsa...@gmail.com] Sent: Monday, July 6, 2015 1:58 PM To: user@spark.apache.org Subject: Spark application with a RESTful API Hi, I've been researching spark for a couple of months

Re: JVM is not ready after 10 seconds

2015-07-06 Thread Ashish Dutt
Hi, These are the settings in my spark conf file on the worker machine from where I am trying to access the Spark server. I think I need to first configure the spark-submit file too, but I do not know how. Can somebody advise me? # Default system properties included when running

Re: How do we control output part files created by Spark job?

2015-07-06 Thread Gylfi
Hi. Have you tried to repartition the finalRDD before saving? This link might help. http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html Regards, Gylfi. -- View this message in context:

Re: how to black list nodes on the cluster

2015-07-06 Thread Gylfi
Hi. Have you tried to enable speculative execution? This will allow Spark to run the same sub-task of the job on other available slots when slow tasks are encountered. This can be passed at execution time; the params are: spark.speculation spark.speculation.interval
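
A minimal sketch of turning it on through SparkConf (the values shown are illustrative, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")              // re-launch slow-running tasks on other slots
      .set("spark.speculation.interval", "100")      // how often (ms) to check for stragglers
      .set("spark.speculation.multiplier", "1.5")    // how much slower than the median counts as slow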

The auxService:spark_shuffle does not exist

2015-07-06 Thread roy
I am getting the following error for a simple Spark job. I am running the following command: spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar but the job doesn't show any

Re: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException in spark with mysql database

2015-07-06 Thread Sathish Kumaran Vairavelu
Try including an alias in the query: val query = "(select * from " + table + ") a" On Mon, Jul 6, 2015 at 3:38 AM Hafiz Mujadid hafizmujadi...@gmail.com wrote: Hi! I am trying to load data from my MySQL database using the following code: val query = "select * from " + table val url = "jdbc:mysql://" +

Re: Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Denny Lee
Hey Dean, Sure, will take care of this. HTH, Denny On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote: Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: How to recover in case user errors in streaming

2015-07-06 Thread Li,ZhiChao
Hi Cody and TD, Just trying to understand this under the hood, but I cannot find any place for this specific logic: once you reach max failures the whole stream will stop. If possible, could you point me in the right direction? To my understanding, the exception thrown from the job would

Re: writing to kafka using spark streaming

2015-07-06 Thread Shushant Arora
On using foreachPartition, the jobs that get created are not displayed on the driver console but are visible on the web UI. On the driver it prints some stage statistics of the form [Stage 2: (0 + 2) / 5] which then disappear. I am using foreachPartition as:

Re: How to create empty RDD

2015-07-06 Thread Wei Zhou
I used val output: RDD[(DetailInputRecord, VISummary)] = sc.emptyRDD[(DetailInputRecord, VISummary)] to create an empty RDD before. Give it a try; it might work for you too. 2015-07-06 14:11 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: I need to return an empty RDD of type val output:

Unable to start spark-sql

2015-07-06 Thread sandeep vura
Hi Sparkers, I am unable to start the spark-sql service; please check the error mentioned below. Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at

Re: Spark custom streaming receiver not storing data reliably?

2015-07-06 Thread Ajit Bhingarkar
Jorn, Thanks for your response. I am pasting below a snippet of code which shows the Drools integration; when facts/events are picked up after reading through a file (FileReader-readLine()), it works as expected and I have tested it for a wide range of record data in a file. The same code doesn't work

Re: Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-06 Thread Akhil Das
If you don't want those logs to flood your screen, you can disable them simply with: import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) Thanks Best Regards On Sun, Jul 5, 2015 at 7:27 PM, Hellen

Re: cores and resource management

2015-07-06 Thread Akhil Das
Try with spark.cores.max; executor cores is usually used when you run in YARN mode. Thanks Best Regards On Mon, Jul 6, 2015 at 1:22 AM, nizang ni...@windward.eu wrote: hi, We're running Spark 1.4.0 on EC2, with 6 machines, 4 cores each. We're trying to run an application on a number of

Re: How to use caching in Spark Actions or Output operations?

2015-07-06 Thread Himanshu Mehra
Hi Sudarshan, As far as I understand your problem, you should take a look at broadcast variables in Spark. Here you have the docs: https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables Thanks, Himanshu -- View this message in context:
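
A minimal sketch of the broadcast pattern, assuming a driver-side lookup map and an RDD[String] named rdd:

    // ship the read-only map once per executor instead of once per task closure
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val resolved = rdd.map(key => lookup.value.getOrElse(key, 0))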

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
While the job is running, just look in the directory and see what the root cause of it is (is it the logs? is it the shuffle? etc.). Here are a few configuration options which you can try: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation:

Re: java.io.IOException: No space left on device--regd.

2015-07-06 Thread Akhil Das
You can also set these in the spark-env.sh file : export SPARK_WORKER_DIR=/mnt/spark/ export SPARK_LOCAL_DIR=/mnt/spark/ Thanks Best Regards On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote: While the job is running, just look in the directory and see whats the

How does Spark streaming move data around ?

2015-07-06 Thread Sela, Amit
I know that Spark is using data parallelism over, say, HDFS - optimally running computations on local data (aka data locality). I was wondering how Spark streaming moves data (messages) around? since the data is streamed in as DStreams and is not on a distributed FS like HDFS. Thanks!

Re: Unable to start spark-sql

2015-07-06 Thread Akhil Das
It's complaining about a JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura sandeepv...@gmail.com wrote: Hi Sparkers, I am

Re: Unable to start spark-sql

2015-07-06 Thread sandeep vura
OK, let me try. On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It's complaining about a JDBC driver. Add it to your driver classpath like: ./bin/spark-sql --driver-class-path /home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar Thanks Best Regards

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data science). Spark has a pretty large overhead per iteration; more optimisations and planning only make this worse. Sure, people have implemented things like Dijkstra's algorithm in Spark (a problem

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote: I would guess the opposite is true for highly iterative benchmarks (common in graph processing

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Yin Huai
Hi Sim, I think the right way to set the PermGen Size is through driver extra JVM options, i.e. --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m Can you try it? Without this conf, your driver's PermGen size is still 128m. Thanks, Yin On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee

Re: Streaming: updating broadcast variables

2015-07-06 Thread Conor Fennell
Hi James, The code below shows one way you can update the broadcast variable on the executors: // ... events stream setup var startTime = new Date().getTime() var hashMap = HashMap(1 -> (1, 1), 2 -> (2, 2)) var hashMapBroadcast =

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-06 Thread Simeon Simeonov
Yin, that did the trick. I'm curious what was the effect of the environment variable, however, as the behavior of the shell changed from hanging to quitting when the env var value got to 1g. /Sim Simeon Simeonov, Founder CTO, Swoop http://swoop.com/ @simeons http://twitter.com/simeons |