Re: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Okehee Goh
I will. That will be great if a simple UDF can return a complex type. Thanks! On Fri, Jun 5, 2015 at 12:17 AM, Cheng, Hao hao.ch...@intel.com wrote: Confirmed, with the latest master, we don't support complex data types for Simple Hive UDFs; do you mind filing an issue in JIRA? -Original

Re: SparkSQL DF.explode with Nulls

2015-06-05 Thread Tom Seddon
I figured it out. Needed a block-style map and a check for null. The case class is just to name the transformed columns. case class Component(name: String, loadTimeMs: Long) avroFile.filter($"lazyComponents.components".isNotNull) .explode($"lazyComponents.components") { case
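For reference, a minimal sketch of the pattern described above: filter out null arrays first, then explode into named columns through a case class. Column names, the `avroFile` DataFrame, and field types are assumptions taken from the truncated message, not confirmed code.
```
import org.apache.spark.sql.Row
import sqlContext.implicits._   // provides the $"column" syntax

case class Component(name: String, loadTimeMs: Long)

// `avroFile` is assumed to be a DataFrame whose nested array column
// `lazyComponents.components` holds structs with `name` and `loadTimeMs`.
val exploded = avroFile
  .filter($"lazyComponents.components".isNotNull)
  .explode($"lazyComponents.components") {
    case Row(components: Seq[Row @unchecked]) =>
      components.map(c => Component(c.getAs[String]("name"), c.getAs[Long]("loadTimeMs")))
  }
```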

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
Please file a bug here: https://issues.apache.org/jira/browse/SPARK/ Could you also provide a way to reproduce this bug (including some datasets)? On Thu, Jun 4, 2015 at 11:30 PM, Sam Stoelinga sammiest...@gmail.com wrote: I've changed the SIFT feature extraction to SURF feature extraction and

Error when job submitting to Rest URL of master

2015-06-05 Thread pavan kumar Kolamuri
Hi, I am using Spark 1.3.1 and I am trying to submit a Spark job to the REST URL of Spark (spark://host-name:6066) using a standalone Spark cluster with deploy mode set to cluster. Both the driver and the application are getting submitted; after doing their work (output created), both end up with killed status.

Re: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Okehee Goh
It is Spark 1.3.1.e (it is AWS release .. I think it is close to Spark 1.3.1 with some bug fixes). My report about GenericUDF not working in SparkSQL is wrong. I tested with open-source GenericUDF and it worked fine. Just my GenericUDF which returns Map type didn't work. Sorry about false

Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Akhil Das
You could try adding the configuration in the spark-defaults.conf file. And once you run the application you can actually check on the driver UI (runs on 4040) Environment tab to see if the configuration is set properly. Thanks Best Regards On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel

How to increase the number of tasks

2015-06-05 Thread ๏̯͡๏
I have a stage that spawns 174 tasks when I run repartition on Avro data. Tasks read 512/317/316/214/173 MB of data. Even if I increase the number of executors / the number of partitions (when calling repartition), the number of tasks launched remains fixed at 174. 1) I want to speed up this

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
I've changed the SIFT feature extraction to SURF feature extraction and it works... Following line was changed: sift = cv2.xfeatures2d.SIFT_create() to sift = cv2.xfeatures2d.SURF_create() Where should I file this as a bug? When not running on Spark it works fine so I'm saying it's a spark

RE: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Cheng, Hao
Confirmed, with the latest master, we don't support complex data types for Simple Hive UDFs; do you mind filing an issue in JIRA? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, June 5, 2015 12:35 PM To: ogoh; user@spark.apache.org Subject: RE: SparkSQL : using

Re: FetchFailed Exception

2015-06-05 Thread patcharee
Hi, I had this problem before, and in my case it was because the executor/container was killed by YARN when it used more memory than allocated. You can check if your case is the same by checking the YARN node manager log. Best, Patcharee On 05. juni 2015 07:25, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I see this

Access several s3 buckets, with credentials containing /

2015-06-05 Thread Pierre B
Hi list! My problem is quite simple. I need to access several S3 buckets, using different credentials: ``` val c1 = sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket/file1.csv").count val c2 = sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket/file1.csv").count val c3 =

Access several s3 buckets, with credentials containing /

2015-06-05 Thread Pierre B
Hi list! My problem is quite simple. I need to access several S3 buckets, using different credentials: ``` val c1 = sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count val c2 = sc.textFile("s3n://[ACCESS_KEY_ID2:SECRET_ACCESS_KEY2]@bucket2/file.csv").count val c3 =
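For reference, a hedged sketch of one common workaround (not taken from the thread): set the s3n credentials through the Hadoop configuration instead of embedding them, and their '/' characters, in the URI. The configuration is shared, so reads with different credentials have to run sequentially; key values and bucket names below are placeholders.
```
val hadoopConf = sc.hadoopConfiguration

hadoopConf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_ID1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "SECRET_ACCESS_KEY1")
val c1 = sc.textFile("s3n://bucket1/file.csv").count

hadoopConf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY_ID2")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "SECRET_ACCESS_KEY2")
val c2 = sc.textFile("s3n://bucket2/file.csv").count
```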

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Yea should have emphasized that. I'm running the same code on the same VM. It's a VM with spark in standalone mode and I run the unit test directly on that same VM. So OpenCV is working correctly on that same machine but when moving the exact same OpenCV code to spark it just crashes. On Tue, Jun

Avro or Parquet ?

2015-06-05 Thread ๏̯͡๏
We currently have data in Avro format and we do joins between Avro and sequence file data. Will storing these datasets in Parquet make joins any faster? The dataset sizes are between 500 and 1000 GB. -- Deepak

Articles related with how spark handles spark components(Driver,Worker,Executor, Task) failure

2015-06-05 Thread bit1...@163.com
Hi, I am looking for some articles/blogs on how Spark handles various failures, such as Driver, Worker, Executor, Task, etc. Are there some articles/blogs on this topic? Details going into the source code would be best. Thanks very much! bit1...@163.com

Re: How to increase the number of tasks

2015-06-05 Thread 李铖
Did you change the value of 'spark.default.parallelism'? Try a bigger number. 2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com: It may be that your system runs out of resources (i.e. 174 is the ceiling) due to the following 1. RDD Partition = (Spark) Task 2.

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would Tachyon be appropriate here? On Friday, June 5, 2015, Evo Eftimov evo.efti...@isecc.com wrote: Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark Batch Jobs (besides, anyone can put something like that together in 5 min), while I am under the impression that Dmitry is

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark Batch Jobs (besides, anyone can put something like that together in 5 min), while I am under the impression that Dmitry is working on a Spark Streaming app. Besides, the Job Server is essentially for sharing the Spark Context

Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
Another option is to merge the part files after your app ends. On 5 Jun 2015 20:37, Akhil Das ak...@sigmoidanalytics.com wrote: you can simply do rdd.repartition(1).saveAsTextFile(...); it might not be efficient if your output data is huge since one task will be doing the whole writing. Thanks Best

Re: How to increase the number of tasks

2015-06-05 Thread ๏̯͡๏
I did not change spark.default.parallelism. What is the recommended value for it? On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote: Did you change the value of 'spark.default.parallelism'? Try a bigger number. 2015-06-05 17:56 GMT+08:00 Evo Eftimov evo.efti...@isecc.com: It

Re: Access several s3 buckets, with credentials containing /

2015-06-05 Thread Steve Loughran
On 5 Jun 2015, at 08:03, Pierre B pierre.borckm...@realimpactanalytics.com wrote: Hi list! My problem is quite simple. I need to access several S3 buckets, using different credentials: ``` val c1 = sc.textFile("s3n://[ACCESS_KEY_ID1:SECRET_ACCESS_KEY1]@bucket1/file.csv").count val c2

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Steve Loughran
On 2 Jun 2015, at 00:14, Dean Wampler deanwamp...@gmail.commailto:deanwamp...@gmail.com wrote: It would be nice to see the code for MapR FS Java API, but my google foo failed me (assuming it's open source)... I know that MapRFS is closed source, don't know about the java JAR. Why not ask

Re: How to increase the number of tasks

2015-06-05 Thread 李铖
Just multiply 2-4 by the number of CPU cores on the node. 2015-06-05 18:04 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: I did not change spark.default.parallelism. What is the recommended value for it? On Fri, Jun 5, 2015 at 3:31 PM, 李铖 lidali...@gmail.com wrote: Did you change the
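For reference, a hedged sketch of the two knobs discussed in this thread: the spark.default.parallelism setting, and an explicit partition count passed to repartition (which wins over the default for that operation). Core counts, paths, and the partition count are placeholders.
```
import org.apache.spark.{SparkConf, SparkContext}

val totalCores = 48  // placeholder: total cores available across executors

// Default used by transformations that do not take an explicit partition count.
val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", (totalCores * 3).toString)  // 2-4 x total cores

val sc = new SparkContext(conf)

// An explicit count passed to repartition overrides the default for this operation.
val data = sc.textFile("hdfs:///path/to/avro-data")  // placeholder input
val repartitioned = data.repartition(400)
```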

Saving calculation to single local file

2015-06-05 Thread marcos rebelo
Hi all, I'm running Spark on a single local machine, no Hadoop, just reading and writing on local disk. I need to have a single file as output of my calculation. If I do rdd.saveAsTextFile(...) all runs OK but I get a lot of files. Since I need a single file I was considering doing something

Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
Just repartition to 1 partition before writing. On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote: Hi all I'm running spark in a single local machine, no hadoop, just reading and writing in local disk. I need to have a single file as output of my calculation. if I do

Re: Saving calculation to single local file

2015-06-05 Thread Akhil Das
you can simply do rdd.repartition(1).saveAsTextFile(...), it might not be efficient if your output data is huge since one task will be doing the whole writing. Thanks Best Regards On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote: Hi all I'm running spark in a single local
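For reference, a minimal sketch of the approach suggested above: collapse to a single partition so a single part file is written. All output then flows through one task, so this only makes sense for modest result sizes. Paths and the toy calculation are placeholders.
```
val result = sc.textFile("file:///tmp/input")   // placeholder input
  .map(_.toUpperCase)                            // stand-in for the real calculation

result
  .coalesce(1)                                   // or .repartition(1); coalesce avoids a full shuffle
  .saveAsTextFile("file:///tmp/output-single")
// The single output ends up as file:///tmp/output-single/part-00000
```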

Re: Spark Job always cause a node to reboot

2015-06-05 Thread Steve Loughran
On 4 Jun 2015, at 15:59, Chao Chen kandy...@gmail.com wrote: But when I try to run the Pagerank from HiBench, it always cause a node to reboot during the middle of the work for all scala, java, and python versions. But works fine with the MapReduce version from the same benchmark. do

RE: redshift spark

2015-06-05 Thread Ewan Leith
That project is for reading data in from Redshift table exports stored in s3 by running commands in redshift like this: unload ('select * from venue') to 's3://mybucket/tickit/unload/' http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html The path in the parameters below is

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
And RDD.lookup() cannot be invoked from transformations, e.g. maps. lookup() is an action which can be invoked only from the driver – if you want functionality like that from within transformations executed on the cluster nodes, try IndexedRDD. Other options are to load a Batch / Static RDD
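For reference, a hedged sketch of the broadcast-variable option also raised in this thread: ship a read-only dictionary to every executor once and look it up inside transformations (since RDD.lookup() is an action and cannot be used there). Paths, the lookup key, and the dictionary format are placeholders.
```
// Build the dictionary on the driver, then broadcast it to all executors.
val dictionary: Map[String, String] =
  sc.textFile("hdfs:///path/to/dictionary.tsv")   // placeholder path
    .map(_.split("\t"))
    .map(a => (a(0), a(1)))
    .collect()
    .toMap

val dictBroadcast = sc.broadcast(dictionary)

val inputRdd = sc.textFile("hdfs:///path/to/events")  // placeholder input
val enriched = inputRdd.map { line =>
  (line, dictBroadcast.value.getOrElse(line.take(8), "unknown"))  // toy lookup key
}
```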

Cassandra Submit

2015-06-05 Thread Yasemin Kaya
Hi, I am using cassandraDB in my project. I had that error *Exception in thread main java.io.IOException: Failed to open native connection to Cassandra at {127.0.1.1}:9042* I think I have to modify the submit line. What should I add or remove when I submit my project? Best, yasemin -- hiç

Re: Override Logging with spark-streaming

2015-06-05 Thread Alexander Krasheninnikov
Have you tried putting this file on local disk on each of executor nodes? That worked for me. On 05.06.2015 16:56, nib...@free.fr wrote: Hello, I want to override the log4j configuration when I start my spark job. I tried : .../bin/spark-submit --class --conf
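For reference, a hedged sketch of one commonly used way to override log4j at submit time (not necessarily what was tried above): ship the file with --files and point the driver and executor JVMs at it through the extraJavaOptions settings. Paths, the main class, and the jar name are placeholders.
```
./bin/spark-submit \
  --class com.example.MyJob \
  --files /local/path/log4j.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
  my-job.jar
```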

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Charles Earl
Would the IndexedRDD feature provide what the Lookup RDD does? I've been using a broadcast variable map for a similar kind of thing -- it probably is within 1GB, but I am interested to know if the lookup (or indexed) approach might be better. C On Friday, June 5, 2015, Dmitry Goldenberg dgoldenberg...@gmail.com

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
It is called Indexed RDD https://github.com/amplab/spark-indexedrdd From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] Sent: Friday, June 5, 2015 3:15 PM To: Evo Eftimov Cc: Yiannis Gkoufas; Olivier Girardot; user@spark.apache.org Subject: Re: How to share large resources like

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Thanks Davies. I will file a bug later with code and single image as dataset. Next to that I can give anybody access to my vagrant VM that already has spark with OpenCV and the dataset available. Or you can setup the same vagrant machine at your place. All is automated ^^ git clone

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I can't seem to locate it exactly on Github. (Yes, to your point, our project is Spark streaming based). Thank you. On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov evo.efti...@isecc.com wrote: Oops, @Yiannis, sorry to be a

redshift spark

2015-06-05 Thread Hafiz Mujadid
Hi All, I want to read and write data to AWS Redshift. I found the spark-redshift project at the following address: https://github.com/databricks/spark-redshift In its documentation the following code is written: import com.databricks.spark.redshift.RedshiftInputFormat val records =

Spark SQL and Streaming Results

2015-06-05 Thread Pietro Gentile
Hi all, what is the best way to perform Spark SQL queries and obtain the result tuples in a streaming way? In particular, I want to aggregate data and obtain the first, incomplete results quickly, but they should be updated until the aggregation is completed. Best Regards.

Saving compressed textFiles from a DStream in Scala

2015-06-05 Thread doki_pen
It looks like saveAsTextFiles doesn't support the compression parameter of RDD.saveAsTextFile. Is there a way to add the functionality in my client code without patching Spark? I tried making my own saveFunc function and calling DStream.foreachRDD but ran into trouble with invoking rddToFileName
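For reference, a hedged sketch of one way around this in client code: drop down to foreachRDD and use the RDD-level saveAsTextFile overload that accepts a compression codec, building the per-batch file name from the batch time by hand (rddToFileName is not part of the public API). The `stream` DStream, prefix, and suffix are placeholders.
```
import org.apache.hadoop.io.compress.GzipCodec

val prefix = "hdfs:///output/myStream"   // placeholder output prefix
val suffix = "txt"

// `stream` is assumed to be a DStream[String].
stream.foreachRDD { (rdd, time) =>
  if (!rdd.partitions.isEmpty) {
    rdd.saveAsTextFile(s"$prefix-${time.milliseconds}.$suffix", classOf[GzipCodec])
  }
}
```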

Re: Pregel runs slower and slower when each Pregel has data dependency

2015-06-05 Thread dash
Hi Heather, Please check this issue https://issues.apache.org/jira/browse/SPARK-4672. I think you can solve this problem by checkpointing your data every several iterations. Hope that helps. Best regards, Baoxu(Dash) Shi Computer Science and Engineering Department University of Notre Dame
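For reference, a hedged, generic sketch of the periodic-checkpoint pattern suggested above for iterative jobs whose lineage keeps growing (see SPARK-4672). The interval, paths, and the toy iteration step are placeholders; with GraphX you would checkpoint the graph's RDDs in the same spirit.
```
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var state = sc.parallelize(1 to 1000000).map(_.toDouble)  // toy stand-in for the rank/graph state
val checkpointInterval = 10
val numIterations = 100

for (i <- 1 to numIterations) {
  state = state.map(_ * 0.85 + 0.15)     // placeholder for one iteration of the real algorithm
  if (i % checkpointInterval == 0) {
    state.checkpoint()                   // truncates the ever-growing lineage
    state.count()                        // force materialization so the checkpoint is written
  }
}
```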

SparkContext Threading

2015-06-05 Thread Lee McFadden
Hi all, I'm having some issues finding any kind of best practices when attempting to create Spark applications which launch jobs from a thread pool. Initially I had issues passing the SparkContext to other threads as it is not serializable. Eventually I found that adding the @transient

Re: SparkContext Threading

2015-06-05 Thread Marcelo Vanzin
On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden splee...@gmail.com wrote: Initially I had issues passing the SparkContext to other threads as it is not serializable. Eventually I found that adding the @transient annotation prevents a NotSerializableException. This is really puzzling. How are

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
You can see an example of the constructor for the class which executes a job in my opening post. I'm attempting to instantiate and run the class using the code below: ``` val conf = new SparkConf() .setAppName(appNameBase.format(Test)) val connector = CassandraConnector(conf)
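For reference, a hedged sketch of the general pattern discussed in this thread: create a single SparkContext on the driver and share it across a thread pool that submits independent jobs, rather than serializing the context into tasks. Class names, pool size, and the toy jobs are invented for illustration.
```
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.{SparkConf, SparkContext}

object ThreadedJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("threaded-jobs"))
    val pool = Executors.newFixedThreadPool(4)

    (1 to 4).foreach { i =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          // Each thread runs its own action; Spark schedules the jobs concurrently.
          val n = sc.parallelize(1 to 1000000).filter(_ % i == 0).count()
          println(s"job $i counted $n")
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)
    sc.stop()
  }
}
```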

Removing Keys from a MapType

2015-06-05 Thread chrish2312
Hello! I am working with a column of Maps in DataFrames, and I was wondering if there was a good way of removing a set of keys and their associated values from that column. I've been using a UDF, but if there is some built-in function that I'm missing, I'd love to know. Thanks! -- View this

Re: SparkContext Threading

2015-06-05 Thread Igor Berman
+1 to the question about serialization. The SparkContext is still in the driver process (even if it has several threads from which you submit jobs). As for the problem, check your classpath, Scala version, Spark version, etc.; such errors usually happen when there is some conflict in the classpath. Maybe you

Re: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
I am using Spark 1.3.1, so I don't have the 1.4.0 isEmpty. I guess I am curious about the right approach here. Like I said in my original post, perhaps this isn't bad, but the exceptions bother me at a programmer level... is that wrong? :) On Fri, Jun 5, 2015 at 11:07 AM, Ted Yu

Re: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Ted Yu
John: Which Spark release are you using ? As of 1.4.0, RDD has this method: def isEmpty(): Boolean = withScope { FYI On Fri, Jun 5, 2015 at 9:01 AM, Evo Eftimov evo.efti...@isecc.com wrote: Foreachpartition callback is provided with Iterator by the Spark Frameowrk – while
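For reference, a hedged sketch of the pre-1.4 substitute for RDD.isEmpty: take(1) only touches as many partitions as needed to find one element. The original question is PySpark, but the same idea applies; `stream` is an assumed DStream and the write step is a placeholder.
```
stream.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {
    rdd.foreachPartition { records =>
      records.foreach(record => println(record))  // placeholder for the real sink
    }
  }
}
```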

Re: Shuffle strange error

2015-06-05 Thread octavian.ganea
Solved, it is SPARK_PID_DIR from spark-env.sh. Changing this directory from /tmp to something different actually changed the error that I got, now showing where the actual error was coming from (a null pointer in my program). The first error was not helpful at all though. -- View this message in

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Sam Stoelinga
Please ignore this whole thread. It's working out of nowhere. I'm not sure what was the root cause. After I restarted the VM the previous SIFT code also started working. On Fri, Jun 5, 2015 at 10:40 PM, Sam Stoelinga sammiest...@gmail.com wrote: Thanks Davies. I will file a bug later with code

Shuffle strange error

2015-06-05 Thread octavian.ganea
Hi all, I'm using Spark 1.3.1 and ran the following code: sc.textFile(path) .map(line => (getEntId(line), line)) .persist(StorageLevel.MEMORY_AND_DISK) .groupByKey .flatMap(x => func(x)) .reduceByKey((a,b) => (a + b).toShort) I get the following error in

RE: Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread Evo Eftimov
The foreachPartition callback is provided with an Iterator by the Spark Framework – while iterator.hasNext() …… Also check whether this is not some sort of Python Spark API bug – Python seems to be the foster child here – Scala and Java are the darlings From: John Omernik

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
It may be that your system runs out of resources (ie 174 is the ceiling) due to the following 1. RDD Partition = (Spark) Task 2. RDD Partition != (Spark) Executor 3. (Spark) Task != (Spark) Executor 4. (Spark) Task = JVM Thread 5. (Spark) Executor = JVM

RE: How to increase the number of tasks

2015-06-05 Thread Evo Eftimov
The param is for “Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when NOT set by user.” While Deepak is setting the number of partitions EXPLICITLY From: 李铖 [mailto:lidali...@gmail.com] Sent: Friday, June 5, 2015 11:08 AM To:

RE: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Evo Eftimov
Spark uses Tachyon internally, i.e. all SERIALIZED IN-MEMORY RDDs are kept there – so if you have a BATCH RDD which is SERIALIZED IN-MEMORY then you are using Tachyon implicitly – the only difference is that if you are using Tachyon explicitly, i.e. as a distributed, in-memory file system, you can

Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
If your problem is that stopping/starting the cluster resets configs, then you may be running into this issue: https://issues.apache.org/jira/browse/SPARK-4977 Nick On Thu, Jun 4, 2015 at 2:46 PM barmaley o...@solver.com wrote: Hi - I'm having similar problem with switching from ephemeral to

Re: SparkContext Threading

2015-06-05 Thread Igor Berman
Lee, what cluster do you use? Standalone, yarn-cluster, yarn-client, mesos? In yarn-cluster the driver program is executed inside one of the nodes in the cluster, so it might be that the driver code needs to be serialized to be sent to some node. On 5 June 2015 at 22:55, Lee McFadden splee...@gmail.com wrote:

Re: SparkContext Threading

2015-06-05 Thread Marcelo Vanzin
Ignoring the serialization thing (seems like a red herring): On Fri, Jun 5, 2015 at 11:48 AM, Lee McFadden splee...@gmail.com wrote: 15/06/05 11:35:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NoSuchMethodError:

Job aborted

2015-06-05 Thread Giovanni Paolo Gibilisco
I'm running PageRank on datasets with different sizes (from 1GB to 100GB). Sometime my job is aborted showing this error: Job aborted due to stage failure: Task 0 in stage 4.1 failed 4 times, most recent failure: Lost task 0.3 in stage 4.1 (TID 2051, 9.12.247.250): java.io.FileNotFoundException:

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:30 PM Marcelo Vanzin van...@cloudera.com wrote: Ignoring the serialization thing (seems like a red herring): People seem surprised that I'm getting the Serialization exception at all - I'm not convinced it's a red herring per se, but on to the blocking issue...

Re: SparkContext Threading

2015-06-05 Thread Marcelo Vanzin
On Fri, Jun 5, 2015 at 12:55 PM, Lee McFadden splee...@gmail.com wrote: Regarding serialization, I'm still confused as to why I was getting a serialization error in the first place as I'm executing these Runnable classes from a java thread pool. I'm fairly new to Scala/JVM world and there

Re: PySpark with OpenCV causes python worker to crash

2015-06-05 Thread Davies Liu
Thanks for letting us know. On Fri, Jun 5, 2015 at 8:34 AM, Sam Stoelinga sammiest...@gmail.com wrote: Please ignore this whole thread. It's working out of nowhere. I'm not sure what was the root cause. After I restarted the VM the previous SIFT code also started working. On Fri, Jun 5, 2015 at

Re: SparkContext Threading

2015-06-05 Thread Will Briggs
Your lambda expressions on the RDDs in the SecondRollup class are closing around the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora, but there's a good explanation here:

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-05 Thread Doug Balog
Hi Yin, Thanks for the suggestion. I’m not happy about this, and I don’t agree with your position that since it wasn’t an “officially” supported feature no harm was done breaking it in the course of implementing SPARK-6908. I would still argue that it changed and therefore broke .table()’s

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread John Omernik
Thanks all. The answers post is me too, I multi-thread. Ted is aware too, and MapR is helping me with it. I shall report the answer of that investigation when we have it. As to reproduction, I've installed the MapR file system, tried both versions 4.0.2 and 4.1.0. Have Mesos running along

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 1:00 PM Igor Berman igor.ber...@gmail.com wrote: Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos? Spark standalone, v1.2.1.

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 12:58 PM Marcelo Vanzin van...@cloudera.com wrote: You didn't show the error so the only thing we can do is speculate. You're probably sending the object that's holding the SparkContext reference over the network at some point (e.g. it's used by a task run in an

Re: SparkContext Threading

2015-06-05 Thread Lee McFadden
On Fri, Jun 5, 2015 at 2:05 PM Will Briggs wrbri...@gmail.com wrote: Your lambda expressions on the RDDs in the SecondRollup class are closing around the context, and Spark has special logic to ensure that all variables in a closure used on an RDD are Serializable - I hate linking to Quora,

Re: Spark 1.4.0-rc4 HiveContext.table(db.tbl) NoSuchTableException

2015-06-05 Thread Yin Huai
Hi Doug, For now, I think you can use sqlContext.sql("USE databaseName") to change the current database. Thanks, Yin On Thu, Jun 4, 2015 at 12:04 PM, Yin Huai yh...@databricks.com wrote: Hi Doug, sqlContext.table does not officially support database names. It only supports a table name as the
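For reference, a minimal sketch of the workaround suggested above, assuming `sqlContext` is a HiveContext; the database and table names are placeholders.
```
sqlContext.sql("USE my_database")          // switch the current Hive database
val df = sqlContext.table("my_table")      // then refer to the table by its bare name
df.show()
```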

Re: Spark 1.3.1 On Mesos Issues.

2015-06-05 Thread Tim Chen
It seems like there is another thread going on: http://answers.mapr.com/questions/163353/spark-from-apache-downloads-site-for-mapr.html I'm not particularly sure why, seems like the problem is that getting the current context class loader is returning null in this instance. Do you have some

Re: Spark SQL and Streaming Results

2015-06-05 Thread Todd Nist
There used to be a project, StreamSQL ( https://github.com/thunderain-project/StreamSQL), but it appears a bit dated and I do not see it in the Spark repo, though I may have missed it. @TD Is this project still active? I'm not sure what the status is, but it may provide some insights on how to achieve

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
You could take a look at the RDD *async operations and their source code. Maybe that can help in getting some early results. TD On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile pietro.gentile89.develo...@gmail.com wrote: Hi all, what is the best way to perform Spark SQL queries and obtain the result

Re: Can you specify partitions?

2015-06-05 Thread amghost
Maybe you could try to implement your own Partitioner. As I remember, by default, Spark uses HashPartitioner. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-you-specify-partitions-tp23156p23187.html Sent from the Apache Spark User List mailing list
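For reference, a hedged sketch of a custom Partitioner as suggested above: route keys to partitions by your own rule instead of the default HashPartitioner. The partitioning rule and sample data are placeholders.
```
import org.apache.spark.Partitioner

class FirstLetterPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val k = key.toString
    if (k.isEmpty) 0 else math.abs(k.charAt(0).toInt) % partitions  // toy rule
  }
}

// Usage on a pair RDD:
val pairs = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("avocado", 3)))
val partitioned = pairs.partitionBy(new FirstLetterPartitioner(8))
```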

RE: Cassandra Submit

2015-06-05 Thread Mohammed Guller
Check your spark.cassandra.connection.host setting. It should be pointing to one of your Cassandra nodes. Mohammed From: Yasemin Kaya [mailto:godo...@gmail.com] Sent: Friday, June 5, 2015 7:31 AM To: user@spark.apache.org Subject: Cassandra Submit Hi, I am using cassandraDB in my project. I
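For reference, a hedged sketch of the setting pointed at above: the connector needs spark.cassandra.connection.host to point at a reachable Cassandra node (the 127.0.1.1 in the error is the machine's loopback alias, not the database). The host value is a placeholder.
```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-job")
  .set("spark.cassandra.connection.host", "10.0.0.12")  // placeholder: a real Cassandra node

val sc = new SparkContext(conf)
// Equivalently, pass it at submit time:
//   spark-submit --conf spark.cassandra.connection.host=10.0.0.12 ...
```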

Re: How to run spark streaming application on YARN?

2015-06-05 Thread Saiph Kappa
I was able to run my application by just using a Hadoop/YARN cluster with 1 machine. Today I tried to extend the cluster to use one more machine, but I got some problems on the YARN node manager of the newly added machine: Node Manager Log: « 2015-06-06 01:41:33,379 INFO

Re: Loading CSV to DataFrame and saving it into Parquet for speedup

2015-06-05 Thread Hossein
Why not let SparkSQL deal with parallelism? When using SparkSQL data sources you can control parallelism by specifying mapred.min.split.size and mapred.max.split.size in your Hadoop configuration. You can then repartition your data as you wish and save it as Parquet. --Hossein On Thu, May
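For reference, a hedged sketch of the flow described above, assuming Spark 1.4's DataFrameReader and the spark-csv package: let the data source read the CSV, optionally tune split sizes through the Hadoop configuration, then repartition and persist as Parquet. Paths, split size, and the partition count are placeholders.
```
sc.hadoopConfiguration.set("mapred.max.split.size", (64 * 1024 * 1024).toString)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/input.csv")

df.repartition(32).write.parquet("hdfs:///data/input.parquet")
```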

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
I am not sure. Saisai may be able to say more about it. TD On Fri, Jun 5, 2015 at 5:35 PM, Todd Nist tsind...@gmail.com wrote: There use to be a project, StreamSQL ( https://github.com/thunderain-project/StreamSQL), but it appears a bit dated and I do not see it in the Spark repo, but may

Re: Managing spark processes via supervisord

2015-06-05 Thread ayan guha
I use a simple Python script to launch the cluster. I just did it for fun, so of course it is not the best and a lot of modifications can be done. But I think you are looking for something similar? import subprocess as s from time import sleep cmd =

Re: Managing spark processes via supervisord

2015-06-05 Thread Mike Trienis
Thanks Ignor, I managed to find a fairly simple solution. It seems that the shell scripts (e.g. start-master.sh, start-slave.sh) end up executing bin/spark-class, which is always run in the foreground. Here is a solution I provided on Stack Overflow: -

Spark Streaming for Each RDD - Exception on Empty

2015-06-05 Thread John Omernik
Is there a pythonic/sparkonic way to test for an empty RDD before using foreachRDD? Basically I am using the Python example https://spark.apache.org/docs/latest/streaming-programming-guide.html to put records somewhere. When I have data, it works fine; when I don't, I get an exception. I am not