Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-02 Thread Nicolae Marasoiu
Hi, Set 10ms and spark.streaming.backpressure.enabled=true. This should automatically delay the next batch until the current one is processed, or at least create that balance over a few batches/periods between the consume/process rate and the ingestion rate. Nicu
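
A minimal sketch of the configuration being suggested (the rate cap value is an assumption, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-direct-stream")                          // placeholder app name
      .set("spark.streaming.backpressure.enabled", "true")        // adapt ingestion rate to processing rate
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // hard cap in records/sec per partition (assumed value)
    val ssc = new StreamingContext(conf, Seconds(10))              // batch interval is a placeholder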

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-02 Thread Sourabh Chandak
Thanks Cody, will try to do some estimation. Thanks Nicolae, will try out this config. Thanks, Sourabh On Thu, Oct 1, 2015 at 11:01 PM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > Set 10ms and spark.streaming.backpressure.enabled=true > > > This should automatically

Checkpointing is super slow

2015-10-02 Thread Sourabh Chandak
Hi, I have a receiverless kafka streaming job which was started yesterday evening and was running fine till 4 PM today. Suddenly, after that, checkpoint writing has slowed down and it is now not able to catch up with the incoming data. We are using the DSE stack with Spark 1.2 and Cassandra for

Re: How to save DataFrame as a Table in Hbase?

2015-10-02 Thread Nicolae Marasoiu
Hi, Phoenix, an SQL coprocessor for HBase, has ingestion integration with DataFrames in its 4.x versions. For HBase and RDDs in general there are multiple solutions: the hbase-spark module by Cloudera, which will be part of a future HBase release, hbase-rdd by unicredit, and many others. I am not sure if
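
For reference, a minimal sketch of the Phoenix 4.x DataFrame ingestion path mentioned above (assumes the phoenix-spark module is on the classpath; the table name and ZooKeeper URL are placeholders):

    import org.apache.spark.sql.SaveMode

    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)                                         // phoenix-spark upserts rows via this mode
      .options(Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "zkhost:2181"))
      .save()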

Re: calling persist would cause java.util.NoSuchElementException: key not found:

2015-10-02 Thread Shixiong Zhu
Do you have the full stack trace? Could you check if it's the same as https://issues.apache.org/jira/browse/SPARK-10422 Best Regards, Shixiong Zhu 2015-10-01 17:05 GMT+08:00 Eyad Sibai : > Hi > > I am trying to call .persist() on a dataframe but once I execute the next >

Addition of Meetup Group - Sydney, Melbourne Australia

2015-10-02 Thread Andy Huang
Hi, Could someone please help with adding the following Spark Meetup Groups to the Meetups section of http://spark.apache.org/community.html Sydney Spark Meetup Group: http://www.meetup.com/Sydney-Apache-Spark-User-Group/ Melbourne Spark Meetup Group:

Fwd: Add row IDs column to data frame

2015-10-02 Thread Josh Levy-Kramer
Hi, I've created a simple example using the withColumn method, but it throws an error. Try: val df = List( (1,1), (1,1), (1,2), (2,2) ).toDF("col1", "col2") val index_col = sqlContext.range( df.count() ).col("id") val df_with_index = df.withColumn("index", index_col) The error I get is:
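
One common workaround (not from the thread) is to derive the index from the underlying RDD with zipWithIndex instead of building a separate range column, roughly:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import sqlContext.implicits._

    val df = List((1, 1), (1, 1), (1, 2), (2, 2)).toDF("col1", "col2")
    val rowsWithIndex = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val schema = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
    val dfWithIndex = sqlContext.createDataFrame(rowsWithIndex, schema)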

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-02 Thread Cody Koeninger
But turning backpressure on won't stop you from choking on the first batch if you're doing e.g. some kind of in-memory aggregate that can't handle that many records at once. On Fri, Oct 2, 2015 at 1:10 AM, Sourabh Chandak wrote: > Thanks Cody, will try to do some

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
No problem. From the mapper side, Spark is very similar to MapReduce; but on the reducer fetching side, MR uses sort-merge whereas Spark uses a HashMap. So keep in mind that you get data automatically sorted on the reducer side in MR, but not in Spark. Spark's performance comes: Cache

Re: Checkpointing is super slow

2015-10-02 Thread Cody Koeninger
Why are you sure it's checkpointing speed? Have you compared it against checkpointing to hdfs, s3, or local disk? On Fri, Oct 2, 2015 at 1:17 AM, Sourabh Chandak wrote: > Hi, > > I have a receiverless kafka streaming job which was started yesterday > evening and was

RE: Accumulator of rows?

2015-10-02 Thread Saif.A.Ellafi
Thank you, exactly what I was looking for. I have read of it before but never associated. Saif From: Adrian Tanase [mailto:atan...@adobe.com] Sent: Friday, October 02, 2015 8:24 AM To: Ellafi, Saif A.; user@spark.apache.org Subject: Re: Accumulator of rows? Have you seen window functions?

Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Steve Loughran
On 1 Oct 2015, at 16:52, Adrian Tanase > wrote: This happens automatically as long as you submit with cluster mode instead of client mode. (e.g. ./spark-submit --master yarn-cluster ...) The property you mention would help right after that, although

Re: Getting spark application driver ID programmatically

2015-10-02 Thread Igor Berman
if driver id is application id then yes you can do it with String appId = ctx.sc().applicationId(); //when ctx is java context On 1 October 2015 at 20:44, Snehal Nagmote wrote: > Hi , > > I have use case where we need to automate start/stop of spark streaming >

Re: Shuffle Write v/s Shuffle Read

2015-10-02 Thread Zoltán Zvara
Hi, Shuffle output goes to local disk each time, as far as I know, never to memory. On Fri, Oct 2, 2015 at 1:26 PM Adrian Tanase wrote: > I’m not sure this is related to memory management – the shuffle is the > central act of moving data around nodes when the computations

How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread unk1102
Hi I have registered my hive UDF using the following code: hiveContext.udf().register("MyUDF", new UDF1<String, String>() { public String call(String o) throws Exception { //bla bla } }, DataTypes.StringType); Now I want to use the above MyUDF in a DataFrame. How do we use it? I know how to use it in a sql and

Re: Shuffle Write v/s Shuffle Read

2015-10-02 Thread Adrian Tanase
I’m not sure this is related to memory management – the shuffle is the central act of moving data around nodes when the computations need the data on another node (E.g. Group by, sort, etc) Shuffle read and shuffle write should be mirrored on the left/right side of a shuffle between 2 stages.

Compute Real-time Visualizations using spark streaming

2015-10-02 Thread Sureshv
Hi, I am new to Spark and I would like to know how to compute (dynamically) real-time visualizations using Spark Streaming (Kafka). Use case: We have a real-time analytics dashboard (reports and dashboard); users can define a report (visualization) with certain parameters like refresh period, choose

saveAsTextFile creates an empty folder in HDFS

2015-10-02 Thread jarias
Dear list, I'm experiencing a problem when trying to write any RDD to HDFS. I've tried minimal examples, Scala programs and pyspark programs, both in local and cluster modes and as standalone applications or shells. My problem is that when invoking the write command, a task is executed but

Re: Accumulator of rows?

2015-10-02 Thread Adrian Tanase
Have you seen window functions? https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html From: "saif.a.ell...@wellsfargo.com" Date: Thursday, October 1, 2015 at 9:44 PM To: "user@spark.apache.org"

from groupBy return a DataFrame without aggregation?

2015-10-02 Thread Saif.A.Ellafi
Given ID and DATE, I need all dates sorted per ID. What is the easiest way? I got this, but I don't like it: val l = zz.groupBy("id", "dt").agg($"dt".as("dummy")).sort($"dt".asc) Saif
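
If the goal is just the rows ordered by date within each id, a plain sort avoids the dummy aggregation entirely (a sketch, not from the thread):

    val sorted = zz.select("id", "dt").orderBy($"id", $"dt".asc)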

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
Hello, Yes, but: - In the Java API I don't find an API to create a HDFS archive - As soon as I receive a message (with messageID) I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive? Tks Nicolas - Original Message - From:

Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya
Hi, If you need to use the receiver-based approach, you can try this one: https://github.com/dibbhatt/kafka-spark-consumer This is also part of Spark packages: http://spark-packages.org/package/dibbhatt/kafka-spark-consumer You just need to specify the number of receivers you want for desired

Re: HDFS small file generation problem

2015-10-02 Thread Brett Antonides
I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext. * Create a table in Hive stored as an ORC file (I recommend using partitioning too) * Use SQLContext.sql to Insert data into the table * Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE
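
A minimal sketch of that approach (table, column, and partition names are placeholders; requires a HiveContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS events (id STRING, payload STRING) PARTITIONED BY (dt STRING) STORED AS ORC")
    hiveContext.sql("INSERT INTO TABLE events PARTITION (dt = '2015-10-02') SELECT id, payload FROM staging_batch")
    hiveContext.sql("ALTER TABLE events PARTITION (dt = '2015-10-02') CONCATENATE")  // merges the small ORC files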

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
Neither of those statements are true. You need more receivers if you want more parallelism. You don't have to manage offset positioning with the direct stream if you don't want to, as long as you can accept the limitations of Spark checkpointing. On Fri, Oct 2, 2015 at 10:52 AM,

Spark Streaming over YARN

2015-10-02 Thread nibiau
Hello, I have a job receiving data from kafka (4 partitions) and persisting data inside MongoDB. It works fine, but when I deploy it inside a YARN cluster (4 nodes with 2 cores) only one node is receiving all the kafka partitions and only one node is processing my RDD treatment (foreach function)

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
import org.apache.spark.sql.functions.* callUDF("MyUDF", col("col1"), col("col2")) On Fri, Oct 2, 2015 at 6:25 AM, unk1102 wrote: > Hi I have registed my hive UDF using the following code: > > hiveContext.udf().register("MyUDF",new UDF1(String,String)) { > public String
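
In Scala this looks roughly like the following (the output column name is a placeholder; callUDF by name is available from Spark 1.5, older releases use callUdf):

    import org.apache.spark.sql.functions.{callUDF, col}

    val result = df.withColumn("my_udf_result", callUDF("MyUDF", col("col1")))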

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
If you're using the receiver based implementation, and want more parallelism, you have to create multiple streams and union them together. Or use the direct stream. On Fri, Oct 2, 2015 at 10:40 AM, wrote: > Hello, > I have a job receiving data from kafka (4 partitions) and
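
A minimal sketch of the receiver-based variant being described (ZooKeeper quorum, consumer group, and topic name are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 4                                                  // e.g. one receiver per Kafka partition (assumed)
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)                               // one DStream backed by all receivers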

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
From my understanding, as soon as I use YARN I don't need to use parallelism (at least for the RDD treatment). I don't want to use the direct stream as I have to manage the offset positioning (in order to be able to start from the last offset treated after a spark job failure) - Original Message

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a dataframe (look at spark-csv), try df.write.partitionBy("", "mm").save("..."). On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram < haridass.saisri...@gmail.com> wrote: > Hi, > > I am trying to find a simple example to read a data file on HDFS. The > file
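
A sketch of that pattern with placeholder column names and paths (assumes the spark-csv package is available and that year/month columns exist in the data):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///input/data.csv")
    df.write.partitionBy("year", "month").format("parquet").save("hdfs:///output/path")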

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
Ok so if I set for example 4 receivers (the number of nodes), how will the RDDs be distributed over the nodes/cores? For example in my case I have 4 nodes (with 2 cores) Tks Nicolas - Original Message - From: "Dibyendu Bhattacharya" To: nib...@free.fr Cc: "Cody

Re: Spark Streaming over YARN

2015-10-02 Thread Dibyendu Bhattacharya
If your Kafka topic has 4 partitions, and if you specify 4 receivers, messages from each partition are received by a dedicated receiver. So your receiving parallelism is defined by the number of partitions of your topic. Every receiver task will be scheduled evenly among nodes in your

Re: HDFS small file generation problem

2015-10-02 Thread nibiau
Ok thanks, but can I also update data instead of inserting data? - Original Message - From: "Brett Antonides" To: user@spark.apache.org Sent: Friday, October 2, 2015 18:18:18 Subject: Re: HDFS small file generation problem I had a very similar problem and solved it

are functions deserialized once per task?

2015-10-02 Thread Michael Albert
Greetings! Is it true that functions, such as those passed to RDD.map(), are deserialized once per task? This seems to be the case looking at Executor.scala, but I don't really understand the code. I'm hoping the answer is yes because that makes it easier to write code without worrying about

Re: Spark Streaming over YARN

2015-10-02 Thread Cody Koeninger
Direct stream has nothing to do with Zookeeper. The direct stream can start at the offsets you specify. If you're not storing offsets in checkpoints, how and where you store them is up to you. Have you read / watched the information linked from https://github.com/koeninger/kafka-exactly-once
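
A minimal sketch (not from the thread) of starting the direct stream from self-managed offsets; where the offsets are loaded from and saved to is up to you:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")     // placeholder broker list
    val fromOffsets: Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("my-topic", 0) -> 12345L)                    // normally read back from your own offset store
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))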

Re: Weird Spark Dispatcher Offers?

2015-10-02 Thread Tim Chen
Do you have jobs enqueued? And if none of the jobs matches any offer it will just decline it. What are your job resource specifications? Tim On Fri, Oct 2, 2015 at 11:34 AM, Alan Braithwaite wrote: > Hey All, > > Using spark with mesos and docker. > > I'm wondering if

Weird Spark Dispatcher Offers?

2015-10-02 Thread Alan Braithwaite
Hey All, Using spark with mesos and docker. I'm wondering if anybody's seen the behavior of spark dispatcher where it just continually requests resources and immediately declines the offer. https://gist.github.com/anonymous/41e7c91899b0122b91a7 I'm trying to debug some issues with spark and

Re: Spark Streaming over YARN

2015-10-02 Thread nibiau
Sorry, I just said that I NEED to manage offsets, so in the case of the Kafka direct stream, how can I handle this? Update Zookeeper manually? Why not, but are there any other solutions? - Original Message - From: "Cody Koeninger" To: "Nicolas Biau" Cc: "user"

Re: Kafka Direct Stream

2015-10-02 Thread Gerard Maas
Something like this? I'm making the assumption that your topic name equals your keyspace for this filtering example. dstream.foreachRDD{rdd => val topics = rdd.map(_._1).distinct.collect topics.foreach{topic => val filteredRdd = rdd.collect{case (t, data) if t == topic => data}.

Re: automatic start of streaming job on failure on YARN

2015-10-02 Thread Ashish Rangole
Are you running the job in yarn cluster mode? On Oct 1, 2015 6:30 AM, "Jeetendra Gangele" wrote: > We've a streaming application running on yarn and we would like to ensure > that is up running 24/7. > > Is there a way to tell yarn to automatically restart a specific >

RE: from groupBy return a DataFrame without aggregation?

2015-10-02 Thread Diggs, Asoka
I may not be understanding your question - for a given date, you have many ID values - is that correct? Are there additional columns in this dataset that you aren't mentioning, or are we simply dealing with id and dt? What structure do you need the return data to be in? If you're looking for

Re: Kafka Direct Stream

2015-10-02 Thread varun sharma
Hi Adrian, Can you please give an example of how to achieve this: > *I would also look at filtering by topic and saving as different Dstreams > in your code* I have managed to get DStream[(String, String)], which is a (topic, my_data) tuple. Let's call it kafkaStringStream. Now if I do
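
One way to do the DStream-level split (a sketch; assumes the topic names are known up front):

    val topics = Seq("topicA", "topicB")                                 // placeholder topic names
    val perTopicStreams = topics.map { t =>
      t -> kafkaStringStream.filter { case (topic, _) => topic == t }.map(_._2)
    }.toMap                                                              // Map[String, DStream[String]] of per-topic payloads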

Re: Checkpointing is super slow

2015-10-02 Thread Sourabh Chandak
I can see the entries processed in the table very fast but after that it takes a long time for the checkpoint update. Haven't tried other methods of checkpointing yet, we are using DSE on Azure. Thanks, Sourabh On Fri, Oct 2, 2015 at 6:52 AM, Cody Koeninger wrote: > Why

Re: Kafka Direct Stream

2015-10-02 Thread varun sharma
Hi Nicolae, Won't creating N KafkaDirectStreams be an overhead for my streaming job compared to Single DirectStream? On Fri, Oct 2, 2015 at 1:13 AM, Nicolae Marasoiu < nicolae.maras...@adswizz.com> wrote: > Hi, > > > If you just need processing per topic, why not generate N different kafka >

Re: Adding the values in a column of a dataframe

2015-10-02 Thread sethah
df.agg(sum("age")).show()

RE: how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
Hi, thank you. I would prefer to leave writing-to-disk as a last resort. Is it a last resort? Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, October 02, 2015 3:54 PM To: Ellafi, Saif A. Cc: user Subject: Re: how to broadcast huge lookup table? Have you considered using external

Re: Weird Spark Dispatcher Offers?

2015-10-02 Thread Alan Braithwaite
> So if there are no jobs to run the dispatcher will decline all offers by default. So would this be a bug in mesos then? I'm not sure I understand how this offer is appearing in the first place. It only shows up in the master logs when I start the dispatcher. > Also we list all the jobs

Re: Weird Spark Dispatcher Offers?

2015-10-02 Thread Alan Braithwaite
This happened right after blowing away /var/lib/mesos zk://mesos and zk://spark_mesos_dispatcher and before I've submitted anything new to it so I _shouldn't_ have anything enqueued. Unless there's state being stored somewhere besides those places that I don't know about. I'm not sure what the

Re: Weird Spark Dispatcher Offers?

2015-10-02 Thread Tim Chen
Hi Alan, The dispatcher is a Mesos framework and all frameworks in Mesos receives offers from the master. Mesos is different than most schedulers where we don't issue containers based on requests, but we offer available resources to all frameworks and they in turn decide if they want to use these

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Umesh Kacha
Hi Michael, Thanks much. How do we give an alias name to the resultant columns? E.g. when using hiveContext.sql("select MyUDF("test") as mytest from myTable"); how do we do that with the DataFrame callUDF: callUDF("MyUDF", col("col1"))??? On Fri, Oct 2, 2015 at 8:23 PM, Michael Armbrust

how to broadcast huge lookup table?

2015-10-02 Thread Saif.A.Ellafi
I tried broadcasting a key-value RDD, but then I cannot perform any RDD actions inside a map/foreach function of another RDD. Any tips? If going into Scala collections I end up with huge memory bottlenecks. Saif
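
When the table does fit in memory, the usual pattern (a sketch, with placeholder RDD names) is to collect the lookup RDD to a plain Map on the driver and broadcast that, since an RDD cannot be referenced inside another RDD's closures; when it does not fit, external storage or a join is the alternative:

    val lookup: Map[String, String] = lookupRdd.collectAsMap().toMap     // lookupRdd: RDD[(String, String)]
    val lookupBc = sc.broadcast(lookup)
    val enriched = dataRdd.map { case (k, v) => (k, v, lookupBc.value.get(k)) }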

Re: how to broadcast huge lookup table?

2015-10-02 Thread Ted Yu
Have you considered using external storage such as hbase for storing the look up table ? Cheers On Fri, Oct 2, 2015 at 11:50 AM, wrote: > I tried broadcasting a key-value rdd, but then I cannot perform any > rdd-actions inside a map/foreach function of another

No plan for broadcastHint

2015-10-02 Thread Swapnil Shinde
Hello, I am trying to do an inner join with broadcastHint and am getting the below exception. I tried to increase "sqlContext.conf.autoBroadcastJoinThreshold" but still no luck. *Code snippet-* val dpTargetUvOutput = pweCvfMUVDist.as("a").join(broadcast(sourceAssgined.as("b")), $"a.web_id" ===

Re: Weird Spark Dispatcher Offers?

2015-10-02 Thread Tim Chen
So if there are no jobs to run the dispatcher will decline all offers by default. Also we list all the jobs enqueued and their specifications in the Spark dispatcher UI; you should see the port in the dispatcher logs itself. Tim On Fri, Oct 2, 2015 at 11:46 AM, Alan Braithwaite

Re: Problem understanding spark word count execution

2015-10-02 Thread Kartik Mathur
Thanks Yong, my script is pretty straightforward - *sc.textFile("/wc/input").flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_).saveAsTextFile("/wc/out2") *//both paths are HDFS. So if every shuffle write always writes to disk, what is the meaning of these

Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread balajikvijayan
Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. I'm trying to read in a large quantity of json data in a couple of files and I receive a scala.MatchError when I do so. Json, Python and stack trace all shown below. Json: { "dataunit": { "page_view": {

RE: Problem understanding spark word count execution

2015-10-02 Thread java8964
These parameters in fact control the behavior on reduce side, as in your word count example. The partitions will be fetched by the reducer which being assigned to it. The reducer will fetch corresponding partitions from different mappers output, and it will process the data based on your logic

Re: How to use registered Hive UDF in Spark DataFrame?

2015-10-02 Thread Michael Armbrust
callUDF("MyUDF", col("col1").as("name") or callUDF("MyUDF", col("col1").alias("name") On Fri, Oct 2, 2015 at 3:29 PM, Umesh Kacha wrote: > Hi Michael, > > Thanks much. How do we give alias name for resultant columns? For e.g. > when using > > hiveContext.sql("select

Re: how to get Application ID from Submission ID or Driver ID programmatically

2015-10-02 Thread firemonk9
Have you found how to get the applicationId from the submissionId?

Re: Checkpointing is super slow

2015-10-02 Thread Sourabh Chandak
Tried using local checkpointing as well, and even that becomes slow after some time. Any idea what could be wrong? Thanks, Sourabh On Fri, Oct 2, 2015 at 9:35 AM, Sourabh Chandak wrote: > I can see the entries processed in the table very fast but after that it > takes a

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread Ted Yu
I got the following when parsing your input with master branch (Python version 2.6.6): http://pastebin.com/1w8WM3tz FYI On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan wrote: > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. > > I'm trying to read in a

How does FAIR job scheduler work in Standalone cluster mode?

2015-10-02 Thread Jacek Laskowski
Hi, The docs on Resource Scheduling [1] say: > The standalone cluster mode currently only supports a simple FIFO scheduler > across applications. There is, however, `spark.scheduler.mode`, which can be one of the values `FAIR`, `FIFO`, `NONE`. Is FAIR available for Spark Standalone cluster mode? Is

Re: How does FAIR job scheduler work in Standalone cluster mode?

2015-10-02 Thread Marcelo Vanzin
You're mixing app scheduling in the cluster manager (your [1] link) with job scheduling within an app (your [2] link). They're independent things. On Fri, Oct 2, 2015 at 2:22 PM, Jacek Laskowski wrote: > Hi, > > The docs in Resource Scheduling [1] says: > >> The standalone

Re: How does FAIR job scheduler work in Standalone cluster mode?

2015-10-02 Thread Marcelo Vanzin
On Fri, Oct 2, 2015 at 5:29 PM, Jacek Laskowski wrote: >> The standalone cluster mode currently only supports a simple FIFO scheduler >> across applications. > > is correct or not? :( I think so. But, because they're different things, that does not mean you cannot use a fair
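
For the within-application side of that distinction, a sketch of enabling fair scheduling across jobs inside one app (pool name and file path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-pools-example")                                      // placeholder app name
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // optional pool definitions
    val sc = new SparkContext(conf)
    sc.setLocalProperty("spark.scheduler.pool", "pool1")                     // jobs from this thread go to pool1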

Re: How does FAIR job scheduler work in Standalone cluster mode?

2015-10-02 Thread Jacek Laskowski
Hi, I may indeed be mixing up different aspects. Thanks for the answer! Does this answer my initial question, though, as I'm still unsure whether the sentence: > The standalone cluster mode currently only supports a simple FIFO scheduler > across applications. is correct or not? :(

Re: Problem understanding spark word count execution

2015-10-02 Thread Kartik Mathur
Thanks Yong, that was the good explanation I was looking for. However, I have one doubt. You write: "Imagine that you have 2 mappers to read the data, then each mapper will generate the (word, count) tuple output in segments. Spark always outputs that in a local file. (In fact, one file with

Re: Limiting number of cores per job in multi-threaded driver.

2015-10-02 Thread Philip Weaver
You can't really say 8 cores is not much horsepower when you have no idea what my use case is. That's silly. On Fri, Sep 18, 2015 at 10:33 PM, Adrian Tanase wrote: > Forgot to mention that you could also restrict the parallelism to 4, > essentially using only 4 cores at any

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-10-02 Thread satish chandra j
Hi, I am getting the below error while implementing the custom class code given by you above: error: type mismatch; found: String, required: Serializable. Please let me know if I am missing anything here. Regards, Satish Chandra On Wed, Sep 23, 2015 at 12:34 PM, Petr Novak

Re: Checkpointing is super slow

2015-10-02 Thread Tathagata Das
Could you get the log4j INFO/DEBUG level logs which show the error and, if possible, the time taken to write the checkpoints. On Fri, Oct 2, 2015 at 6:28 PM, Sourabh Chandak wrote: > Offset checkpoints (partition, offset) when using kafka direct streaming > approach > > > On

Re: Checkpointing is super slow

2015-10-02 Thread Sourabh Chandak
Offset checkpoints (partition, offset) when using kafka direct streaming approach On Friday, October 2, 2015, Tathagata Das wrote: > Which checkpointing are you talking about? DStream checkpoints (which > saves the DAG of DStreams, that is, only metadata), or RDD

Re: Checkpointing is super slow

2015-10-02 Thread Tathagata Das
Which checkpointing are you talking about? DStream checkpoints (which saves the DAG of DStreams, that is, only metadata), or RDD checkpointing (which saves the actual intermediate RDD data) TD On Fri, Oct 2, 2015 at 2:56 PM, Sourabh Chandak wrote: > Tried using local