Re: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Vikash Pareek
Thanks Sidney for your response. To check whether all the messages are processed, I used an accumulator and also added a print statement for debugging. val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator") ... ... ... val mappedDataStream = dataStream.map(_._2); mappedDataStream.
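A minimal sketch of that debugging approach, assuming ssc and dataStream are the StreamingContext and Kafka direct stream from the elided setup:

    val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator")
    val mappedDataStream = dataStream.map(_._2)
    mappedDataStream.foreachRDD { rdd =>
      rdd.foreach(_ => accum += 1)                          // incremented on the executors
      println(s"records processed so far: ${accum.value}")  // read back on the driver
    }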

RE: Message getting lost in Kafka + Spark Streaming

2017-05-31 Thread Sidney Feiner
Are you sure that every message gets processed? It could be that some messages failed to pass the decoder. And during the processing, are you maybe putting the events into a map? That way, events with the same key could override each other, and you'll end up with fewer final events. -Origin
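A tiny illustration of the pitfall Sidney describes: collecting events into a Map keyed on a non-unique field silently drops duplicates.

    val events = Seq(("user-1", "click"), ("user-1", "scroll"), ("user-2", "click"))
    val byKey  = events.toMap   // later entries override earlier ones with the same key
    println(byKey.size)         // prints 2, not 3; one "user-1" event was lost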

RE: The following Error seems to happen once in every ten minutes (Spark Structured Streaming)?

2017-05-31 Thread Mahesh Sawaiker
Your datanode(s) are going down for some reason; check the datanode logs and fix the underlying issue that makes them go down. There should be no need to delete any data, just restarting the datanodes should do the trick for you. From: kant kodali [mailto:kanth...@gmail.com] Sent: T

RE: Spark sql with Zeppelin, Task not serializable error when I try to cache the spark sql table

2017-05-31 Thread Mahesh Sawaiker
It’s because the class in which you have defined the UDF is not serializable. Declare the UDF in a class and make that class serializable. From: shyla deshpande [mailto:deshpandesh...@gmail.com] Sent: Thursday, June 01, 2017 10:08 AM To: user Subject: Spark sql with Zeppelin, Task not serializable
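A sketch of that fix, with illustrative names: keeping the UDF in a small serializable holder (here a top-level object) stops Spark from pulling a non-serializable outer class into the task closure.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object MyUdfs extends Serializable {
      val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
    }

    val spark = SparkSession.builder.appName("udf-demo").getOrCreate()
    import spark.implicits._

    val df = spark.range(3).selectExpr("cast(id as string) as id")
    df.withColumn("up", MyUdfs.toUpper($"id")).cache().show()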

Spark sql with Zeppelin, Task not serializable error when I try to cache the spark sql table

2017-05-31 Thread shyla deshpande
Hello all, I am using Zeppelin 0.7.1 with Spark 2.1.0. I am getting an org.apache.spark.SparkException: Task not serializable error when I try to cache the Spark SQL table. I am using a UDF on a column of the table and want to cache the resultant table. I can execute the paragraph successfully when ther

Question about mllib.recommendation.ALS

2017-05-31 Thread Sahib Aulakh [Search] ­
Hello: I am training the ALS model for recommendations. I have about 200m ratings from about 10m users and 3m products. I have a small cluster with 48 cores and 120 GB of cluster-wide memory. My code is very similar to the example code spark/examples/src/main/scala/org/apache/spark/examples/mllib/Mo
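For reference, a minimal mllib ALS sketch along the lines of the MovieLens example the poster mentions; the path and hyperparameters are illustrative, and sc is assumed to be an existing SparkContext.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)
    val top5  = model.recommendProducts(42, 5)  // top 5 products for user 42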

The following Error seems to happen once in every ten minutes (Spark Structured Streaming)?

2017-05-31 Thread kant kodali
Hi All, When my query is streaming I get the following error once in, say, 10 minutes. A lot of the solutions online seem to suggest just clearing the data directories under the datanode and namenode and restarting the HDFS cluster, but I didn't see anything that explains the cause. If it happens so frequently, what d

[apache-spark] Re: Problem with master webui with reverse proxy when workers >= 10

2017-05-31 Thread Trevor McKay
So one thing I've noticed, if I turn on as much debug logging as I can find, is that when reverse proxy is enabled I end up with a lot of these threads being logged: 17/05/31 21:06:37 DEBUG SelectorManager: Starting Thread[MasterUI-218-selector-ClientSelectorManager@36a67cc9/14,5,main] on org.spa

Re: Using SparkContext in Executors

2017-05-31 Thread lucas.g...@gmail.com
+1 to Ayan's answer. I think this is a common distributed anti-pattern that trips us all up at some point or another. You definitely want to (in most cases) yield and create a new RDD/Dataframe/Dataset and then perform your save operation on that. On 28 May 2017 at 21:09, ayan guha wrote: > Hi
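A sketch of that pattern: do the work as a pure transformation on the executors, then let the driver issue the save action; df and the output path here are illustrative.

    import org.apache.spark.sql.functions.lit

    val enriched = df.withColumn("processed", lit(true))         // transformation, runs on executors
    enriched.write.mode("overwrite").parquet("hdfs:///tmp/out")  // action, issued from the driver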

good http sync client to be used with spark

2017-05-31 Thread vimal dinakaran
Hi, In our application pipeline we need to push the data from Spark Streaming to an HTTP server. I would like to have an HTTP client with the below requirements: 1. synchronous calls 2. HTTP connection pool support 3. lightweight and easy to use. Spray and Akka HTTP are mostly suited for async calls. Co
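One library that matches all three requirements is Apache HttpClient 4.x (synchronous, pooled, lightweight); a hedged sketch, with the endpoint URL illustrative and stream assumed to be the DStream being pushed out:

    import org.apache.http.client.methods.HttpPost
    import org.apache.http.entity.StringEntity
    import org.apache.http.impl.client.HttpClients
    import org.apache.http.impl.conn.PoolingHttpClientConnectionManager

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val cm = new PoolingHttpClientConnectionManager()
        cm.setMaxTotal(20)  // pooled connections shared within this partition
        val client = HttpClients.custom().setConnectionManager(cm).build()
        try records.foreach { record =>
          val post = new HttpPost("http://ingest-host:8080/events")
          post.setEntity(new StringEntity(record.toString))
          client.execute(post).close()  // synchronous call; close releases the connection
        } finally client.close()
      }
    }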

mapWithState termination

2017-05-31 Thread Dominik Safaric
Dear all, I would appreciate it if anyone could explain when mapWithState terminates, i.e. applies subsequent transformations such as writing the state to an external sink. Given a KafkaConsumer instance pulling messages from a Kafka topic, and a mapWithState transformation updating the state
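For what it's worth, mapWithState does not terminate as such: it emits its mapped output every batch interval, and downstream transformations and sinks run once per batch. A minimal sketch, assuming pairs is a DStream[(String, Int)]:

    import org.apache.spark.streaming.{State, StateSpec}

    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
      val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(sum)
      (key, sum)  // emitted downstream every batch, not only at shutdown
    }
    pairs.mapWithState(spec).foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///tmp/state-out-${time.milliseconds}")  // runs each batch
    }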

Re: Running into the same problem as JIRA SPARK-20325

2017-05-31 Thread Michael Armbrust
> So, my question is the same as stated in the following ticket, which is: do we need to create a checkpoint directory for each individual query? Yes. Checkpoints record what data has been processed. Thus two different queries need their own checkpoints.
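Concretely, that means each concurrent query gets its own checkpointLocation; a sketch, with the sinks, paths, and stream1/stream2 inputs illustrative:

    val q1 = stream1.writeStream
      .format("parquet").option("path", "/out/one")
      .option("checkpointLocation", "/checkpoints/one")
      .start()
    val q2 = stream2.writeStream
      .format("parquet").option("path", "/out/two")
      .option("checkpointLocation", "/checkpoints/two")  // distinct from q1's
      .start()
    spark.streams.awaitAnyTermination()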

Problem with master webui with reverse proxy when workers >= 10

2017-05-31 Thread tmckay
Hi folks, I'm running a containerized Spark 2.1.0 setup. If I have reverse proxy turned on, everything is fine if I have fewer than 10 workers. If I create a cluster with 10 or more, the master web UI is unavailable. This is even true for the same cluster if I start with a lower number and ad

Running into the same problem as JIRA SPARK-20325

2017-05-31 Thread kant kodali
Hi All, I am using Spark 2.1.1 and the foreach sink to write to Kafka. I call .start and .awaitTermination for each query; however, I get the following error: "Cannot start query with id d4b3554d-ee1d-469c-bf9d-19976c8a7b47 as another query with same id is already active". So, my question is the same as

An Architecture question on the use of virtualised clusters

2017-05-31 Thread Mich Talebzadeh
Hi, I realize this may not have direct relevance to Spark, but has anyone tried to create virtualized HDFS clusters using tools like ISILON or similar? The prime motive behind this approach is to minimize the propagation or copying of data which has regulatory implications. In short you want your dat

Creating Dataframe by querying Impala

2017-05-31 Thread morfious902002
Hi, I am trying to create a Dataframe by querying an Impala table. It works fine in my local environment, but when I try to run it in the cluster I get either Error: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver or "No suitable driver found". Can someone help me or direct me to how
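A hedged sketch of the usual fix: ship the Impala JDBC jar to the driver and executors (e.g. spark-submit --jars ImpalaJDBC41.jar) and name the driver class explicitly; the URL, host, and table below are assumptions.

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:impala://impala-host:21050/default")
      .option("driver", "com.cloudera.impala.jdbc41.Driver")  // class from the error message
      .option("dbtable", "my_table")
      .load()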

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
https://stackoverflow.com/questions/44280360/how-to-convert-datasetrow-to-dataset-of-json-messages-to-write-to-kafka Thanks! On Wed, May 31, 2017 at 1:41 AM, kant kodali wrote: > small correction. > > If I try to convert a Row into a Json String it results in something > like this {"key1", "n
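One way to get the nested { result: { ... } } shape described in the earlier message is to_json over a nested struct (available from Spark 2.1); a sketch with illustrative column names, assuming df holds the name/ratio/count columns:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    val out = df.select(
      to_json(struct(struct(col("name"), col("ratio"), col("count")).as("result"))).as("value"))
    // rows now look like {"result":{"name":"hello","ratio":1.56,"count":34}}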

Re: Worker node log not showed

2017-05-31 Thread Paolo Patierno
No, it's running in standalone mode as a Docker image on Kubernetes. The only way I found was to access the "stderr" file created under the "work" directory in SPARK_HOME, but ... is it the right way? Paolo Patierno Senior Software Engineer (IoT) @ Red Hat Microsoft MVP on Windows Embedded & IoT

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
small correction. If I try to convert a Row into a Json String it results in something like this: {"key1", "name", "value1": "hello", "key2", "ratio", "value2": 1.56 , "key3", "count", "value3": 34} but what I need is something like this: { result: {"name": "hello", "ratio": 1.56, "count": 34} }

Re: Worker node log not showed

2017-05-31 Thread Alonso Isidoro Roman
Are you running the code with YARN? If so, figure out the application ID through the web UI, then run the following command: yarn logs -applicationId <your_application_id> Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
Hi Jules, I read that blog several times prior to asking this question. Thanks! On Wed, May 31, 2017 at 12:12 AM, Jules Damji wrote: > Hello Kant, > > See is the examples in this blog explains how to deal with your particular > case: https://databricks.com/blog/2017/02/23/working-complex-data-

Help in Parsing 'Categorical' type of data

2017-05-31 Thread Amlan Jyoti
Hi, I am trying to run a Naive Bayes model using the Spark ML libraries, in Java. The sample snippet of the dataset is given below: Raw Data - But, as the input data needs to be numeric, I am using a one-hot encoder on the Gender field [m->0,1][f->1,0]; and finally the 'features' vector is inputte
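A sketch of that encoding step (column names are assumed; shown in Scala for brevity even though the thread uses Java): index the categorical Gender column, one-hot encode it, then assemble the features vector that NaiveBayes consumes.

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    val indexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIndex")
    val encoder = new OneHotEncoder()
      .setInputCol("genderIndex").setOutputCol("genderVec")
      .setDropLast(false)  // keep both indicator positions, e.g. m -> (1,0), f -> (0,1)
    val assembler = new VectorAssembler()
      .setInputCols(Array("genderVec" /* plus the other numeric columns */))
      .setOutputCol("features")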

Worker node log not showed

2017-05-31 Thread Paolo Patierno
Hi all, I have a simple cluster with one master and one worker. On another machine I launch the driver, where at some point I have the following lines of code: max.foreachRDD(rdd -> { LOG.info("*** max.foreachRDD"); rdd.foreach(value -> { LOG.info("*** rdd.foreach"); }); })

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread Jules Damji
Hello Kant, See if the examples in this blog explain how to deal with your particular case: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html Cheers Jules Sent from my iPhone Pardon the dumb thumb typos :) > On May 30, 2017, at 7