Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
(Sorry if this is a duplicate. I got a strange error message when I first tried to send it earlier) I want to pull HDFS paths from Kafka and build text streams based on those paths. I currently have: val lines = KafkaUtils.createStream(/* params here */).map(_._2) val buffer = new

Best way to assign unique IDs to row groups

2017-03-01 Thread Everett Anderson
Hi, I've used functions.monotonically_increasing_id() for assigning a unique ID to all rows, but I'd like to assign a unique ID to each group of rows with the same key. The two ways I can think of to do this are Option 1: Create separate group ID table and join back - Create a new data
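A minimal sketch of Option 1 as described above (build a table of distinct keys with generated IDs, then join back), assuming the input DataFrame is df and the grouping column is a hypothetical key:

    import org.apache.spark.sql.functions._

    // One generated ID per distinct key value.
    val groupIds = df.select("key").distinct()
      .withColumn("group_id", monotonically_increasing_id())

    // Join back so every row carries the ID of its group.
    val withGroupId = df.join(groupIds, Seq("key"))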

Spark 2.0 issue with left_outer join

2017-03-01 Thread Ankur Srivastava
Hi Users, We are facing an issue with left_outer join using Spark Dataset api in 2.0 Java API. Below is the code we have Dataset badIds = filteredDS.groupBy(col("id").alias("bid")).count() .filter((FilterFunction) row -> (Long) row.getAs("count") > 75000); _logger.info("Id count with
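For context, a Scala sketch of the pattern being described (the original uses the Java Dataset API); the column names and the 75,000 threshold come from the snippet above, and filteredDS stands for the existing Dataset:

    import org.apache.spark.sql.functions._

    // Ids that occur more than 75,000 times in filteredDS.
    val badIds = filteredDS
      .groupBy(col("id").alias("bid"))
      .count()
      .filter(col("count") > 75000L)

    // left_outer keeps every row of filteredDS and attaches the count only where the id is "bad".
    val joined = filteredDS.join(badIds, col("id") === col("bid"), "left_outer")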

reassign location of partitions

2017-03-01 Thread Simona Rabinovici-Cohen
Hello, We have an application in which specific partitions need to be executed on specific worker machines. Thus, we would like to assign, for each partition, the worker machine that will execute it. Moreover, we would like to do this before each stage of the Spark app, and be able to
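Spark does not expose a public API to re-pin arbitrary partitions to workers before each stage, but location preferences can be hinted when an RDD is created; a minimal sketch, assuming the data-to-host mapping is known up front, sc is an existing SparkContext, and the hostnames are hypothetical:

    // This overload of makeRDD creates one partition per element and records the
    // given hosts as *preferred* locations (scheduling hints, not hard guarantees).
    val rdd = sc.makeRDD(Seq(
      ("data-for-partition-0", Seq("worker1.example.com")),
      ("data-for-partition-1", Seq("worker2.example.com"))
    ))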

Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
I want to pull HDFS paths from Kafka and build text streams based on those paths. I currently have: val lines = KafkaUtils.createStream(/* params here */).map(_._2) val buffer = new ArrayBuffer[String]() lines.foreachRDD(rdd => { if (!rdd.partitions.isEmpty) { rdd.collect().foreach(line
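A minimal sketch of one way the snippet above could be completed, assuming the receiver-based KafkaUtils.createStream API (spark-streaming-kafka-0-8); the ZooKeeper quorum, group id, topic name, and StreamingContext setup are hypothetical placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // sc is an existing SparkContext; the Kafka/ZooKeeper settings below are placeholders.
    val ssc = new StreamingContext(sc, Seconds(30))
    val paths = KafkaUtils.createStream(ssc, "zkhost:2181", "path-readers", Map("hdfs-paths" -> 1))
      .map(_._2)

    paths.foreachRDD { rdd =>
      if (!rdd.partitions.isEmpty) {
        // Each batch only carries file paths, so collecting to the driver is cheap.
        rdd.collect().foreach { path =>
          val text = rdd.sparkContext.textFile(path)
          println(s"$path contains ${text.count()} lines") // placeholder processing
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()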

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Bill Schwanitz
Subhash, Yeah, that did the trick, thanks! On Wed, Mar 1, 2017 at 12:20 PM, Subhash Sriram wrote: > If I am understanding your problem correctly, I think you can just create > a new DataFrame that is a transformation of sample_data by first > registering sample_data as a

Re: Why spark history server does not show RDD even if it is persisted?

2017-03-01 Thread Parag Chaudhari
Thanks! Thanks, Parag Chaudhari, USC Alumnus (Fight On!) Mobile: (213)-572-7858 Profile: http://www.linkedin.com/pub/parag-chaudhari/28/a55/254 On Tue, Feb 28, 2017 at 12:53 PM, Shixiong(Ryan) Zhu <shixi...@databricks.com>

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Subhash Sriram
If I am understanding your problem correctly, I think you can just create a new DataFrame that is a transformation of sample_data by first registering sample_data as a temp table. //Register temp table sample_data.createOrReplaceTempView("sql_sample_data") //Create new DataSet with transformed
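A small sketch of the approach described above, with a hypothetical upper-case transform of some_column standing in for the real one and spark being the SparkSession:

    // Register the DataFrame as a temp view, then express the transformation in SQL.
    sample_data.createOrReplaceTempView("sql_sample_data")

    val transformed = spark.sql(
      """SELECT *, upper(some_column) AS some_column_upper
        |FROM sql_sample_data""".stripMargin)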

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Marco Mistroni
Hi, I think you need a UDF if you want to transform a column. HTH. On 1 Mar 2017 4:22 pm, "Bill Schwanitz" wrote: > Hi all, > > I'm fairly new to spark and scala so bear with me. > > I'm working with a dataset containing a set of columns / fields. The data > is stored in hdfs as
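A minimal sketch of the UDF approach, assuming a DataFrame df with a hypothetical string column value:

    import org.apache.spark.sql.functions.{col, udf}

    // Wrap a plain Scala function as a UDF and apply it to the column.
    val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

    val transformed = df.withColumn("value", normalize(col("value")))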

question on transforms for spark 2.0 dataset

2017-03-01 Thread Bill Schwanitz
Hi all, I'm fairly new to spark and scala so bear with me. I'm working with a dataset containing a set of columns / fields. The data is stored in hdfs as parquet and is sourced from a postgres box so fields and values are reasonably well formed. We are in the process of trying out a switch from

Re: [Spark] Accumulators or count()

2017-03-01 Thread Daniel Siegmann
As you noted, Accumulators do not guarantee accurate results except in specific situations. I recommend never using them. This article goes into some detail on the problems with accumulators: http://imranrashid.com/posts/Spark-Accumulators/ On Wed, Mar 1, 2017 at 7:26 AM, Charles O. Bajomo <
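For reference, a minimal sketch of the count()-based alternative; rawRdd, parse, and isBad are hypothetical stand-ins for the real pipeline:

    // Cache so the RDD is not recomputed for both the count and the downstream work.
    val parsed = rawRdd.map(parse).cache()

    // count() is an action over a deterministic transformation, so the result stays
    // exact even when tasks are retried or stages are recomputed.
    val badCount = parsed.filter(isBad).count()
    val good     = parsed.filter(record => !isBad(record))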

Re: Continuous or Categorical

2017-03-01 Thread Richard Siebeling
I think it's difficult to determine with certainty whether a variable is continuous or categorical; what to do when the values are numbers like 1, 2, 2, 3, 4, 5? These values could be either continuous or categorical. However, you could perform some checks: - are there any decimal values > it will
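A rough sketch of those checks in Spark, assuming the values sit in a numeric column value of a DataFrame df; the 0.2 distinct-ratio threshold is an arbitrary illustration:

    import org.apache.spark.sql.functions._

    val total         = df.count()
    val distinctCount = df.select("value").distinct().count()
    // Any value that changes when truncated to a long has a decimal part.
    val hasDecimals   = df.filter(col("value") =!= col("value").cast("long")).count() > 0

    // Heuristic: decimal values, or a high ratio of distinct values, suggest a continuous variable.
    val looksContinuous = hasDecimals || distinctCount.toDouble / total > 0.2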

Continuous or Categorical

2017-03-01 Thread Madabhattula Rajesh Kumar
Hi, How can I check whether a given set of values (for example, column values in a CSV file) is continuous or categorical? Is any statistical test available? Regards, Rajesh

Re: Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Dominik Safaric
The jars I am submitting are the following: bin/spark-submit --class topology.SimpleProcessingTopology --master spark://10.0.0.8:7077 --jars /tmp/spark_streaming-1.0-SNAPSHOT.jar /tmp//tmp/spark_streaming-1.0-SNAPSHOT.jar /tmp/streaming.properties I’ve even tried using the

Re: Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Sean Owen
What is the --jars you are submitting? You may have conflicting copies of Spark classes that interfere. On Wed, Mar 1, 2017, 14:20 Dominik Safaric wrote: > I've been trying to submit a Spark Streaming application using > spark-submit to a cluster of mine consisting of

Re: Spark driver CPU usage

2017-03-01 Thread Yong Zhang
It won't control the CPU usage of the driver. You should check what the CPUs are doing on your driver side. But I just want to make sure you know that the full CPU usage on a 4-core Linux box will be 400%, so 100% really just makes one core busy. The driver does maintain the application web UI,

Spark Streaming - java.lang.ClassNotFoundException Scala anonymous function

2017-03-01 Thread Dominik Safaric
I've been trying to submit a Spark Streaming application using spark-submit to a cluster of mine consisting of a master and two worker nodes. The application has been written in Scala and built using Maven. Importantly, the Maven build is configured to produce a fat JAR containing all

RE: Spark driver CPU usage

2017-03-01 Thread Phadnis, Varun
Does that configuration parameter affect the CPU usage of the driver? If it does, we have that property unchanged from its default value of "1" yet the same behaviour as before. -Original Message- From: Rohit Verma [mailto:rohit.ve...@rokittech.com] Sent: 01 March 2017 06:08 To:

Re: Spark driver CPU usage

2017-03-01 Thread Rohit Verma
Use the conf spark.task.cpus to control the number of CPUs to use in a task. On Mar 1, 2017, at 5:41 PM, Phadnis, Varun wrote: > > Hello, > > Is there a way to control CPU usage for driver when running applications in > client mode? > > Currently we are observing that the
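For reference, a small sketch of setting that property programmatically; note that (as discussed elsewhere in this thread) it reserves cores per task on executors and does not throttle the driver process:

    import org.apache.spark.SparkConf

    // spark.task.cpus reserves this many cores for each *task* on an executor.
    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.task.cpus", "1")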

Spark driver CPU usage

2017-03-01 Thread Phadnis, Varun
Hello, Is there a way to control CPU usage for the driver when running applications in client mode? Currently we are observing that the driver occupies all the cores. Launching just 3 instances of the driver of the WordCount sample application concurrently on the same machine brings the usage of its 4

[Spark] Accumulators or count()

2017-03-01 Thread Charles O. Bajomo
Hello everyone, I wanted to know if there is any benefit to using an accumulator over just executing a count() on the whole RDD. There seem to be a lot of issues with accumulators during a stage failure, and there also seems to be an issue rebuilding them if the application restarts from a

Number of partitions in Dataset aggregations

2017-03-01 Thread Jakub Dubovsky
Hey all, I know I can control the number of partitions to be used during Dataset aggregations (groupBy, groupByKey, distinct, ...) via the spark.sql.shuffle.partitions configuration. Is there any specific reason why the Dataset API does not support passing the number of partitions explicitly to every call of
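A small sketch of the configuration mentioned above, plus an explicit repartition of the aggregation result as a per-operation alternative; spark, ds, and the column key are hypothetical, and the partition counts are illustrative:

    // Global setting: number of partitions used for shuffles in Dataset/SQL operations.
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    // Per-operation alternative: repartition the aggregation result explicitly.
    val counts = ds.groupBy("key").count().repartition(16)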

Re: Re: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-03-01 Thread Triones,Deng(vip.com)
Thanks for your email. My situation is: there is a Hive table partitioned by five minutes, and I want to write data every 30s into the HDFS location where the table is located. So when the first batch is delayed, the next batch may have the chance to touch the _SUCCESS file at the same

Re: using spark to load a data warehouse in real time

2017-03-01 Thread Sam Elamin
Hi Adaryl, Having come from a Web background myself, I completely understand your confusion, so let me try to clarify a few things. First and foremost, Spark is a data processing engine, not a general framework. In the Web applications and frameworks world you load the entities, map them to the UI