Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
Thanks, Marcelo. Will give it a shot tomorrow. -Alex On 8/9/17, 5:59 PM, "Marcelo Vanzin" wrote: Jars distributed using --jars are not added to the system classpath, so log4j cannot see them. To work around that, you need to manually add the *name* of the jar to

Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
hello guys: I have a simple rdd like: val userIDs = 1 to 1; val rdd1 = sc.parallelize(userIDs, 16) // this rdd has 1 user id. And I have a List[String] like below: scala> listForRule77 res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800,

[Structured Streaming] Recommended way of joining streams

2017-08-09 Thread Priyank Shrivastava
I have streams of data coming in from various applications through Kafka. These streams are converted into dataframes in Spark. I would like to join these dataframes on a common ID they all contain. Since joining streaming dataframes is currently not supported, what is the current recommended

Re: Multiple queries on same stream

2017-08-09 Thread Jörn Franke
This is not easy to say without testing. It depends on the type of computation, etc. It also depends on the Spark version. Generally, vectorization / SIMD could be much faster if it is applied by Spark / the JVM in scenario 2. > On 9. Aug 2017, at 07:05, Raghavendra Pandey

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
It's just sort of an inner join operation... If the second dataset isn't very large it's ok (btw, you can use flatMap directly instead of map followed by flatMap/flatten), otherwise you can register the second one as an RDD/dataset and join them on user id. On Wed, Aug 9, 2017 at 4:29 PM,
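
A minimal sketch of the flatMap route Ryan mentions, reusing the rdd1 and listForRule77 names from the question; the broadcast is an assumption, not something stated in the thread:

    // emit one (userId, rule) record per rule entry, instead of map + flatten
    val rules = sc.broadcast(listForRule77)
    val expanded = rdd1.flatMap { userId =>
      rules.value.map(rule => (userId, rule))
    }
    expanded.take(5).foreach(println)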

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Riccardo and Ryan, thank you for your ideas. It seems that crossJoin is a new Dataset API after Spark 2.x. My Spark version is 1.6.3. Is there an equivalent API to do a cross join? Thank you. Thanks Best regards! San.Luo - Original Message - From: Riccardo

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread ayan guha
If you use join without any condition it becomes a cross join. In SQL, it looks like: Select a.*, b.* from a join b On Wed, 9 Aug 2017 at 7:08 pm, wrote: > Riccardo and Ryan > Thank you for your ideas. It seems that crossJoin is a new Dataset API > after Spark 2.x. > my
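
A short sketch of the condition-less join in the DataFrame API, reusing the df1 and listDF names from Riccardo's reply; in the 1.6 DataFrame API, join(right) with no join expression is a Cartesian join:

    // no join condition => Cartesian (cross) join
    val crossed = df1.join(listDF)
    crossed.orderBy("userid").show(60)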

Re: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread Matteo Cossu
Hello, try to use these options when starting Spark: --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" In this way you will be sure that the executor and the driver of Spark will use the classpath you define. Best Regards, Matteo Cossu On 5

Re: [Structured Streaming] Recommended way of joining streams

2017-08-09 Thread Tathagata Das
Writing streams into some sink (preferably a fault-tolerant, exactly-once sink, see docs) and then joining is definitely a possible way. But you will likely incur higher latency. If you want lower latency, then stream-stream joins are the best approach, which we are working on right now. Spark 2.3 is
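
A rough sketch of the write-to-a-sink-then-join idea, with hypothetical streamA/streamB/spark names and paths; this is only one way to read the suggestion, not a definitive recipe from the thread:

    // materialize one stream to a fault-tolerant sink (parquet files + checkpoint)
    val sinkQuery = streamA.writeStream
      .format("parquet")
      .option("path", "/data/streamA")
      .option("checkpointLocation", "/checkpoints/streamA")
      .start()

    // later, read the materialized data back as a static DataFrame and join it
    // with the other stream (stream-static joins are supported)
    val staticA = spark.read.parquet("/data/streamA")
    val joined = streamB.join(staticA, "id")

The joined stream would still need its own writeStream query to actually run.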

Re: Multiple queries on same stream

2017-08-09 Thread Tathagata Das
It's important to note that running multiple streaming queries, as of today, would read the input data that many times. So there is a trade-off between the two approaches. So even though scenario 1 won't get great Catalyst optimization, it may be more efficient overall in terms of resource

Re: Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread luohui20001
Thank you guys, I got my code working like below: val record75df = sc.parallelize(listForRule75, numPartitions).map(x => x.replace("|", ",")).map(_.split(",")).map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt)).toDF() val userids = 1 to 1 val uiddf =

Re: Reusing dataframes for streaming (spark 1.6)

2017-08-09 Thread Tathagata Das
There is a DStream.transform() that does exactly this. On Tue, Aug 8, 2017 at 7:55 PM, Ashwin Raju wrote: > Hi, > > We've built a batch application on Spark 1.6.1. I'm looking into how to > run the same code as a streaming (DStream based) application. This is using > pyspark.
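
A minimal Scala sketch of the DStream.transform() pattern pointed to here (the original question is pyspark, but the shape is the same; ssc and the batch logic are assumptions):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // stand-in for the existing batch logic, written against plain RDDs
    def batchLogic(rdd: RDD[String]): RDD[(String, Int)] =
      rdd.map(line => (line, line.length))

    // reuse that same function on every micro-batch of the stream
    val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)
    val transformed: DStream[(String, Int)] = lines.transform(batchLogic _)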

Re[2]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread toletum
Thanks Matteo, I fixed it. Regards, JCS On Wed., Aug. 9, 2017 at 11:22, Matteo Cossu wrote: Hello, try to use these options when starting Spark: --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" In this way you will be sure that the executor and the

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Riccardo Ferrari
Depends on your Spark version; have you considered the Dataset API? You can do something like: val df1 = rdd1.toDF("userid") val listRDD = sc.parallelize(listForRule77) val listDF = listRDD.toDF("data") df1.crossJoin(listDF).orderBy("userid").show(60, truncate = false)

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
RDD has a cartesian method. On Wed, Aug 9, 2017 at 5:12 PM, ayan guha wrote: > If you use join without any condition it becomes a cross join. In SQL, it > looks like > > Select a.*, b.* from a join b > > On Wed, 9 Aug 2017 at 7:08 pm, wrote: > >> Riccardo
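
For the 1.6 RDD API the original poster is on, a minimal sketch of the cartesian route, reusing rdd1 and listForRule77 from the question:

    // pair every user id with every rule string
    val rulesRdd = sc.parallelize(listForRule77)
    val pairs = rdd1.cartesian(rulesRdd)   // RDD[(Int, String)]
    pairs.take(5).foreach(println)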

Re: Re[2]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread Gourav Sengupta
Hi, Just out of sheer curiosity - why are you using SPARK 1.6? Since then SPARK has made significant advancement and improvement, why not take advantage of that? Regards, Gourav On Wed, Aug 9, 2017 at 10:41 AM, toletum wrote: > Thanks Matteo > > I fixed it > > Regards,

Re[4]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread toletum
Yes... I know... but The cluster is not administered by me On Mié., Ago. 9, 2017 at 13:46, Gourav Sengupta wrote: Hi, Just out of sheer curiosity - why are you using SPARK 1.6? Since then SPARK has made significant advancement and improvement, why not take advantage of that? Regards,

StreamingQueryListener Spark Structured Streaming

2017-08-09 Thread purna pradeep
I'm working on a structured streaming application wherein I'm reading from Kafka as a stream, and for each batch of streams I need to do a lookup against an S3 file (which is nearly 200gb) to fetch some attributes. So I'm using df.persist() (basically caching the lookup), but I need to refresh the dataframe as the
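
A rough sketch of the setup the poster describes, caching the lookup and rebuilding it on demand; the S3 path, format, and refresh trigger are all assumptions, and this is not an answer from the thread:

    // hypothetical lookup source; cache it once, rebuild it when a refresh is needed
    var lookupDf = spark.read.parquet("s3a://bucket/lookup/").persist()

    def refreshLookup(): Unit = {
      lookupDf.unpersist()
      lookupDf = spark.read.parquet("s3a://bucket/lookup/").persist()
    }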

Has anyone come across incorta, which relies on Spark, Parquet and open source libraries?

2017-08-09 Thread Mich Talebzadeh
Hi, There is a tool called incorta that uses Spark, Parquet and open source big data analytics libraries. Its aim is to accelerate analytics. It claims to incorporate Direct Data Mapping to deliver near real-time analytics on top of original, intricate, transactional data such as ERP

Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-09 Thread Cody Koeninger
org.apache.spark.streaming.kafka.KafkaCluster has methods getLatestLeaderOffsets and getEarliestLeaderOffsets On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande wrote: > Thanks TD. > > On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das > wrote: >>

--jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
I have log4j JSON layout jars added via spark-submit on EMR: /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar --driver-java-options "-XX:+AlwaysPreTouch -XX:MaxPermSize=6G" --class

Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Marcelo Vanzin
Jars distributed using --jars are not added to the system classpath, so log4j cannot see them. To work around that, you need to manually add the *name* of the jar to the driver/executor classpaths: spark.driver.extraClassPath=some.jar spark.executor.extraClassPath=some.jar In client mode you should

Spark SVD benchmark for dense matrices

2017-08-09 Thread Jose Francisco Saray Villamizar
Hi everyone, I am trying to invert a 5000 x 5000 dense matrix (99% non-zeros) by using SVD, with an approach similar to: https://stackoverflow.com/questions/29969521/how-to-compute-the-inverse-of-a-rowmatrix-in-apache-spark The time I'm getting with SVD is close to 10 minutes, which is very long
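
For reference, a minimal sketch of the RowMatrix SVD step underlying that approach, with a randomly filled matrix standing in for the real data; the pseudo-inverse would then be built from V, 1/s and U as in the linked answer:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val n = 5000
    // hypothetical dense input: n rows of n random doubles
    val rows = sc.parallelize(0 until n, 16)
      .map(_ => Vectors.dense(Array.fill(n)(scala.util.Random.nextDouble())))
    val mat = new RowMatrix(rows)

    // full SVD with U, as in the Stack Overflow approach
    val svd = mat.computeSVD(n, computeU = true)
    val singularValues = svd.s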