Re: Hive to Oracle using Spark - Type(Date) conversion issue

2018-03-19 Thread Gurusamy Thirupathy
Hi guha, Thanks for your quick response; options a and b are in our table already. For option b, again the same problem: we don't know which column is a date. Thanks, -G On Sun, Mar 18, 2018 at 9:36 PM, Deepak Sharma wrote: > The other approach would be to write to temp
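
One way to address the "we don't know which column is a date" problem (not from the thread; a hedged sketch with illustrative database and table names) is to inspect the DataFrame schema at runtime and collect the columns whose declared type is date or timestamp:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, TimestampType

spark = (SparkSession.builder
         .appName("hive-to-oracle")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical Hive table; replace with the real source table.
df = spark.table("my_db.my_table")

# Discover date/timestamp columns from the schema instead of hard-coding them.
date_cols = [f.name for f in df.schema.fields
             if isinstance(f.dataType, (DateType, TimestampType))]
print(date_cols)
```

This only helps if the Hive schema actually declares the columns as date/timestamp; dates stored as strings would need separate detection logic.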

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Yes, it indeed makes sense! Is there a way to get incremental counts when I start from 0 and go through 10M records? Perhaps a count for every micro-batch or something? On Mon, Mar 19, 2018 at 1:57 PM, Geoff Von Allmen wrote: > Trigger does not mean report the current
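
Not an answer from the thread, but one hedged way to observe per-micro-batch counts is the streaming query's progress API, which reports how many rows each completed trigger ingested. A minimal self-contained sketch, using the built-in rate source as a stand-in for the Kafka query:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("progress-demo").getOrCreate()

# Stand-in query; in the thread this would be the Kafka count query.
query = (spark.readStream.format("rate").load()
         .groupBy().count()
         .writeStream.outputMode("complete").format("console").start())

# Each completed micro-batch reports how many rows it ingested.
for _ in range(10):
    progress = query.lastProgress  # None until the first batch completes
    if progress is not None:
        print(progress["batchId"], progress["numInputRows"])
    time.sleep(1)

query.stop()
```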

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread Geoff Von Allmen
Trigger does not mean report the current result every 'trigger seconds'. It means it will attempt to fetch new data and process it no faster than the trigger interval. If you're reading from the beginning and you've got 10M entries in kafka, it's likely pulling everything down then

select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Hi All, I have 10 million records in Kafka and I am just trying to run spark.sql("select count(*) from kafka_view"). I am reading from Kafka and writing to Kafka. My writeStream is set to "update" mode with a trigger interval of one second (Trigger.ProcessingTime(1000)). I expect the counts to be
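
A hedged sketch of the setup described (broker, topic, and checkpoint names are placeholders): read from Kafka, maintain a single running count, and write it back to Kafka in update mode with a one-second trigger. In the Python API the trigger is expressed as processingTime rather than Trigger.ProcessingTime(1000):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, to_json

spark = SparkSession.builder.appName("kafka-count").getOrCreate()

# Placeholder broker and topic names.
source = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "input_topic")
          .option("startingOffsets", "earliest")
          .load())

# A single global count, updated on every micro-batch.
counts = source.groupBy().count()

# The Kafka sink expects a string or binary `value` column.
query = (counts.select(to_json(struct("count")).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "output_topic")
         .option("checkpointLocation", "/tmp/kafka-count-chk")
         .outputMode("update")
         .trigger(processingTime="1 second")
         .start())
query.awaitTermination()
```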

Re: Running out of space on the /tmp file system while running a Spark job on YARN because of the size of the blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi Keith, Thank you for the idea! I have tried it, and now the executor command looks like this: /bin/bash -c /usr/java/latest//bin/java -server -Xmx51200m '-Djava.io.tmpdir=my_prefered_path'

Re: Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Burak Yavuz
I believe the docs are out of date regarding distinct. The behavior should be as follows:
- Distinct should be applied across triggers.
- In order to prevent the state from growing indefinitely, you need to add a watermark.
- If you don't have a watermark, but your key space is small, that's
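
A hedged sketch of the watermark-plus-deduplication pattern Burak describes (column names are illustrative, and the rate source stands in for the real one); dropDuplicates keeps state keyed on the listed columns, and the watermark lets Spark drop state older than the threshold:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-dedup").getOrCreate()

# Stand-in streaming source with an event-time column and a business key.
events = (spark.readStream
          .format("rate").option("rowsPerSecond", 10).load()
          .withColumnRenamed("timestamp", "eventTime")
          .withColumnRenamed("value", "id"))

# Deduplicates across triggers; the watermark bounds how long state is kept.
deduped = (events
           .withWatermark("eventTime", "10 minutes")
           .dropDuplicates(["id", "eventTime"]))

query = deduped.writeStream.outputMode("append").format("console").start()
```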

Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Geoff Von Allmen
I see in the documentation that the distinct operation is not supported in Structured Streaming. That being said, I have noticed that you are able to successfully call distinct() on a data

Re: Running out of space on the /tmp file system while running a Spark job on YARN because of the size of the blockmgr folder

2018-03-19 Thread Keith Chapman
Can you try setting spark.executor.extraJavaOptions to have -Djava.io.tmpdir=someValue Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma wrote: > Hi Keith, > > Thank you for your answer! > I have done this, and it is working for
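
A sketch of how that suggestion could be wired up at session creation (the path is illustrative and must exist on every executor node); note there is no space between -D and the property name:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("custom-tmpdir")
         # Illustrative path; executor JVMs will use it as java.io.tmpdir.
         .config("spark.executor.extraJavaOptions",
                 "-Djava.io.tmpdir=/data/tmp")
         .getOrCreate())
```

The option must be set before the executors launch; it has no effect on an already-running session.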

Re: Running out of space on the /tmp file system while running a Spark job on YARN because of the size of the blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi Keith, Thank you for your answer! I have done this, and it is working for the Spark driver. I would like to do something like this for the executors as well, so that the setting will be used on all the nodes where I have executors running. Best, Michael On Mon, Mar 19, 2018 at 6:07 PM, Keith

Re: Accessing a file that was passed via --files to spark submit

2018-03-19 Thread Marcelo Vanzin
From spark-submit -h: --files FILES Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName). On Sun, Mar
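
A minimal sketch of the pattern Marcelo describes (the file name is illustrative), assuming the job was submitted with --files /local/path/lookup.txt:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-demo").getOrCreate()

def count_lines(_):
    # Resolve the executor-local copy of the distributed file.
    path = SparkFiles.get("lookup.txt")
    with open(path) as f:
        return sum(1 for _ in f)

print(spark.sparkContext.parallelize([1]).map(count_lines).collect())
```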

Re: Running out of space on the /tmp file system while running a Spark job on YARN because of the size of the blockmgr folder

2018-03-19 Thread Keith Chapman
Hi Michael, You could either set spark.local.dir through the Spark conf or the java.io.tmpdir system property. Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote: > Hi everybody, > > I am running a Spark job on YARN, and my problem is
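
A sketch of the spark.local.dir variant (the directory is illustrative). One caveat worth hedging: on YARN, executor local directories are normally taken from yarn.nodemanager.local-dirs, which can override this setting:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("local-dir-demo")
         # Illustrative scratch location; blockmgr-* and spill files land here.
         .config("spark.local.dir", "/data/spark-scratch")
         .getOrCreate())
```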

Running out of space on the /tmp file system while running a Spark job on YARN because of the size of the blockmgr folder

2018-03-19 Thread Michael Shtelma
Hi everybody, I am running a Spark job on YARN, and my problem is that the blockmgr-* folders are being created under /tmp/hadoop-msh/nm-local-dir/usercache/msh/appcache/application_id/* This folder can grow to a significant size and does not really fit into the /tmp file system, for one

Re: Calling Pyspark functions in parallel

2018-03-19 Thread Debabrata Ghosh
Thanks, Jules! Appreciate it a lot indeed! On Mon, Mar 19, 2018 at 7:16 PM, Jules Damji wrote: > What’s your PySpark function? Is it a UDF? If so, consider using a pandas UDF > introduced in Spark 2.3. > > More info here: https://databricks.com/blog/2017/10/30/introducing-

Warnings on data insert into Hive Table using PySpark

2018-03-19 Thread Shahab Yunus
Hi there. When I try to insert data into Hive tables using the following query, I get the warnings below. The data is inserted fine (the query works without warnings directly in the Hive CLI as well). What is the reason for these warnings and how can we get rid of them? I am using pyspark

[Spark Structured Streaming, Spark 2.3.0] Calling current_timestamp() function within a streaming dataframe results in dataType error

2018-03-19 Thread Artem Moskvin
Hi all, There's probably a regression in Spark 2.3.0. Running the code below in 2.2.1 succeeds, but in 2.3.0 it fails with the error `org.apache.spark.sql.streaming.StreamingQueryException: Invalid call to dataType on unresolved object, tree: 'current_timestamp`. ``` import
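
The repro code is truncated above; what follows is a hedged sketch of the general pattern being reported (calling current_timestamp() on a streaming DataFrame), not necessarily the author's exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("ts-repro").getOrCreate()

# Streaming DataFrame with a processing-time column added via current_timestamp().
stream = (spark.readStream
          .format("rate").option("rowsPerSecond", 1).load()
          .withColumn("ts", current_timestamp()))

query = stream.writeStream.format("console").start()
query.awaitTermination()
```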

Re: Calling Pyspark functions in parallel

2018-03-19 Thread Jules Damji
What’s your PySpark function? Is it a UDF? If so, consider using a pandas UDF, introduced in Spark 2.3. More info here: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html Sent from my iPhone Pardon the dumb thumb typos :) > On Mar 18, 2018, at 10:54 PM,
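
A minimal sketch of the Spark 2.3 pandas (vectorized) UDF API described in the linked post; the function and column names are illustrative, and pyarrow must be installed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# Receives a whole pandas Series per batch instead of one row at a time.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1.0

spark.range(0, 1000).withColumn("x", plus_one("id")).show(5)
```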

Re: Run Spark 2.2 on YARN as a usual Java application

2018-03-19 Thread Serega Sheypak
Hi Jörn, thanks for your reply. Oozie starts the Oozie java action as a single "long running" MapReduce mapper. This mapper is responsible for calling the main class. The main class belongs to the user, and this main class starts the Spark job. yarn-cluster is not an option for me. I have to do something special to

Re: Run Spark 2.2 on YARN as a usual Java application

2018-03-19 Thread Jörn Franke
Maybe you should run it in yarn-cluster mode instead. Yarn-client would start the driver on the Oozie server. > On 19. Mar 2018, at 12:58, Serega Sheypak wrote: > > I'm trying to run it as an Oozie java action and reduce env dependencies. The only > thing I need is Hadoop

Re: Run Spark 2.2 on YARN as a usual Java application

2018-03-19 Thread Serega Sheypak
I'm trying to run it as an Oozie java action and reduce env dependencies. The only thing I need is the Hadoop Configuration to talk to HDFS and YARN. Spark submit is a shell thing; I'm trying to do everything from the JVM. The Oozie java action starts a main class which instantiates SparkConf and a session. It works well in local
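
Not an answer from the thread, but a hedged sketch of the general idea in PySpark (for consistency with the other examples here): creating a YARN-backed session programmatically, without spark-submit, assuming HADOOP_CONF_DIR points at the cluster configuration and the Spark jars are reachable. All paths are illustrative:

```python
import os
from pyspark.sql import SparkSession

# spark-submit would normally arrange the cluster configuration;
# here it must already be present in the environment. Illustrative path.
os.environ.setdefault("HADOOP_CONF_DIR", "/etc/hadoop/conf")

spark = (SparkSession.builder
         .master("yarn")
         .appName("no-spark-submit")
         # Illustrative HDFS location of the Spark jars, so YARN containers can fetch them.
         .config("spark.yarn.jars", "hdfs:///apps/spark/jars/*")
         .getOrCreate())

print(spark.range(10).count())
spark.stop()
```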

Re: Run Spark 2.2 on YARN as a usual Java application

2018-03-19 Thread Jacek Laskowski
Hi, What's the deployment process then (if not using spark-submit)? How is the AM deployed? Why would you want to skip spark-submit? Jacek On 19 Mar 2018 00:20, "Serega Sheypak" wrote: > Hi, Is it even possible to run Spark on YARN as a usual Java application? > I've