Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Writing: scala> df.write.orc("") To look into the contents, I used orc-tools-X.Y.Z-uber.jar ( https://orc.apache.org/docs/java-tools.html) On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan wrote: > How did you read/write the timestamp value from/to the ORC file? > > On Wed, Apr 24, 2019 at 6:30 PM

Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
The org.apache.spark.launcher.SparkLauncher is used to construct a spark-submit invocation programmatically, via a builder pattern. In our application, which uses a SparkLauncher internally, I would like to log to our log file the full spark-submit command that it will invoke, in order to aid in
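
For context, a minimal sketch of that builder pattern (the jar path, main class, and master below are placeholders, not details from this thread):

    import org.apache.spark.launcher.SparkLauncher

    // Hypothetical application jar and main class, for illustration only.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()  // returns a SparkAppHandle for monitoring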

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
Setting the SPARK_PRINT_LAUNCH_COMMAND env variable to 1 in the launcher env will make Spark code print the command to stderr. Not optimal but I think it's the only current option. On Wed, Apr 24, 2019 at 1:55 PM Jeff Evans wrote: > > The org.apache.spark.launcher.SparkLauncher is used to
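
A sketch of wiring that up through SparkLauncher's environment-map constructor, which passes extra environment variables to the child spark-submit process:

    import java.util.{HashMap => JHashMap}
    import org.apache.spark.launcher.SparkLauncher

    val env = new JHashMap[String, String]()
    env.put("SPARK_PRINT_LAUNCH_COMMAND", "1")

    // The child spark-submit process prints its full launch command to stderr.
    val launcher = new SparkLauncher(env)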

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
BTW the SparkLauncher API has hooks to capture the stderr of the spark-submit process into the logging system of the parent process. Check the API javadocs since it's been forever since I looked at that. On Wed, Apr 24, 2019 at 1:58 PM Marcelo Vanzin wrote: > > Setting the
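
The hooks mentioned here are the redirect methods on SparkLauncher; a sketch (the logger name is arbitrary):

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .redirectError()          // merge the child's stderr into its stdout
      .redirectToLog("my-app")  // route child output to a java.util.logging logger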

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Sebastian Piu
You could set the env var SPARK_PRINT_LAUNCH_COMMAND and spark-submit will print it, but it will be printed by the subprocess and not yours unless you redirect the stdout. Also, the command is what spark-submit generates, so it is quite a bit more verbose and includes the classpath etc. I think the

'No plan for EventTimeWatermark' error while using structured streaming with column pruning (spark 2.3.1)

2019-04-24 Thread kineret M
Hi All, I get a 'No plan for EventTimeWatermark' error when running a query with column pruning, using structured streaming with a custom data source that implements Spark Data Source V2. My data source implementation that handles the schemas includes the following: class MyDataSourceReader extends
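
The thread mentions Spark 2.3.1; for reference, a minimal sketch of the column-pruning contract in that version's Data Source V2 reader API (names other than the Spark interfaces are illustrative). A common pitfall is readSchema() not reflecting the schema passed to pruneColumns(), so the planner's and the reader's views diverge:

    import java.util.{Collections, List => JList}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory,
      DataSourceReader, SupportsPushDownRequiredColumns}
    import org.apache.spark.sql.types.StructType

    class MyDataSourceReader(fullSchema: StructType)
        extends DataSourceReader with SupportsPushDownRequiredColumns {

      private var requiredSchema: StructType = fullSchema

      // Spark hands over the pruned schema here ...
      override def pruneColumns(requiredSchema: StructType): Unit =
        this.requiredSchema = requiredSchema

      // ... and this must report that same pruned schema back.
      override def readSchema(): StructType = requiredSchema

      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
        // Build factories that emit only the columns in requiredSchema.
        Collections.emptyList()
      }
    }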

Handle Null Columns in Spark Structured Streaming Kafka

2019-04-24 Thread SNEHASISH DUTTA
Hi, While writing to Kafka using Spark Structured Streaming, if all the values in a certain column are null, the column gets dropped. Is there any way to override this, other than using the na.fill functions? Regards, Snehasish
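
One likely mechanism, if the Kafka payload is built with to_json: null fields are omitted from the generated JSON by default, so an all-null column vanishes from every record. Spark 3.0+ exposes an ignoreNullFields option for this; a sketch (df, the servers, topic, and checkpoint path are placeholders):

    import org.apache.spark.sql.functions.{col, struct, to_json}

    val out = df.select(
      to_json(
        struct(df.columns.map(col): _*),
        Map("ignoreNullFields" -> "false")  // requires Spark 3.0+
      ).alias("value"))

    out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("topic", "my-topic")
      .option("checkpointLocation", "/tmp/ckpt")
      .start()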

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
Did you re-create your df when you updated the timezone conf? On Wed, Apr 24, 2019 at 9:18 PM Shubham Chaurasia wrote: > Writing: > scala> df.write.orc("") > > To look into the contents, I used orc-tools-X.Y.Z-uber.jar ( > https://orc.apache.org/docs/java-tools.html) > > On Wed, Apr 24, 2019 at

RDD vs Dataframe & when to persist

2019-04-24 Thread Rishi Shah
Hello All, I run into situations where I ask myself whether I should write a mapPartitions function on an RDD or use the DataFrame approach all the way (withColumn + groupBy). I am using PySpark 2.3 (Python 2.7). I understand we should be utilizing DataFrames as much as possible, but at times it feels like RDD
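
To make the trade-off concrete, a small sketch of the same aggregation both ways (Scala here; the thread is PySpark, where the RDD route additionally pays Python serialization costs per row; a df with columns k and v is assumed):

    import org.apache.spark.sql.functions.sum

    // DataFrame route: stays inside the Catalyst/Tungsten engine.
    val viaDf = df.groupBy("k").agg(sum("v"))

    // RDD route: full per-partition control, but opaque to the optimizer.
    val viaRdd = df.rdd
      .map(r => (r.getAs[String]("k"), r.getAs[Long]("v")))
      .reduceByKey(_ + _)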

[pyspark] Use output of one aggregated function for another aggregated function within the same groupby

2019-04-24 Thread Rishi Shah
Hi All, [PySpark 2.3, Python 2.7] I would like to achieve something like this; could you please suggest the best way to implement it (perhaps highlighting the pros & cons of the approach in terms of performance)? df = df.groupby('grp_col').agg(max(date).alias('max_date'), count(when col('file_date') ==

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
Thanks for the pointers. We figured out the stdout/stderr capture piece. I was just looking to capture the full command in order to help debug issues we run into with the submit, depending on various combinations of all the parameters/classpath, and also to isolate job-specific issues from our

Re: [pyspark] Use output of one aggregated function for another aggregated function within the same groupby

2019-04-24 Thread Georg Heiler
Use analytical window functions to rank the result and then filter to the desired rank. Rishi Shah wrote on Thu, 25 Apr 2019 at 05:07: > Hi All, > > [PySpark 2.3, Python 2.7] > > I would like to achieve something like this; could you please suggest the > best way to implement it (perhaps highlighting
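
A sketch of that window-function idea applied to the original question (Scala here; the pyspark.sql.functions/Window API is analogous; a df with columns grp_col and file_date is assumed). The per-group max is computed with a window first, so a later step can refer to it:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max}

    val w = Window.partitionBy("grp_col")
    val result = df
      .withColumn("max_date", max(col("file_date")).over(w)) // per-group max on every row
      .where(col("file_date") === col("max_date"))           // keep only rows at the max
      .groupBy("grp_col", "max_date")
      .count()                                               // rows where file_date == max_date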

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
Hi All, Can anyone explain? Thanks, rajat On Sun, 21 Apr 2019, 00:18 kumar.rajat20del wrote: Hi Spark Users, > > repartition and partitionBy seem to be very much the same on a DataFrame. > In which scenario do we use each one? > > As per my understanding, repartition is a very expensive operation as it needs > a full shuffle, then

Re: repartition in df vs partitionBy in df

2019-04-24 Thread moqi
Hello, I think you can refer to this link; I hope it helps you. https://stackoverflow.com/questions/40416357/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby/40417992 -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
Hello, thanks for the quick reply. Got it: partitionBy is for creating something like Hive partitions. But when do we actually use repartition? How do I decide whether to repartition or not, given that in development we only get sample data? Also, what number should I give when repartitioning? Thanks On
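
For readers following along, a minimal sketch of the distinction (the column k, the partition count, and the output path are placeholders):

    import org.apache.spark.sql.functions.col

    // repartition: shuffles the data into N in-memory partitions; it governs
    // parallelism of subsequent stages and the file count on write.
    val shuffled = df.repartition(200, col("k"))

    // partitionBy: a DataFrameWriter setting that lays the output out in
    // Hive-style directories (k=a/, k=b/, ...) for partition pruning on read.
    shuffled.write.partitionBy("k").parquet("/tmp/out")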

Re: repartition in df vs partitionBy in df

2019-04-24 Thread moqi
Hello, there is another link that discusses the difference between the two methods: https://stackoverflow.com/questions/33831561/pyspark-repartition-vs-partitionby In particular, when you are faced with possible data skew or have some partitioning parameters that need to be obtained at runtime, you

spark stddev() giving '?' as output how to handle it ? i.e replace null/0

2019-04-24 Thread Shyam P
https://stackoverflow.com/questions/55823608/how-to-handle-spark-stddev-function-output-value-when-there-there-is-no-data Regards, Shyam
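
A hedged sketch of one way to map that output to 0 (assuming the '?' is how the client renders the NaN/null that stddev_samp yields for groups with fewer than two rows; a df with columns k and v is a placeholder):

    import org.apache.spark.sql.functions.{coalesce, lit, nanvl, stddev_samp}

    // nanvl handles the NaN case (a single row); coalesce handles null (no rows).
    val agg = df.groupBy("k")
      .agg(coalesce(nanvl(stddev_samp("v"), lit(0.0)), lit(0.0)).alias("sd"))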

Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-24 Thread Mike Chan
Dear all, I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760. Then we proceeded to perform the query. In the SQL plan, we
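
A quick way to test whether the broadcast itself triggers the failure is to flip the threshold and compare plans; a sketch (df1, df2, and k are placeholders):

    // 10 MB threshold, as in the thread; -1 disables broadcast joins entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
    df1.join(df2, "k").explain()  // expect BroadcastHashJoin if df2 fits under the threshold

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    df1.join(df2, "k").explain()  // expect SortMergeJoin once broadcasting is off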

DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Hi All, Consider the following (Spark v2.4.0): basically, I change values of `spark.sql.session.timeZone` and perform an ORC write. Here are 3 samples: 1) scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata") scala> val df = sc.parallelize(Seq("2019-04-23
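
A sketch of that experiment for readers who want to reproduce it (the literal and output path are placeholders; note that Spark stores TimestampType as an instant, with spark.sql.session.timeZone affecting parsing and display rather than the stored value, which may explain why the raw bytes look "unadjusted" in orc-tools):

    import spark.implicits._
    import org.apache.spark.sql.functions.to_timestamp

    spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")

    // The string is parsed in the session time zone, then stored as an instant.
    val df = Seq("2019-04-23 09:15:04").toDF("ts_str")
      .select(to_timestamp($"ts_str").alias("ts"))
    df.write.orc("/tmp/orc_tz_test")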

Handle empty partitions in pyspark

2019-04-24 Thread kanchan tewary
Hi All, I have a situation where the RDD has some empty partitions, which I would like to identify and handle while applying mapPartitions or similar functions. Is there a way to do this in PySpark? The isEmpty method works on the whole RDD only and cannot be applied per partition. Much appreciated. Code
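
The check can live inside mapPartitions itself, on the partition's iterator (a Scala sketch; in PySpark the same idea works by peeking at the iterator's first element, e.g. with next() and a StopIteration guard):

    // With 3 elements across 8 slices, most partitions are empty.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 8)
    val out = rdd.mapPartitions { it =>
      // Iterator.isEmpty is just !hasNext, so nothing is consumed here.
      if (it.isEmpty) Iterator.empty   // identify/handle the empty partition
      else it.map(_ * 2)               // normal processing path
    }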

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to the ORC file? On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia wrote: > Hi All, > > Consider the following (Spark v2.4.0): > > Basically I change values of `spark.sql.session.timeZone` and perform an > ORC write. Here are 3 samples: > > 1) >