Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Writing: scala> df.write.orc("") To look into the contents, I used orc-tools-X.Y.Z-uber.jar ( https://orc.apache.org/docs/java-tools.html) On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan wrote: > How did you read/write the timestamp value from/to the ORC file? > > On Wed, Apr 24, 2019 at 6:30 PM

Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
The org.apache.spark.launcher.SparkLauncher is used to construct a spark-submit invocation programmatically, via a builder pattern. In our application, which uses a SparkLauncher internally, I would like to log to our log file the full spark-submit command that it will invoke, in order to aid in
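
For context, a minimal sketch of that builder pattern (the jar path, main class, and master below are placeholders, not details from this thread):

    import org.apache.spark.launcher.SparkLauncher

    // Hypothetical application jar and main class, for illustration only.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.MyApp")
      .setMaster("yarn")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()  // returns a SparkAppHandle for monitoring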

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
Setting the SPARK_PRINT_LAUNCH_COMMAND env variable to 1 in the launcher env will make Spark code print the command to stderr. Not optimal but I think it's the only current option. On Wed, Apr 24, 2019 at 1:55 PM Jeff Evans wrote: > > The org.apache.spark.launcher.SparkLauncher is used to
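
A sketch of wiring that up through SparkLauncher's environment-map constructor, which passes extra environment variables to the child spark-submit process:

    import java.util.{HashMap => JHashMap}
    import org.apache.spark.launcher.SparkLauncher

    val env = new JHashMap[String, String]()
    env.put("SPARK_PRINT_LAUNCH_COMMAND", "1")

    // The child spark-submit process prints its full launch command to stderr.
    val launcher = new SparkLauncher(env)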

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Marcelo Vanzin
BTW the SparkLauncher API has hooks to capture the stderr of the spark-submit process into the logging system of the parent process. Check the API javadocs since it's been forever since I looked at that. On Wed, Apr 24, 2019 at 1:58 PM Marcelo Vanzin wrote: > > Setting the
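
The hooks mentioned here are the redirect methods on SparkLauncher; a sketch (the logger name is arbitrary):

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .redirectError()          // merge the child's stderr into its stdout
      .redirectToLog("my-app")  // route child output to a java.util.logging logger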

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Sebastian Piu
You could set the env var SPARK_PRINT_LAUNCH_COMMAND and spark-submit will print it, but it will be printed by the subprocess and not yours unless you redirect the stdout. Also, the command is what spark-submit generates, so it is quite a bit more verbose and includes the classpath etc. I think the

'No plan for EventTimeWatermark' error while using structured streaming with column pruning (spark 2.3.1)

2019-04-24 Thread kineret M
Hi All, I get a 'No plan for EventTimeWatermark' error when running a query with column pruning, using structured streaming with a custom data source that implements Spark Data Source V2. My data source implementation that handles the schemas includes the following: class MyDataSourceReader extends
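
The thread mentions Spark 2.3.1; for reference, a minimal sketch of the column-pruning contract in that version's Data Source V2 reader API (names other than the Spark interfaces are illustrative). A common pitfall is readSchema() not reflecting the schema passed to pruneColumns(), so the planner's and the reader's views diverge:

    import java.util.{Collections, List => JList}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory,
      DataSourceReader, SupportsPushDownRequiredColumns}
    import org.apache.spark.sql.types.StructType

    class MyDataSourceReader(fullSchema: StructType)
        extends DataSourceReader with SupportsPushDownRequiredColumns {

      private var requiredSchema: StructType = fullSchema

      // Spark hands over the pruned schema here ...
      override def pruneColumns(requiredSchema: StructType): Unit =
        this.requiredSchema = requiredSchema

      // ... and this must report that same pruned schema back.
      override def readSchema(): StructType = requiredSchema

      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] = {
        // Build factories that emit only the columns in requiredSchema.
        Collections.emptyList()
      }
    }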

Handle Null Columns in Spark Structured Streaming Kafka

2019-04-24 Thread SNEHASISH DUTTA
Hi, While writing to Kafka using Spark Structured Streaming, if all the values in a certain column are null, the column gets dropped. Is there any way to override this, other than using the na.fill functions? Regards, Snehasish
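
One likely mechanism, if the Kafka payload is built with to_json: null fields are omitted from the generated JSON by default, so an all-null column vanishes from every record. Spark 3.0+ exposes an ignoreNullFields option for this; a sketch (df, the servers, topic, and checkpoint path are placeholders):

    import org.apache.spark.sql.functions.{col, struct, to_json}

    val out = df.select(
      to_json(
        struct(df.columns.map(col): _*),
        Map("ignoreNullFields" -> "false")  // requires Spark 3.0+
      ).alias("value"))

    out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("topic", "my-topic")
      .option("checkpointLocation", "/tmp/ckpt")
      .start()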

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
Did you re-create your df when you updated the timezone conf? On Wed, Apr 24, 2019 at 9:18 PM Shubham Chaurasia wrote: > Writing: > scala> df.write.orc("") > > To look into the contents, I used orc-tools-X.Y.Z-uber.jar ( > https://orc.apache.org/docs/java-tools.html) > > On Wed, Apr 24, 2019 at

RDD vs Dataframe & when to persist

2019-04-24 Thread Rishi Shah
Hello All, I run into situations where I ask myself whether I should write a mapPartitions function on an RDD or use the DataFrame approach all the way (withColumn + groupBy). I am using PySpark 2.3 (Python 2.7). I understand we should be utilizing DataFrames as much as possible, but at times it feels like RDD
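
To make the trade-off concrete, a small sketch of the same aggregation both ways (Scala here; the thread is PySpark, where the RDD route additionally pays Python serialization costs per row; a df with columns k and v is assumed):

    import org.apache.spark.sql.functions.sum

    // DataFrame route: stays inside the Catalyst/Tungsten engine.
    val viaDf = df.groupBy("k").agg(sum("v"))

    // RDD route: full per-partition control, but opaque to the optimizer.
    val viaRdd = df.rdd
      .map(r => (r.getAs[String]("k"), r.getAs[Long]("v")))
      .reduceByKey(_ + _)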

[pyspark] Use output of one aggregated function for another aggregated function within the same groupby

2019-04-24 Thread Rishi Shah
Hi All, [PySpark 2.3, Python 2.7] I would like to achieve something like this; could you please suggest the best way to implement it (perhaps highlighting the pros & cons of the approach in terms of performance)? df = df.groupby('grp_col').agg(max(date).alias('max_date'), count(when col('file_date') ==

Re: Is it possible to obtain the full command to be invoked by SparkLauncher?

2019-04-24 Thread Jeff Evans
Thanks for the pointers. We figured out the stdout/stderr capture piece. I was just looking to capture the full command in order to help debug issues we run into with the submit, depending on various combinations of all the parameters/classpath, and also to isolate job-specific issues from our

Re: [pyspark] Use output of one aggregated function for another aggregated function within the same groupby

2019-04-24 Thread Georg Heiler
Use analytical window functions to rank the result and then filter to the desired rank. Rishi Shah wrote on Thu, 25 Apr 2019 at 05:07: > Hi All, > > [PySpark 2.3, Python 2.7] > > I would like to achieve something like this; could you please suggest the > best way to implement it (perhaps highlighting
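
A sketch of that window-function idea applied to the original question (Scala here; the pyspark.sql.functions/Window API is analogous; a df with columns grp_col and file_date is assumed). The per-group max is computed with a window first, so a later step can refer to it:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max}

    val w = Window.partitionBy("grp_col")
    val result = df
      .withColumn("max_date", max(col("file_date")).over(w)) // per-group max on every row
      .where(col("file_date") === col("max_date"))           // keep only rows at the max
      .groupBy("grp_col", "max_date")
      .count()                                               // rows where file_date == max_date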

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
Hi All, Can anyone explain? Thanks, rajat On Sun, 21 Apr 2019, 00:18 kumar.rajat20del wrote: Hi Spark Users, > > repartition and partitionBy seem to be very much the same on a DataFrame. > In which scenario do we use each one? > > As per my understanding, repartition is a very expensive operation as it needs > a full shuffle, then

Re: repartition in df vs partitionBy in df

2019-04-24 Thread moqi
Hello, I think you can refer to this link; I hope it helps you. https://stackoverflow.com/questions/40416357/spark-sql-difference-between-df-repartition-and-dataframewriter-partitionby/40417992 -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: repartition in df vs partitionBy in df

2019-04-24 Thread rajat kumar
Hello, thanks for the quick reply. Got it: partitionBy is for creating something like Hive partitions. But when do we actually use repartition? How do I decide whether to repartition or not, given that in development we only get sample data? Also, what number should I give when repartitioning? Thanks On
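
For readers following along, a minimal sketch of the distinction (the column k, the partition count, and the output path are placeholders):

    import org.apache.spark.sql.functions.col

    // repartition: shuffles the data into N in-memory partitions; it governs
    // parallelism of subsequent stages and the file count on write.
    val shuffled = df.repartition(200, col("k"))

    // partitionBy: a DataFrameWriter setting that lays the output out in
    // Hive-style directories (k=a/, k=b/, ...) for partition pruning on read.
    shuffled.write.partitionBy("k").parquet("/tmp/out")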

Re: repartition in df vs partitionBy in df

2019-04-24 Thread moqi
Hello, there is another link that discusses the difference between the two methods: https://stackoverflow.com/questions/33831561/pyspark-repartition-vs-partitionby In particular, when you are faced with possible data skew or have some partitioning parameters that need to be obtained at runtime, you

spark stddev() giving '?' as output how to handle it ? i.e replace null/0

2019-04-24 Thread Shyam P
https://stackoverflow.com/questions/55823608/how-to-handle-spark-stddev-function-output-value-when-there-there-is-no-data Regards, Shyam
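
A hedged sketch of one way to map that output to 0 (assuming the '?' is how the client renders the NaN/null that stddev_samp yields for groups with fewer than two rows; a df with columns k and v is a placeholder):

    import org.apache.spark.sql.functions.{coalesce, lit, nanvl, stddev_samp}

    // nanvl handles the NaN case (a single row); coalesce handles null (no rows).
    val agg = df.groupBy("k")
      .agg(coalesce(nanvl(stddev_samp("v"), lit(0.0)), lit(0.0)).alias("sd"))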

Fwd: autoBroadcastJoinThreshold not working as expected

2019-04-24 Thread Mike Chan
Dear all, I'm working on a case where, when a certain table is exposed to a broadcast join, the query eventually fails with a remote block error. First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 10485760. Then we proceeded to perform the query. In the SQL plan, we
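
A quick way to test whether the broadcast itself triggers the failure is to flip the threshold and compare plans; a sketch (df1, df2, and k are placeholders):

    // 10 MB threshold, as in the thread; -1 disables broadcast joins entirely.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")
    df1.join(df2, "k").explain()  // expect BroadcastHashJoin if df2 fits under the threshold

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    df1.join(df2, "k").explain()  // expect SortMergeJoin once broadcasting is off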

DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Hi All, Consider the following (Spark v2.4.0): basically, I change values of `spark.sql.session.timeZone` and perform an ORC write. Here are 3 samples: 1) scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata") scala> val df = sc.parallelize(Seq("2019-04-23
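
A sketch of that experiment for readers who want to reproduce it (the literal and output path are placeholders; note that Spark stores TimestampType as an instant, with spark.sql.session.timeZone affecting parsing and display rather than the stored value, which may explain why the raw bytes look "unadjusted" in orc-tools):

    import spark.implicits._
    import org.apache.spark.sql.functions.to_timestamp

    spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")

    // The string is parsed in the session time zone, then stored as an instant.
    val df = Seq("2019-04-23 09:15:04").toDF("ts_str")
      .select(to_timestamp($"ts_str").alias("ts"))
    df.write.orc("/tmp/orc_tz_test")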

Handle empty partitions in pyspark

2019-04-24 Thread kanchan tewary
Hi All, I have a situation where the RDD has some empty partitions, which I would like to identify and handle while applying mapPartitions or similar functions. Is there a way to do this in PySpark? The isEmpty method works on the whole RDD only and cannot be applied per partition. Much appreciated. Code
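
The check can live inside mapPartitions itself, on the partition's iterator (a Scala sketch; in PySpark the same idea works by peeking at the iterator's first element, e.g. with next() and a StopIteration guard):

    // With 3 elements across 8 slices, most partitions are empty.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 8)
    val out = rdd.mapPartitions { it =>
      // Iterator.isEmpty is just !hasNext, so nothing is consumed here.
      if (it.isEmpty) Iterator.empty   // identify/handle the empty partition
      else it.map(_ * 2)               // normal processing path
    }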

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to the ORC file? On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia wrote: > Hi All, > > Consider the following (Spark v2.4.0): > > Basically I change values of `spark.sql.session.timeZone` and perform an > ORC write. Here are 3 samples: > > 1) >