Spark 3.1.3 with Hive dynamic partitions fails while driver moves the staged files

2023-12-11 Thread Shay Elbaz
Hi all, Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert new (time) partitions into existing Hive tables. But we see too many failures coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where the driver moves the successful files from
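For context, the failing pattern can be sketched as follows (table, view, and column names are hypothetical; only the INSERT OVERWRITE / dynamic-partition shape comes from the thread):

```python
def insert_time_partitions(spark, src_view="staging_events", table="mydb.events"):
    # Dynamic-partition overwrite: after the tasks commit their files, the
    # driver moves them into each partition directory (Hive.replaceFiles) --
    # the step where the reported failures occur.
    spark.sql(
        f"INSERT OVERWRITE TABLE {table} PARTITION (dt) "
        f"SELECT id, payload, dt FROM {src_view}"
    )
```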

Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Shay Elbaz
If you have found a parser that works, simply read the data as text files, apply the parser manually, and convert to DataFrame (if needed at all). From: Saurabh Gulati Sent: Wednesday, January 4, 2023 3:45 PM To: Sean Owen Cc: Mich Talebzadeh ; User Subject:
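A minimal sketch of that workaround, assuming Python's csv module stands in for "a parser that works" and the column names are made up:

```python
import csv
import io

def parse_line(line: str) -> list:
    # Python's csv module handles delimiters that appear inside quoted
    # fields -- the case the built-in csv reader tripped on here.
    return next(csv.reader(io.StringIO(line), delimiter=",", quotechar='"'))

# PySpark usage (sketch, assuming a SparkSession `spark`):
#   rows = spark.read.text("data.csv").rdd.map(lambda r: parse_line(r.value))
#   df = rows.toDF(["col_a", "col_b", "col_c"])   # hypothetical columns
```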

Re: How to set a config for a single query?

2023-01-04 Thread Shay Elbaz
Hi Felipe, I had the same problem - needed to execute multiple jobs/actions multithreaded, with slightly different sql configs per job (mainly spark.sql.shuffle.partitions). I'm not sure if this is the best solution, but I ended up using newSession() per thread. It works well except for the
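The newSession() approach can be sketched like this (a sketch, not the exact code from the thread; `work` is whatever action each thread runs):

```python
from threading import Thread

def run_with_conf(spark, shuffle_partitions, work):
    # newSession() shares the SparkContext and cached data with the parent
    # session but gets isolated SQL conf, temp views and UDF registrations,
    # so each thread can set spark.sql.shuffle.partitions independently.
    session = spark.newSession()
    session.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))
    work(session)

# threads = [Thread(target=run_with_conf, args=(spark, n, my_action))
#            for n in (64, 512)]
```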

Re: [EXTERNAL] [SPARK Memory management] Does Spark support setting limits/requests for driver/executor memory ?

2022-12-08 Thread Shay Elbaz
Had the same issue, it seems that it is simply not possible - https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L195 There's also a Jira ticket -

Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-14 Thread Shay Elbaz
stribution package doesn't include Rapids or any GPU integration libs). So you may want to dive into the Rapids instructions for more configuration and usage info (it does provide detailed instructions on how to run Rapids on EMR, Databricks and GCP). On 11/3/22 12:10 PM, Shay Elbaz wrote: Th

Re: [EXTERNAL] Re: Re: Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-06 Thread Shay Elbaz
, 2022 4:19 PM To: Shay Elbaz Cc: Artemis User ; Tom Graves ; Tom Graves ; user@spark.apache.org Subject: [EXTERNAL] Re: Re: Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs ATTENTION: This email originated from outside of GM. May I ask why the ETL job and DL

Re: [EXTERNAL] Re: Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-05 Thread Shay Elbaz
mis User ; user@spark.apache.org ; Shay Elbaz Subject: [EXTERNAL] Re: Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs So I'm not sure I completely follow. Are you asking for a way to change the limit with

Re: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Shay Elbaz
, November 3, 2022 7:56 PM To: Artemis User ; user@spark.apache.org ; Shay Elbaz Subject: [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the number of executors when using GPUs Stage level scheduling does not allow you to change configs

Re: [EXTERNAL] Re: Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Shay Elbaz
). I wish the Spark doc team could provide more details in the next release... On 11/3/22 2:37 AM, Shay Elbaz wrote: Thanks Artemis. We are not using Rapids, but rather using GPUs through the Stage Level Scheduling feature with ResourceProfile. In Kubernetes you have to turn on shuffle tracking

Re: [EXTERNAL] Re: Stage level scheduling - lower the number of executors when using GPUs

2022-11-03 Thread Shay Elbaz
. In that case, you may also need to turn on shuffle tracking. 3. The "stages" are controlled by the APIs. The APIs for dynamic resource request (change of stage) do exist, but only for RDDs (e.g. TaskResourceRequest and ExecutorResourceRequest). On 11/2/22 11:30 AM, Shay Elbaz

Stage level scheduling - lower the number of executors when using GPUs

2022-11-02 Thread Shay Elbaz
Hi, Our typical applications need fewer executors for a GPU stage than for a CPU stage. We are using dynamic allocation with stage level scheduling, and Spark tries to maximize the number of executors also during the GPU stage, causing a bit of resource chaos in the cluster. This forces us to
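The feature discussed in this thread is the stage-level scheduling API; a sketch of how a GPU stage requests its own resources (the values are illustrative, not taken from the thread):

```python
try:
    from pyspark.resource import (ExecutorResourceRequests,
                                  ResourceProfileBuilder,
                                  TaskResourceRequests)
    HAVE_PYSPARK = True
except ImportError:          # keep the sketch importable without Spark
    HAVE_PYSPARK = False

def build_gpu_profile(cores=4, gpus=1):
    # One GPU per task and per executor; dynamic allocation then sizes the
    # GPU stage by this profile rather than the default CPU profile.
    if not HAVE_PYSPARK:
        return None
    ereq = ExecutorResourceRequests().cores(cores).resource("gpu", gpus)
    treq = TaskResourceRequests().cpus(1).resource("gpu", 1)
    return ResourceProfileBuilder().require(ereq).require(treq).build

# Usage (sketch): gpu_rdd = cpu_df.rdd.withResources(build_gpu_profile())
```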

PySpark schema sanitization

2022-08-14 Thread Shay Elbaz
Hi, I have a simple ETL application, where the data source schema needs to be sanitized. Column names might include special characters that need to be removed. For example, from "some{column}" to "some_column". Normally I'd just alias the columns, but in this case the schema can have thousands
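One way to sketch the sanitization (the replacement rule is an assumption generalized from the "some{column}" example):

```python
import re

def sanitize(name: str) -> str:
    # Replace every character that is not alphanumeric or underscore with
    # "_", collapse runs, and trim, e.g. "some{column}" -> "some_column".
    return re.sub(r"_+", "_", re.sub(r"[^0-9A-Za-z_]", "_", name)).strip("_")

# PySpark usage (sketch, assuming a DataFrame `df`; toDF takes the full
# list of new names, so renaming thousands of top-level columns is one call):
#   df = df.toDF(*[sanitize(c) for c in df.columns])
```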

Re: [EXTERNAL] Partial data with ADLS Gen2

2022-07-24 Thread Shay Elbaz
This is a known issue. Apache Iceberg, Hudi and Delta Lake are among the possible solutions. Alternatively, instead of writing the output directly to the "official" location, write it to some staging directory instead. Once the job is done, rename the staging dir to the official location.
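The staging-then-rename alternative can be sketched like this (paths are hypothetical; it reaches the Hadoop FileSystem API through py4j and assumes rename is cheap, as on ADLS Gen2 with a hierarchical namespace):

```python
def publish_via_staging(spark, df, staging_dir, final_dir):
    # Write to a staging directory first; rename it to the final location
    # only after the job succeeds, so readers never observe partial data.
    df.write.mode("overwrite").parquet(staging_dir)
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = Path(final_dir).getFileSystem(spark._jsc.hadoopConfiguration())
    fs.delete(Path(final_dir), True)          # drop the old output, if any
    fs.rename(Path(staging_dir), Path(final_dir))
```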

Re: spark.executor.pyspark.memory not added to the executor resource request on Kubernetes

2022-07-19 Thread Shay Elbaz
... using spark 3.2.1 From: Shay Elbaz Sent: Tuesday, July 19, 2022 1:26 PM To: user@spark.apache.org Cc: Jeffrey O'Donoghue Subject: [EXTERNAL] spark.executor.pyspark.memory not added to the executor resource request on Kubernetes

spark.executor.pyspark.memory not added to the executor resource request on Kubernetes

2022-07-19 Thread Shay Elbaz
Hi, We are trying to tune executor memory on Kubernetes. Specifically, 8g for the jvm, 8g for the python process, and an additional 500m overhead: --conf spark.executor.memory=8g --conf spark.executor.pyspark.memory=8g --conf spark.executor.memoryOverhead=500m According to the docs, the executor pods

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Shay Elbaz
Spark can reuse shuffle stages in the same job (action), not cross jobs. From: Koert Kuipers Sent: Saturday, July 16, 2022 6:43 PM To: user Subject: [EXTERNAL] spark re-use shuffle files not happening ATTENTION: This email originated from outside of GM. i

Re: [EXTERNAL] RDD.pipe() for binary data

2022-07-10 Thread Shay Elbaz
Yuhao, You can use pyspark as entrypoint to your application. With py4j you can call Java/Scala functions from the python application. There's no need to use the pipe() function for that. Shay From: Yuhao Zhang Sent: Saturday, July 9, 2022 4:13:42 AM To:
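A sketch of the py4j route (the Scala class and method here are hypothetical; anything on the driver classpath, e.g. shipped via --jars, can be reached this way):

```python
def call_scala(spark, payload):
    # spark._jvm is the py4j gateway into the driver JVM: attribute access
    # walks the Java package tree, and the final call crosses into
    # Scala/Java -- no need to shell out via RDD.pipe().
    return spark._jvm.com.example.MyScalaTool.process(payload)
```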

How to update TaskMetrics from Python?

2022-06-16 Thread Shay Elbaz
Hi All, I have some data output source which can only be written to by a specific Python API. For that I am (ab)using foreachPartition(writing_func) from PySpark, which works pretty well. I wonder if it's possible to somehow update the task metrics - specifically setBytesWritten - at the end
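The foreachPartition pattern mentioned above, sketched with a hypothetical Python-only sink (`SomeSink` stands in for the specific API; this does not touch TaskMetrics, only returns a count the caller could log instead):

```python
def write_partition(rows, sink):
    # Runs once per partition on an executor; returns the number of rows
    # written so the caller can at least log what setBytesWritten would
    # have recorded.
    written = 0
    for row in rows:
        sink.write(row)
        written += 1
    return written

# PySpark usage (sketch):
#   df.foreachPartition(lambda rows: write_partition(rows, SomeSink()))
```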

RE: [EXTERNAL] Re: Spark on K8s - repeating annoying exception

2022-05-15 Thread Shay Elbaz
Hi Martin, Thanks for the help :) I tried to set those keys to high value but the error persists every 90 seconds. Shay From: Martin Grigorov Sent: Friday, May 13, 2022 4:15 PM To: Shay Elbaz Cc: user@spark.apache.org Subject: [EXTERNAL] Re: Spark on K8s - repeating annoying exception

Spark on K8s - repeating annoying exception

2022-05-09 Thread Shay Elbaz
Hi all, I apologize for reposting this from Stack Overflow, but it got very little attention and no comments. I'm using a Spark 3.2.1 image that was built from the official distribution via `docker-image-tool.sh`, on a Kubernetes 1.18 cluster. Everything works fine, except for this error message

RE: [EXTERNAL] Parse Execution Plan from PySpark

2022-05-03 Thread Shay Elbaz
Hi Pablo, As you probably know, Spark SQL generates custom Java code for the SQL functions. You can use geometry.debugCodegen() to print out the generated code. Shay From: Pablo Alcain Sent: Tuesday, May 3, 2022 6:07 AM To: user@spark.apache.org Subject: [EXTERNAL] Parse Execution Plan from
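In PySpark (3.0+) the generated code can also be printed with DataFrame.explain; a small sketch:

```python
def show_codegen(df):
    # Prints the physical plan plus the Java code produced by whole-stage
    # codegen -- the generated code the reply above refers to.
    df.explain(mode="codegen")

# Usage (sketch): show_codegen(spark.range(10).selectExpr("id * 2 AS d"))
```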

This is a blog post explaining how to use a new Spark library, datafu-spark

2021-07-21 Thread Shay Elbaz
https://medium.com/paypal-tech/introducing-datafu-spark-ba67faf1933a Introducing DataFu-Spark. DataFu-Spark is a new addition to… | by Eyal Allweil | Technology at PayPal