Fwd: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

2020-04-05 Thread jane thorpe
Hi Som, did you know that the simple demo program for reading characters from a file didn't work? Who wrote that simple hello-world-style little program? jane thorpe janethor...@aol.com -Original Message- From: jane thorpe To: somplasticllc ; user Sent: Fri, 3 Apr 2020 2:44 Subject: Re:
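For reference, a minimal sketch of reading that file with PySpark. The path comes from the thread subject; the Spark driver wiring is left in comments so the core stays importable without a cluster, and the app name is hypothetical:

```python
# Path from the thread subject; assumes an HDFS namenode on localhost:9000.
HDFS_PATH = "hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt"

def read_text_lines(spark, path=HDFS_PATH):
    """Read a text file into a DataFrame with one row per line (column 'value')."""
    return spark.read.text(path)

# Driver wiring (requires pyspark and a running HDFS):
#
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("hdfs-read-demo").getOrCreate()
# read_text_lines(spark).show(truncate=False)
```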

Re: Spark, read from Kafka stream failing AnalysisException

2020-04-05 Thread Tathagata Das
Have you looked at the suggestion in the error message, i.e. searching for "Structured Streaming + Kafka Integration Guide" in Google? It should be the first result. The last section in the "Structured Streaming
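The usual cause of this AnalysisException is that the Kafka source is not on the classpath; the guide's deployment section says to launch with the matching package (for Spark 2.4.5 on Scala 2.11 that is `org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5`). A minimal reader sketch, with a hypothetical topic name and the Spark wiring kept free of top-level pyspark imports:

```python
# Assumes Spark launched with the Kafka source package, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 app.py
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",  # from the question in this thread
    "subscribe": "events",                        # hypothetical topic name
    "startingOffsets": "earliest",
}

def build_kafka_stream(spark):
    """Return a streaming DataFrame of (key, value) strings from Kafka."""
    reader = spark.readStream.format("kafka")
    for key, value in KAFKA_OPTIONS.items():
        reader = reader.option(key, value)
    # Kafka delivers key/value as binary; cast to strings for downstream use.
    return reader.load().selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```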

Re: spark-submit exit status on k8s

2020-04-05 Thread Yinan Li
Not sure if you are aware of this new feature in Airflow https://issues.apache.org/jira/browse/AIRFLOW-6542. It's a way to use Airflow to orchestrate spark applications run using the Spark K8S operator ( https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). On Sun, Apr 5, 2020 at 8:25 AM

Re: pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Thanks Silvio. I need a grouped map pandas UDF which takes a Spark data frame as input and outputs a Spark data frame with a different shape from the input. Grouped map is kind of unique to pandas udf and I have trouble finding a similar non-pandas udf for an apples-to-apples comparison. Let me kn
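A grouped-map transform of the kind described can be prototyped as a plain pandas function first (Spark applies it once per group when wrapped as a GROUPED_MAP pandas UDF); the Spark wiring is shown in comments so the core stays testable without a cluster. Column names here are hypothetical:

```python
import pandas as pd

def summarize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Collapse each group to one row whose shape differs from the input."""
    return pd.DataFrame({
        "key": [pdf["key"].iloc[0]],
        "n_rows": [len(pdf)],
        "v_mean": [pdf["v"].mean()],
    })

# Spark wiring (requires pyspark 2.4+; GROUPED_MAP needs the full output schema):
#
# from pyspark.sql.functions import pandas_udf, PandasUDFType
# udf = pandas_udf(summarize_group, "key string, n_rows long, v_mean double",
#                  PandasUDFType.GROUPED_MAP)
# result = df.groupBy("key").apply(udf)
```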

Re: spark-submit exit status on k8s

2020-04-05 Thread Masood Krohy
Another, simpler solution that I just thought of: just add an operation at the end of your Spark program to write an empty file somewhere, with filename SUCCESS for example. Add a stage to your Airflow graph to check for the existence of this file after running spark-submit. If the file is absent,
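The marker-file idea above can be sketched in a few lines. This is a minimal local-filesystem version; on a real cluster the marker would typically go to HDFS or S3 instead, and the filename is just the suggestion from the message:

```python
import os

def write_success_marker(directory: str, name: str = "SUCCESS") -> str:
    """Spark-side: write an empty marker file at the end of a successful run."""
    path = os.path.join(directory, name)
    with open(path, "w"):
        pass  # empty file; its existence is the signal
    return path

def run_succeeded(directory: str, name: str = "SUCCESS") -> bool:
    """Airflow-side: did the Spark job leave its marker?"""
    return os.path.exists(os.path.join(directory, name))
```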

Spark, read from Kafka stream failing AnalysisException

2020-04-05 Thread Sumit Agrawal
Hello, I am using Spark 2.4.5 and Kafka 2.3.1 on my local machine. I am able to produce and consume messages on Kafka with the bootstrap server config "localhost:9092". While trying to set up a reader with the Spark streaming API, I am getting an error. Exception message: Py4JJavaError: An error occurred

Re: pandas_udf is very slow

2020-04-05 Thread Silvio Fiorito
Your two examples are doing different things. The Pandas UDF is doing a grouped map, whereas your Python UDF is doing an aggregate. I think you want your Pandas UDF to be PandasUDFType.GROUPED_AGG? Is your result the same? From: Lian Jiang Date: Sunday, April 5, 2020 at 3:28 AM To: user Subjec
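For contrast with a grouped map, a grouped aggregate reduces each group to a single scalar. The core can be a plain pandas Series-to-scalar function, which keeps it testable without Spark; names here are hypothetical:

```python
import pandas as pd

def mean_agg(v: pd.Series) -> float:
    """Scalar aggregate: one value per group, comparable to a Python-UDF aggregate."""
    return float(v.mean())

# Spark wiring (requires pyspark 2.4+):
#
# from pyspark.sql.functions import pandas_udf, PandasUDFType
# mean_udf = pandas_udf(mean_agg, "double", PandasUDFType.GROUPED_AGG)
# df.groupBy("key").agg(mean_udf(df["v"]))
```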

pandas_udf is very slow

2020-04-05 Thread Lian Jiang
Hi, I am using a pandas udf in pyspark 2.4.3 on EMR 5.21.0. A pandas udf is favored over a non-pandas udf per https://www.twosigma.com/wp-content/uploads/Jin_-_Improving_Python__Spark_Performance_-_Spark_Summit_West.pdf. My data has about 250M records and the pandas udf code is like: def pd_udf_func(