Spark 3.0 upgrade: incompatible table error

2021-01-28 Thread lsn24
Hi, the Spark job where we save a dataset to a given table using exampledataset.saveAsTable() fails with the following error after upgrading from Spark 2.4 to Spark 3.0: Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table '': - Cannot write
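
A minimal sketch of one workaround, assuming the error comes from Spark 3.0's stricter write-time type checks against an existing table (controlled by spark.sql.storeAssignmentPolicy): cast each column of the dataset to the type the target table already declares before calling saveAsTable(). The table names, source dataset, and append mode below are hypothetical, not taken from the original job.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructField;
import static org.apache.spark.sql.functions.col;

public class AlignSchemaBeforeSave {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("align-schema").getOrCreate();

        // Hypothetical source; stands in for exampledataset from the original job.
        Dataset<Row> exampleDataset = spark.table("staging_source");

        // Cast every column to the type the existing target table declares, so
        // the stricter Spark 3.0 write-time checks no longer see a mismatch.
        Dataset<Row> aligned = exampleDataset;
        for (StructField field : spark.table("target_table").schema().fields()) {
            aligned = aligned.withColumn(field.name(), col(field.name()).cast(field.dataType()));
        }

        aligned.write().mode("append").saveAsTable("target_table");
        spark.stop();
    }
}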

Data Explosion and repartition before group bys

2020-06-26 Thread lsn24
Hi, We have a use case where one record needs to be included in two different aggregations. Say, for example, a credit card transaction "A" that belongs to both the ATM and cross-border transaction categories. If I need the count of ATM transactions, I need to consider transaction A. For the count of
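
One common pattern for this kind of double counting is to explode the record once per category it belongs to and then run a single group-by; a sketch follows, with the flag and category column names (is_atm, is_cross_border, category) and the table name made up rather than taken from the original job.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.expr;

public class ExplodeCategories {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("explode-categories").getOrCreate();

        // Hypothetical transactions table with one boolean flag per category.
        Dataset<Row> txns = spark.table("transactions");

        // Build the list of categories each transaction belongs to, drop the
        // nulls, and explode, so transaction "A" appears once per category.
        Dataset<Row> exploded = txns.withColumn("category", expr(
            "explode(filter(array(" +
            "  IF(is_atm, 'ATM', NULL)," +
            "  IF(is_cross_border, 'CROSS_BORDER', NULL)), x -> x IS NOT NULL))"));

        // A single group-by now counts "A" in both the ATM and cross-border buckets.
        exploded.groupBy("category").count().show();
        spark.stop();
    }
}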

SortMerge Join on partitioned column causes shuffle

2019-03-26 Thread lsn24
Hello, We have two datasets that have been persisted as follows: Dataset A: datasetA.repartition(5, datasetA.col("region")).write().mode(saveMode).format("parquet").partitionBy("region").bucketBy(5, "studentId")
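
For reference, a sketch of how the two sides could be persisted so a later sort-merge join can skip the shuffle, assuming the join is on the bucketing column (studentId); the table names, sources, and bucket count are illustrative only. The key points are that bucket metadata is only recorded by saveAsTable(), that both sides need the same bucket count, and that the join has to be on the bucketed column rather than only the directory-partitioned one.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BucketedJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("bucketed-join")
            .enableHiveSupport()
            .getOrCreate();

        Dataset<Row> datasetA = spark.table("raw_a");   // hypothetical sources
        Dataset<Row> datasetB = spark.table("raw_b");

        // Persist both sides bucketed (and sorted) on the join key, with the
        // same bucket count, as tables -- bucket metadata is only kept for
        // saveAsTable(), not for plain path-based parquet writes.
        datasetA.write().mode("overwrite").format("parquet")
            .partitionBy("region")
            .bucketBy(5, "studentId")
            .sortBy("studentId")
            .saveAsTable("students_a");

        datasetB.write().mode("overwrite").format("parquet")
            .partitionBy("region")
            .bucketBy(5, "studentId")
            .sortBy("studentId")
            .saveAsTable("students_b");

        // Joining the bucketed tables on the bucketing column lets the
        // sort-merge join skip the shuffle (check with explain()).
        Dataset<Row> joined = spark.table("students_a")
            .join(spark.table("students_b"), "studentId");
        joined.explain();
        spark.stop();
    }
}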

Spark SQL group by less performant

2018-12-10 Thread lsn24
Hello, I have a requirement where I need to get the total count of rows and the total count of failedRows based on a grouping. The code looks like below: myDataset.createOrReplaceTempView("temp_view"); Dataset countDataset = sparkSession.sql("Select
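
A sketch of computing both counts in one pass with conditional aggregation instead of two separate group-bys; the grouping column, status column, and source table below are invented for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GroupedCounts {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession.builder().appName("grouped-counts").getOrCreate();

        Dataset<Row> myDataset = sparkSession.table("events");   // hypothetical source
        myDataset.createOrReplaceTempView("temp_view");

        // One pass over the data: total rows plus failed rows per group.
        Dataset<Row> countDataset = sparkSession.sql(
            "SELECT groupCol, " +
            "       COUNT(*) AS totalRows, " +
            "       SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failedRows " +
            "FROM temp_view " +
            "GROUP BY groupCol");

        countDataset.show();
        sparkSession.stop();
    }
}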

Re: Filter one dataset based on values from another

2018-05-01 Thread lsn24
I don't think an inner join will solve my problem. *For each row in* paramsDataset, I need to filter myDataset. And then I need to run a bunch of calculations on the filtered myDataset. Say, for example, paramsDataset has three employee age ranges, e.g. 20-30, 30-50, 50-60, and regions USA, Canada.
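
A sketch of that per-parameter loop, assuming paramsDataset is small enough to collect to the driver and has hypothetical minAge, maxAge, and region columns; the count at the end is only a placeholder for the real calculations.

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class PerParamAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("per-param-agg").getOrCreate();

        Dataset<Row> paramsDataset = spark.table("params");   // hypothetical: minAge, maxAge, region
        Dataset<Row> myDataset = spark.table("employees");    // hypothetical: age, region, ...

        // The parameter set is small, so collect it to the driver and loop;
        // each iteration filters the big dataset and runs its own aggregation.
        List<Row> params = paramsDataset.collectAsList();
        for (Row p : params) {
            int minAge = p.getInt(p.fieldIndex("minAge"));
            int maxAge = p.getInt(p.fieldIndex("maxAge"));
            String region = p.getString(p.fieldIndex("region"));

            Dataset<Row> slice = myDataset
                .filter(col("age").between(minAge, maxAge))
                .filter(col("region").equalTo(region));

            // Placeholder for the "bunch of calculations" on the filtered slice.
            long rows = slice.count();
            System.out.println(region + " " + minAge + "-" + maxAge + ": " + rows);
        }
        spark.stop();
    }
}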

Filter one dataset based on values from another

2018-04-30 Thread lsn24
Hi, I have one dataset with parameters and another with data that needs to be filtered based on the first (parameter) dataset. *Scenario is as follows:* For each row in the parameter dataset, I need to apply the parameter row to the second dataset. I will end up having multiple
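
A set-based alternative sketch to looping over parameter rows: a join on range and equality conditions tags every data row with each parameter row it satisfies, so a single group-by on the parameter columns covers all the ranges at once. The column and table names are assumptions, not taken from the thread.

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParamJoinAggregation {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("param-join-agg").getOrCreate();

        Dataset<Row> params = spark.table("params");      // hypothetical: minAge, maxAge, region
        Dataset<Row> data = spark.table("employees");     // hypothetical: age, region, ...

        // Non-equi join: every data row is matched to each parameter row whose
        // age range and region it satisfies, so one aggregation covers all ranges.
        Column cond = data.col("region").equalTo(params.col("region"))
            .and(data.col("age").between(params.col("minAge"), params.col("maxAge")));

        Dataset<Row> tagged = data.join(params, cond);

        tagged.groupBy(params.col("region"), params.col("minAge"), params.col("maxAge"))
              .count()
              .show();
        spark.stop();
    }
}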

Re: Spark parse fixed length file [Java]

2018-04-19 Thread lsn24
Thanks for the response, JayeshLalwani. Clearly, in my case the issue was with my approach, not with memory. The job was taking a much longer time even for a smaller dataset. Thanks again!

Re: Spark parse fixed length file [Java]

2018-04-19 Thread lsn24
I was able to solve it by writing a Java method (to slice and dice the data) and invoking that method from map() on the dataset. This transformed the data way faster than my previous approach. Thanks, geoHeil, for the pointer.
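
A sketch of that map-based approach for anyone landing here: a plain Java helper slices each fixed-length line by substring offsets and is invoked from Dataset.map(). The field widths, column names, and file path are made-up examples, not the ones from the thread.

import scala.Tuple3;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FixedWidthMap {
    // Plain Java "slice and dice" helper: cut one fixed-length line into fields.
    // The offsets (0-10, 10-30, 30-38) are made-up example widths.
    static Tuple3<String, String, String> slice(String line) {
        return new Tuple3<>(
            line.substring(0, 10).trim(),    // id
            line.substring(10, 30).trim(),   // name
            line.substring(30, 38).trim());  // date
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("fixed-width-map").getOrCreate();

        Dataset<String> raw = spark.read().textFile("hdfs:///data/records.bz2");

        // Invoke the helper from map(); each line becomes one structured record.
        Dataset<Row> parsed = raw
            .map((MapFunction<String, Tuple3<String, String, String>>) FixedWidthMap::slice,
                 Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.STRING()))
            .toDF("id", "name", "date");

        parsed.show();
        spark.stop();
    }
}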

Spark parse fixed length file [Java]

2018-04-13 Thread lsn24
Hello, We are running into issues while trying to process fixed-length files using Spark. The approach we took is as follows: 1. Read the .bz2 file into a dataset from HDFS using the spark.read().textFile() API. Create a temporary view. Dataset rawDataset =
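
A sketch of that textFile-plus-temp-view approach, with a made-up path and made-up field offsets; the replies above note that a plain Java helper invoked from map() turned out much faster than the original approach.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FixedWidthSql {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("fixed-width-sql").getOrCreate();

        // 1. Read the .bz2 file from HDFS as plain lines (path is hypothetical).
        Dataset<String> rawDataset = spark.read().textFile("hdfs:///data/records.bz2");

        // 2. Register a temporary view over the single "value" column.
        rawDataset.createOrReplaceTempView("raw_lines");

        // 3. Slice each fixed-length line with substr(); offsets are made up.
        Dataset<Row> parsed = spark.sql(
            "SELECT trim(substr(value, 1, 10))  AS id, " +
            "       trim(substr(value, 11, 20)) AS name, " +
            "       trim(substr(value, 31, 8))  AS date " +
            "FROM raw_lines");

        parsed.show();
        spark.stop();
    }
}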