pyspark loop optimization

2022-01-10 Thread Ramesh Natarajan
I want to compute cume_dist on a bunch of columns in a Spark DataFrame, but want to remove NULL values before doing so. I have this loop in PySpark. While it works, I see the driver running at 100% while the executors are mostly idle. I am reading that running a loop is an anti-pattern
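The loop itself is not quoted in the digest, so the sketch below only illustrates the usual fix: build one Column expression per column in Python and apply them all in a single select, so the driver never triggers a separate job per column. The input path and the column names col_a/col_b/col_c are assumptions, and the NULL handling (counting only non-NULL values inside the window) is one possible reading of "remove NULL values before doing so".

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/input")      # placeholder input

    cols = ["col_a", "col_b", "col_c"]             # assumed column names

    exprs = []
    for c in cols:
        # With an orderBy and no explicit frame, the window defaults to
        # RANGE UNBOUNDED PRECEDING .. CURRENT ROW, so count(c) counts the
        # non-NULL values <= the current value, ties included - i.e. a
        # cume_dist computed over the non-NULL rows only.
        w = Window.orderBy(c)
        non_null_le = F.count(c).over(w)
        non_null_total = F.count(c).over(Window.partitionBy(F.lit(1)))  # whole-DataFrame count
        exprs.append(
            F.when(F.col(c).isNotNull(), non_null_le / non_null_total)
             .alias(c + "_cume_dist")
        )

    # One wide select: the Python loop only assembles expressions on the
    # driver; Spark evaluates everything as a single distributed job.
    result = df.select("*", *exprs)

With this shape the for loop only builds expressions, and the actual distribution computation runs on the executors in one job, which is usually what keeps the driver from sitting at 100% while the cluster idles.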

Re: How to add a row number column without reordering my data frame

2022-01-10 Thread Gourav Sengupta
Hi, I am a bit confused here: it is not entirely clear to me why you are creating the row numbers, and how creating the row numbers helps you with the joins. Can you please explain with some sample data? Regards, Gourav On Fri, Jan 7, 2022 at 1:14 AM Andrew Davidson wrote: > Hi > > > > I am
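For the subject-line question itself (adding a row number without changing the existing order), a minimal sketch of one common pattern is below; the input path and column names are placeholders, since the original DataFrame is not shown in the digest.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/input")      # placeholder input

    # monotonically_increasing_id() is not consecutive, but it grows with
    # the current physical row order, so it can be captured without
    # sorting the data first.
    with_id = df.withColumn("_mono_id", F.monotonically_increasing_id())

    # row_number() over that id turns the sparse ids into consecutive
    # numbers. Note the unpartitioned window funnels all rows through a
    # single task.
    w = Window.orderBy("_mono_id")
    numbered = with_id.withColumn("row_num", F.row_number().over(w)).drop("_mono_id")

The single-task shuffle behind that global window is one reason the question above matters: if the row number is only meant to be a surrogate join key, sample data would show whether an existing key could be used instead.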

Re: hive table with large column data size

2022-01-10 Thread Gourav Sengupta
Hi, As always, before answering the question, can I please ask what you are trying to achieve by storing the data in a table? How are you planning to query binary data? If you look at any relational theory, it states that a table is a relation/entity and the fields are its attributes. You

Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-10 Thread Pralabh Kumar
Hi Spark Team, When creating a database via Spark 3.0 on Hive: 1) spark.sql("create database test location '/user/hive'") creates the database location on HDFS, as expected. 2) When running the same command on 3.1, the database is created on the local file system by default. I have to prefix
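The rest of the message is truncated here, so purely as a hedged illustration: a common way to pin the database to HDFS regardless of the default filesystem is to give the LOCATION clause a fully qualified URI. The namenode host and port below are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Fully qualified URI so the location does not fall back to the
    # default (local) filesystem; namenode host/port is a placeholder.
    spark.sql("CREATE DATABASE test LOCATION 'hdfs://namenode:8020/user/hive'")

    # Check where the database actually landed.
    spark.sql("DESCRIBE DATABASE test").show(truncate=False)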