Team,

I have a question about keeping Hive in sync. A shared Hadoop
environment prevents me from using Hudi 0.5.1 or higher, so I ended
up using 0.5.0. My Hadoop cluster currently runs Hive 1.2.x, which
does not support Hudi's Hive sync.

So I am not using the Hive sync feature. Instead, I read the data as below.


sparkSession
  .read
  .format("org.apache.hudi")
  .load("/projects/cdp/data/base/request_application/*/*")
  .createOrReplaceTempView("base_request_application")


I am going to store 3 years' worth of data, partitioned by day/hour. Once
3 years of data are loaded, I will have 3 * 365 * 24 = 26,280 directories.
With the above approach, I see that all the directory names are indexed on
every read. Would this impact performance when joining with other tables
if I don't use Hive-style partition pruning?
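
To make the question concrete, one workaround I can think of is to narrow
the glob to only the days a query needs, so that fewer directories are
listed. This is a minimal sketch, not something I have verified against
Hudi 0.5.0. It assumes the day directories are named yyyy-MM-dd directly
under the base path (the real layout may differ), and dayGlob is a
hypothetical helper name.

```scala
import java.time.LocalDate

// Hypothetical helper: builds a single glob covering every day in
// [start, end], inclusive, using Hadoop's {a,b,c} alternation syntax.
// Assumes a <base>/<yyyy-MM-dd>/<hour> layout, which is a guess.
def dayGlob(base: String, start: LocalDate, end: LocalDate): String = {
  val days = Iterator
    .iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(end))
    .toList
  // LocalDate.toString yields ISO dates such as 2020-01-01.
  val joined = days.mkString(",")
  s"$base/{$joined}/*"
}

val path = dayGlob(
  "/projects/cdp/data/base/request_application",
  LocalDate.of(2020, 1, 1),
  LocalDate.of(2020, 1, 3))

// With a SparkSession in scope, the glob would replace the "/*/*" path:
// sparkSession.read.format("org.apache.hudi").load(path)
```

That would avoid touching all 26,280 directories for a query over a few
days, but it moves the pruning into my own code, which is what I was
hoping to avoid.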

Thanks,
Selva
