jayesh2424 commented on issue #10852:
URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991447595
@ad1happy2go Okay,
Maybe my question wasn't clear. What you suggested is a full load of the
entire datalake into a DataFrame, so that
df.createOrReplaceTempView("temp_table") works for further filtering.
In other words, your proposal is an example of loading the datalake and then
running a filter job on it.
What I am asking for is the opposite: filter the datalake first and load only
that portion.
For example, this is how we load the datalake:
datalake_full_load = self.spark.read.format('org.apache.hudi').load(target_path)
Here we can use .select() to pull only particular columns, .filter(), etc.
What I want is something like:
datalake_full_load =
self.spark.read.format('org.apache.hudi').load(target_path).filter("select
date(created) as created, count(*) as datalake_count from datalake group by
date(created)")
My SQL might look odd; please don't take it word for word. I may want to
achieve a similar result, but the exact query is not the point. I am just
describing the window of data I want pulled from the datalake.
Also, regarding the partitioning you mentioned: my datalake is partitioned,
but not by date, and I want to query by date. The created column is a
timestamp value and is therefore not used as a partition key in my datalake.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]