jayesh2424 commented on issue #10852:
URL: https://github.com/apache/hudi/issues/10852#issuecomment-1991447595
@ad1happy2go Okay,
Maybe my question wasn't clear. What you suggested is a full load of the
entire datalake into a DataFrame, so that
df.createOrReplaceTempView("temp_table") works for further filtering.
In other words, your proposal is an example of loading the datalake and then
running a filter job on it.
What I am asking for is the opposite: filter the datalake first and load only
that portion.
For example, this is how we load the datalake:
datalake_full_load = self.spark.read.format('org.apache.hudi').load(target_path)
Here we can use .select() to pull only particular columns, .filter(), etc.
What I want is something like:
datalake_full_load =
self.spark.read.format('org.apache.hudi').load(target_path).filter("select
date(created) as created, count(*) as datalake_count from datalake group by
date(created)")
My SQL might look odd; please don't take it word for word. I may want to
achieve a similar result, but the exact query is not the point. I am just
describing the window of data I want pulled from the datalake.
Also, regarding the partitioning you mentioned: my datalake is partitioned,
but not by date, and I want to query by date. The created column is a
timestamp value and is therefore not used as a partition key in my datalake.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]