Hi,
We have created a table with a partition depth of 2 (year/month). We need to
read data from Hudi in a Spark Streaming layer, where each micro-batch gives us,
say, 10 rows whose values we then use to look up data in Hudi. We are reading it
like this:

// Read from HUDI
Dataset<Row> df = spark.read()
        .format("hudi")
        .schema(schema)
        .load(<base_path> + <table_name> + "/*/*");

// Apply filters (note: the original snippet was missing a closing parenthesis)
df = df.filter(df.col("year").isin(<vals>))
       .filter(df.col("month").isin(<vals>))
       .filter(df.col("id").isin(<vals>));

Is this the best way to read the data? Will Hudi take care of reading only the
relevant partitions, or do we need to handle that ourselves? For example, if we
need to read just one row, we can build the full partition path and read the
Parquet file from that partition directly, which is fast; but here our
requirement is to read data from multiple partitions.
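For reference, the single-partition variant described above would look roughly
like this (a sketch only; basePath, tableName, the 2024/01 partition values, and
idVals are placeholder names, not from our actual code):

```java
// Sketch: read one known partition directly by building its path,
// so Spark only lists and scans files under that year/month directory.
// basePath, tableName, and idVals are hypothetical placeholders.
String partitionPath = basePath + tableName + "/2024/01";  // year=2024, month=01

Dataset<Row> one = spark.read()
        .format("hudi")
        .schema(schema)
        .load(partitionPath);

// Only the record-level filter remains; the partition filter is
// implicit in the path itself.
one = one.filter(one.col("id").isin(idVals));
```

This avoids listing the whole table, but as noted it only covers the
single-partition case, whereas our batches span multiple partitions.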
