Hi,
If you use a year=xxx/month=xxx folder structure, you can use

Dataset<Row> df = spark.read().format("hudi").schema(schema).load(<base_path> + <table_name>);

Without a glob suffix, Spark can automatically discover the partition
information, just as it does for regular parquet files:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery
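For example, here is a minimal sketch of relying on partition discovery and
then filtering on the discovered columns (basePath, tableName, and the
year/month values are placeholders, and the usual Spark SQL imports are
assumed):

    // Load the whole table; Spark discovers the "year" and "month"
    // partition columns from the year=YYYY/month=MM folder names.
    Dataset<Row> df = spark.read()
        .format("hudi")
        .schema(schema)
        .load(basePath + tableName);

    // Filters on the discovered partition columns let Spark prune
    // partitions instead of scanning every folder.
    Dataset<Row> pruned = df
        .filter(df.col("year").isin("2020"))
        .filter(df.col("month").isin("05", "06"));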
If you use a non-Hive layout like 2020/06, you may need to build the glob
string and pass it to load() to skip the unnecessary partitions, e.g.

.load(<base_path> + <table_name> + "/2020/{05,06}")
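For instance, a rough sketch of building that glob from whatever month
values arrive in the streaming batch (the months list is a placeholder, and
java.util.Arrays/List are assumed to be imported):

    // Build a "/2020/{05,06}"-style glob from the batch's month values.
    List<String> months = Arrays.asList("05", "06"); // placeholder values
    String glob = "/2020/{" + String.join(",", months) + "}";
    Dataset<Row> df = spark.read()
        .format("hudi")
        .schema(schema)
        .load(basePath + tableName + glob);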
Alternatively, list one parquet file from each partition and use a map
function to load one row per path with a limit clause.
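Something along these lines (partitionPaths is a placeholder for the paths
you listed beforehand; union stitches the single-row reads back together):

    // Read one row from each partition path and union the results.
    List<String> partitionPaths = Arrays.asList(
        basePath + tableName + "/2020/05",
        basePath + tableName + "/2020/06"); // placeholder paths

    Dataset<Row> oneRowEach = partitionPaths.stream()
        .map(p -> spark.read().format("hudi").schema(schema).load(p).limit(1))
        .reduce(Dataset::union)
        .get(); // assumes at least one path in the list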
On Fri, Jun 26, 2020 at 8:33 AM Tanuj <[email protected]> wrote:
> Hi,
> We have created a table with a partition depth of 2 (year/month). We need
> to read data from Hudi in a Spark Streaming layer, where we get batch data
> of, say, 10 rows which we then need to use to read from Hudi. We are
> reading it like this -
>
> // Read from HUDI
> Dataset<Row> df = spark.read().format("hudi")
>     .schema(schema).load(<base_path> + <table_name> + "/*/*");
>
> //Apply filter
>
> df = df.filter(df.col("year").isin(<vals>))
>        .filter(df.col("month").isin(<vals>))
>        .filter(df.col("id").isin(<vals>));
>
> Is this the best way to read the data? Will Hudi take care of reading
> only from the relevant partitions, or do we need to handle that ourselves?
> For example, if I need to read just one row, we can build the full path
> and read the parquet file from that partition quickly, but here our
> requirement is to read data from multiple partitions.