yihua commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1113827711
@kartik18 To clarify, is the base path of the Hudi table `s3://bucket/folder/`
or `s3://bucket/folder/cluster=abc/`? In the former case, you can directly
use `spark.read.format("org.apache.hudi").load("s3://bucket/folder")` to read
the whole table. If `s3://bucket/folder/cluster=abc/` and
`s3://bucket/folder/cluster=efg/` are two separate Hudi tables, the behavior of
reading them both in a single `spark.read.format("org.apache.hudi")` statement is
undefined, since there is usually no such use case. You can still run multiple
reads and union the resulting DataFrames into one:
```python
df1 = spark.read.format("org.apache.hudi").load("s3://bucket/folder/cluster=abc")
df2 = spark.read.format("org.apache.hudi").load("s3://bucket/folder/cluster=efg")
df3 = df1.union(df2)
```
Note that the glob-pattern approach of finding all parquet files in a Hudi
table and loading them directly is deprecated; you can simply use the Hudi
table base path to read all the data. Hudi manages the layout and versions of
the parquet files it writes, so manually globbing a subset of parquet files
may lead to unexpected behavior, such as reading stale or duplicate file
versions.