yihua commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1113827711
@kartik18 To clarify, is the base path of the Hudi table `s3://bucket/folder/`
or `s3://bucket/folder/cluster=abc/`? In the former case, you can directly
use `spark.read.format("org.apache.hudi").load("s3://bucket/folder")` to read
the whole table. If `s3://bucket/folder/cluster=abc/` and
`s3://bucket/folder/cluster=efg/` are two separate Hudi tables, the behavior of
reading them both in a single `spark.read.format("org.apache.hudi")` statement is
undefined, since there is usually no such use case. You can still run multiple
reads and union the resulting DataFrames into one:
```python
df1 = spark.read.format("org.apache.hudi").load("s3://bucket/folder/cluster=abc")
df2 = spark.read.format("org.apache.hudi").load("s3://bucket/folder/cluster=efg")
df3 = df1.union(df2)
```
Note that the glob-pattern approach of finding all parquet files in a Hudi
table and loading them directly is deprecated; you can simply use the Hudi
table base path to read all the data. Hudi manages the layout and versions of
the parquet files it writes, so manually globbing a subset of parquet files
may lead to unexpected behavior, such as reading stale or duplicate file
versions.