haggy commented on issue #5211:
URL: https://github.com/apache/hudi/issues/5211#issuecomment-1208430291
> The prior wildcard pattern is not meant to be used for a subset of
> partitions in a Hudi table. Instead of globbing selective paths, you can still
> load the Hudi table with the base path and use `.filter` with the partition
> values so Spark does the partition pruning without scanning the whole table.
@yihua I was not able to reproduce this behavior in our data lake tables.
Our tables are partitioned by `year/month/day/hour`. When loading the table
from its base path and filtering on partition columns, Hudi first performs a
full file listing of the entire table for metadata (our table is huge, so the
listing never finishes). The only way I was able to run a query that prunes
partitions in Hudi 0.10.1 was the following (PySpark):
```python
read_paths = [
    f"{datalake_url}/year=2022/month=7/*/*",
]
hudi_opts = {
    "hoodie.datasource.query.type": "snapshot",
    # Disable the Hudi file index so the explicit read paths are used
    "hoodie.file.index.enable": False,
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
This resulted in loading only the partitions for `year=2022` and `month=7`.
In Hudi `0.9.x` we were able to use Hive-style partition glob paths, but that
went away in `0.10.x` due to changes in glob-path parsing:
`f"{datalake_url}/year=2022/month=7/day=*/hour=*"`
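For contrast, the base-path plus `.filter` approach suggested above would look
roughly like the sketch below. The helper name `load_pruned` and the literal
filter values are illustrative; the column names follow the
`year/month/day/hour` partitioning described in this thread.

```python
# Sketch of the recommended approach: load from the base path and let
# Spark prune partitions via .filter (assumes partition columns
# year/month as in this thread; adjust for your table).
def load_pruned(spark, base_path):
    """Load a Hudi table and restrict it to year=2022/month=7 via filters."""
    return (
        spark.read
        .format("hudi")
        .load(base_path)  # base path only, no globs
        .filter("year = 2022 AND month = 7")
    )
```

In our case this variant triggered the full-table listing described above, so
it is shown here only to make the comparison concrete.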
@kartik18 Could you try a couple of things to confirm whether you see the
same behavior?
## Load DF with file index disabled
```python
read_paths = [
    f"{datalake_url}/cluster=abc/*",
    f"{datalake_url}/cluster=def/*",
]
hudi_opts = {
    "hoodie.datasource.query.type": "snapshot",
    # File index disabled: Hudi should honor the explicit read paths
    "hoodie.file.index.enable": False,
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
## Load DF with file index enabled
```python
read_paths = [
    f"{datalake_url}/cluster=abc/*",
    f"{datalake_url}/cluster=def/*",
]
hudi_opts = {
    # File index enabled (default): only the read-paths option differs
    "hoodie.datasource.query.type": "snapshot",
    "hoodie.datasource.read.paths": ",".join(read_paths),
}
df = (
    spark.read
    .format("hudi")
    .options(**hudi_opts)
    .load()
)
df.count()
```
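To tell the two runs apart, it may help to look at which files Spark actually
plans to scan rather than just the row count. The helper below is a
hypothetical convenience (not part of Hudi); it only uses the standard
`DataFrame.inputFiles()` API.

```python
# Hypothetical helper to compare the two experiments above: report how
# many files Spark will read for a DataFrame, so you can see whether the
# read-paths restriction (or partition pruning) took effect.
def summarize_scan(df):
    """Return the input-file count and a few sample paths for `df`."""
    files = df.inputFiles()  # files Spark plans to scan for this DataFrame
    return len(files), files[:5]
```

Running `summarize_scan(df)` after each variant and comparing the counts
should show whether the file-index setting changes what gets listed.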