ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1020059236
Thank you for looking into this.
I don't know how I could count file groups, so I listed all Parquet files in
both tables instead. There are `535 741` files in table_v1 and `371 102` in
table_v2. That number doesn't surprise me, and if it were any indicator of
performance, then reading from the first table should be slower.
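In case it's useful, file groups can be roughly estimated from that same file listing, assuming Hudi's base-file naming convention `<fileId>_<writeToken>_<instantTime>.parquet`. The `fileNames` sequence below is a hypothetical stand-in for the actual S3 listing output, and this ignores partition boundaries (fileIds are effectively unique, so that should be close enough):

```scala
// Sketch: estimate the number of file groups from a flat list of base-file
// names, assuming Hudi's naming scheme <fileId>_<writeToken>_<instantTime>.parquet.
// Counting distinct fileId prefixes collapses multiple versions of the same
// file group into one.
def countFileGroups(fileNames: Seq[String]): Int =
  fileNames
    .filter(_.endsWith(".parquet"))
    .map(_.split("_").head) // fileId is the first underscore-separated token
    .distinct
    .size

// Hypothetical listing: two versions of one file group plus one other group.
val fileNames = Seq(
  "a1b2c3d4_0-1-0_20220101000000.parquet",
  "a1b2c3d4_0-2-0_20220102000000.parquet",
  "e5f6a7b8_0-1-0_20220101000000.parquet"
)
countFileGroups(fileNames) // 2 distinct fileIds → 2 file groups
```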
You're absolutely right, 15k is too much. I had issues with executors
running out of memory (due to data skew) and tried increasing parallelism. Do
you suspect that could be causing this issue? It's not optimal, but it doesn't
create a lot of small files. Also, the same parallelism was used with both
tables.
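For context, the increased parallelism was set through Hudi's standard shuffle-parallelism write options. A sketch of the kind of write it applies to (the `df` writer and the record-key/precombine options are illustrative placeholders, not the exact job):

```scala
// Illustrative only: how a ~15k write parallelism would typically be
// configured on a Hudi upsert. `df` and the key fields are hypothetical.
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "15000")
  .option("hoodie.insert.shuffle.parallelism", "15000")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("s3://bucket/table_v2")
```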
I'm sorry if I wasn't clear, but I had run the cleaner on both tables before
reading from them. I just tested it again, and I can see that the last instant
is a `*__clean__COMPLETED` commit. What I did was:
1. I ran the cleaner on the second table:
```
spark-submit \
--driver-memory 8G \
--deploy-mode cluster \
--conf "spark.yarn.maxAppAttempts=1" \
--conf "spark.dynamicAllocation.maxExecutors=20" \
--class org.apache.hudi.utilities.HoodieCleaner \
hudi-utilities-bundle_2.12-0.10.0.jar \
--target-base-path s3://bucket/table_v2 \
--hoodie-conf hoodie.cleaner.parallelism=10 \
--spark-master yarn-cluster
```
There was almost nothing to do, so it finished within 2 minutes.
2. I read one of the partitions in the second table:
```
def time[T](func: => T): T = {
  val t0 = System.nanoTime
  val result = func
  val t1 = System.nanoTime
  println("Elapsed time: " + (t1 - t0) / 1000000000 + "s")
  result
}

time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths",
      "s3://bucket/table_v2/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path :
s3://bucket/table_v2/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v2
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v2
HoodieTableConfig: Loading table properties from
s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE,
queryType is: snapshot
DefaultSource: Loading Base File Only View with options
:Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths ->
s3://bucket/table_v2/date=2022-01-01/source=test/type=test,
hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124110227018__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v2
HoodieTableConfig: Loading table properties from
s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
HoodieTableMetaClient: Loading Active commit timeline for
s3://bucket/table_v2
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124110227018__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath
s3://bucket/table_v2
AbstractTableFileSystemView: Took 9286 ms to read 17 instants, 15201
replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition
(date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39,
FileGroupsCreationTime=3, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path:
s3://bucket/table_v2, caching 39 files under
s3://bucket/table_v2/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 8423 ms to read 17 instants, 15201
replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 20s
```
3. For comparison, I read the same partition in the first table:
```
time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths",
      "s3://bucket/table_v1/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path :
s3://bucket/table_v1/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v1
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v1
HoodieTableConfig: Loading table properties from
s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE,
queryType is: snapshot
DefaultSource: Loading Base File Only View with options
:Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths ->
s3://bucket/table_v1/date=2022-01-01/source=test/type=test,
hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124032411__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v1
HoodieTableConfig: Loading table properties from
s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
HoodieTableMetaClient: Loading Active commit timeline for
s3://bucket/table_v1
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124032411__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath
s3://bucket/table_v1
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file
groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition
(date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=20, NumFileGroups=18,
FileGroupsCreationTime=2, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path:
s3://bucket/table_v1, caching 18 files under
s3://bucket/table_v1/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file
groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 1s
```