ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1020059236
Thank you for looking into this.
I don't know how I could count file groups, so I listed all Parquet files in
both tables instead. There are `535 741` files in table_v1 and `371 102` in
table_v2. That number doesn't surprise me, and if it were any indicator of
performance, then reading from the first table should be slower.
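In case it's useful, file groups can be roughly estimated from that same file listing, assuming Hudi's base-file naming convention `<fileId>_<writeToken>_<instantTime>.parquet`. The `fileNames` sequence below is a hypothetical stand-in for the actual S3 listing output, and this ignores partition boundaries (fileIds are effectively unique, so that should be close enough):

```scala
// Sketch: estimate the number of file groups from a flat list of base-file
// names, assuming Hudi's naming scheme <fileId>_<writeToken>_<instantTime>.parquet.
// Counting distinct fileId prefixes collapses multiple versions of the same
// file group into one.
def countFileGroups(fileNames: Seq[String]): Int =
  fileNames
    .filter(_.endsWith(".parquet"))
    .map(_.split("_").head) // fileId is the first underscore-separated token
    .distinct
    .size

// Hypothetical listing: two versions of one file group plus one other group.
val fileNames = Seq(
  "a1b2c3d4_0-1-0_20220101000000.parquet",
  "a1b2c3d4_0-2-0_20220102000000.parquet",
  "e5f6a7b8_0-1-0_20220101000000.parquet"
)
countFileGroups(fileNames) // 2 distinct fileIds → 2 file groups
```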
You're absolutely right, 15k is too much. I had issues with executors
running out of memory (due to data skew) and tried increasing parallelism. Do
you suspect that could be causing this issue? It's not optimal, but it doesn't
create a lot of small files. Also, the same parallelism was used with both
tables.
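For context, the increased parallelism was set through Hudi's standard shuffle-parallelism write options. A sketch of the kind of write it applies to (the `df` writer and the record-key/precombine options are illustrative placeholders, not the exact job):

```scala
// Illustrative only: how a ~15k write parallelism would typically be
// configured on a Hudi upsert. `df` and the key fields are hypothetical.
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "15000")
  .option("hoodie.insert.shuffle.parallelism", "15000")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save("s3://bucket/table_v2")
```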
I'm sorry if I wasn't clear, but I had run the cleaner on both tables before
reading from them. I just tested it again, and I can see that the last instant
is a `*__clean__COMPLETED` commit. What I did was:
1. I ran the cleaner on the second table:
```
spark-submit \
--driver-memory 8G \
--deploy-mode cluster \
--conf "spark.yarn.maxAppAttempts=1" \
--conf "spark.dynamicAllocation.maxExecutors=20" \
--class org.apache.hudi.utilities.HoodieCleaner \
hudi-utilities-bundle_2.12-0.10.0.jar \
--target-base-path s3://bucket/table_v2 \
--hoodie-conf hoodie.cleaner.parallelism=10 \
--spark-master yarn-cluster
```
There was almost nothing to do, so it finished within 2 minutes.
2. I read one of the partitions in the second table:
```
def time[T](func: => T): T = {
  val t0 = System.nanoTime
  val result = func
  val t1 = System.nanoTime
  println("Elapsed time: " + (t1 - t0) / 1000000000 + "s")
  result
}

time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths",
      "s3://bucket/table_v2/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path :
s3://bucket/table_v2/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v2
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v2
HoodieTableConfig: Loading table properties from
s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE,
queryType is: snapshot
DefaultSource: Loading Base File Only View with options
:Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths ->
s3://bucket/table_v2/date=2022-01-01/source=test/type=test,
hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124110227018__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v2
HoodieTableConfig: Loading table properties from
s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
HoodieTableMetaClient: Loading Active commit timeline for
s3://bucket/table_v2
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124110227018__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath
s3://bucket/table_v2
AbstractTableFileSystemView: Took 9286 ms to read 17 instants, 15201
replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition
(date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39,
FileGroupsCreationTime=3, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path:
s3://bucket/table_v2, caching 39 files under
s3://bucket/table_v2/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 8423 ms to read 17 instants, 15201
replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 20s
```
3. For comparison, I read the same partition in the first table:
```
time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths",
      "s3://bucket/table_v1/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path :
s3://bucket/table_v1/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v1
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v1
HoodieTableConfig: Loading table properties from
s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE,
queryType is: snapshot
DefaultSource: Loading Base File Only View with options
:Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths ->
s3://bucket/table_v1/date=2022-01-01/source=test/type=test,
hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124032411__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from
s3://bucket/table_v1
HoodieTableConfig: Loading table properties from
s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
HoodieTableMetaClient: Loading Active commit timeline for
s3://bucket/table_v1
HoodieActiveTimeline: Loaded instants upto :
Option{val=[20220124032411__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath
s3://bucket/table_v1
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file
groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition
(date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=20, NumFileGroups=18,
FileGroupsCreationTime=2, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path:
s3://bucket/table_v1, caching 18 files under
s3://bucket/table_v1/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file
groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 1s
```