ganczarek opened a new issue #4656:
URL: https://github.com/apache/hudi/issues/4656
## Description
I have two tables with a large number of partitions (~300k). Both contain
almost the same data, but were created and updated with slightly different
configurations and versions of Hudi. For some reason I see a significant
difference in file listing time when reading the two tables. The newer table
spends much more time listing files and reads many instants, whereas the older
table reads none (please see the logs below).
The first table is managed by Hudi 0.8.0. It was created with a few INSERT
commits and then updated daily with the UPSERT operation. The table is
auto-cleaned after each commit.
Hudi configuration:
```
HoodieWriteConfig.TABLE_NAME -> "table_v1",
DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "event_id",
DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "generated_at",
DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "date,source,type",
DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGeneratorWithLowerCasePartitionPath].getName,
DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",
HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES -> 200.mb.toBytes.toLong.toString,
HoodieStorageConfig.PARQUET_FILE_MAX_BYTES -> 1.gb.toBytes.toLong.toString,
DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "false",
HoodieMetadataConfig.METADATA_ENABLE_PROP -> "true",
HoodieWriteConfig.UPSERT_PARALLELISM -> "15000"
```
The second table was created after the application pipeline was migrated to
Hudi 0.10.0. It was created with a few INSERT_OVERWRITE commits and then
updated daily with the UPSERT operation. Auto clean is disabled because the
cleaning operation suffered from long file listing times (it always took
~3 hours). Instead, the table is cleaned later with
`org.apache.hudi.utilities.HoodieCleaner`, which takes about 30 minutes.
Hudi configuration:
```
HoodieWriteConfig.TBL_NAME.key -> "table_v2",
KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key -> "event_id",
HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key -> "generated_at",
DataSourceWriteOptions.OPERATION.key -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,
DataSourceWriteOptions.TABLE_TYPE.key -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key -> "date,source,type",
HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key -> classOf[ComplexKeyGenerator].getName,
KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key -> "true",
HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key -> 192.mb.toBytes.toLong.toString,
HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key -> 256.mb.toBytes.toLong.toString,
DataSourceWriteOptions.HIVE_SYNC_ENABLED.key -> "false",
HoodieMetadataConfig.ENABLE.key -> "true",
HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key -> "15000",
HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key -> "false",
HoodieCompactionConfig.AUTO_CLEAN.key -> "false"
```
The only differences I can see between the two tables are:
- use of different Hudi versions (0.8.0 vs 0.10.0)
- different output parquet file sizes
- disabled auto clean
- use of INSERT_OVERWRITE instead of INSERT during initial backfill
I would appreciate help answering a few questions:
- Why is the clean operation so much slower in Hudi 0.10.0 than in 0.8.0
(hours vs minutes)? I know it's because of the number of partitions, but is
it possible to restore the old performance with some configuration changes?
- Why are the file listing times for the two tables so different? How could
this be fixed?
Thanks!
## How tables are read
I cleaned both tables and read a few partitions from each using Hudi 0.10.0.
I disabled table metadata and provided paths to specific partitions in
`READ_PATHS`.
Example:
```scala
spark.read.format("org.apache.hudi").
option("hoodie.metadata.enable", "false").
option("hoodie.datasource.read.paths",
"s3://bucket/table_v1/date=2021-12-30/source=test/type=test,s3://bucket/table_v1/date=2021-12-31/source=test/type=test").
load()
```
**Expected behavior**
It used to take a few seconds to list files in the provided partitions, but
now it takes minutes.
**Environment Description**
* Hudi version : 0.10.0
* Spark version : 3.1.1
* Hadoop version : 3.2.1
* Storage : S3
* Running on Docker? : no
**Stacktrace**
Logs from reading table 1 (fast):
```
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-30/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=28, NumFileGroups=27, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 27 files under s3://bucket/table_v1/date=2021-12-30/source=test/type=test
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-31/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=21, NumFileGroups=20, FileGroupsCreationTime=1, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 20 files under s3://bucket/table_v1/date=2021-12-31/source=test/type=test
```
Logs from reading table 2 (slow):
```
INFO AbstractTableFileSystemView: Took 8508 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 8468 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-19/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=47, NumFileGroups=46, FileGroupsCreationTime=3, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 46 files under s3://bucket/table_v2/date=2021-12-19/source=test/type=test
INFO AbstractTableFileSystemView: Took 8513 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
INFO AbstractTableFileSystemView: Took 9192 ms to read 17 instants, 15201 replaced file groups
INFO ClusteringUtils: Found 0 files in pending clustering operations
INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-21/source=test/type=test)
INFO AbstractTableFileSystemView: addFilesToView: NumFiles=71, NumFileGroups=70, FileGroupsCreationTime=5, StoreTimeTaken=0
INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 70 files under s3://bucket/table_v2/date=2021-12-21/source=test/type=test
```
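To put a number on the overhead, the `Took ... ms to read ... instants` lines
in the table 2 log can be summed. This is only a rough sketch in plain Scala:
the object and method names are made up for illustration and are not part of
Hudi, and the input is just the four timing lines copied from the log above.

```scala
object ReplacedFileGroupTimings {
  // Matches the timing part of an AbstractTableFileSystemView log line.
  // Group 1 is the elapsed milliseconds, group 2 the number of instants read.
  private val Took = """Took (\d+) ms to read (\d+) instants""".r

  /** Elapsed milliseconds from every line that contains a timing entry. */
  def timings(logLines: Seq[String]): Seq[Long] =
    logLines.flatMap(line => Took.findFirstMatchIn(line).map(_.group(1).toLong))

  /** Total milliseconds reported across all timing entries. */
  def totalMillis(logLines: Seq[String]): Long = timings(logLines).sum

  def main(args: Array[String]): Unit = {
    // The four timing lines from the table 2 log (two partitions read).
    val tableV2 = Seq(
      "INFO AbstractTableFileSystemView: Took 8508 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 8468 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 8513 ms to read 17 instants, 15201 replaced file groups",
      "INFO AbstractTableFileSystemView: Took 9192 ms to read 17 instants, 15201 replaced file groups"
    )
    // prints "table_v2: 34681 ms over 4 timing lines"
    println(s"table_v2: ${totalMillis(tableV2)} ms over ${timings(tableV2).size} timing lines")
  }
}
```

For the two partitions in the log that is about 17 seconds per partition spent
reading the same 17 instants, while the table 1 log reports 0 ms for every such
line.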