parisni opened a new issue, #6056:
URL: https://github.com/apache/hudi/issues/6056
Spark 3.2.1
Hudi 0.11.1
----------------
I am populating Hudi tables on an hourly basis with the insert operation.
The table had previously been backfilled with the bulk-insert operation.
For at least the last ~50 commits, the metadata table has not been cleaned
or compacted.
As a result:
- reading the table gets very slow (more than 100 log files to merge)
- I need to increase `spark.hadoop.fs.s3a.connection.maximum`, otherwise it
fails (default = 50)
- cleaning the table (the Hudi table itself, not the metadata) fails because
of too many concurrent AWS connections
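For reference, here is how I bump that pool (a minimal sketch; the value 200 is an illustrative assumption, and the right size depends on how many log files a read has to touch):

```python
# Sketch of the S3A connection-pool override mentioned above.
# The value 200 is an illustrative assumption, not a recommendation.
s3a_overrides = {
    # Default is 50; merging ~100 metadata log files can exhaust the pool.
    "spark.hadoop.fs.s3a.connection.maximum": "200",
}
print(s3a_overrides)
```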
As shown below, the commit timeline does not contain any clean or compaction
commits (only delta-commits).
There are almost 500 files like this in `table/.hoodie/metadata/files`:
```
2022-06-22 09:36:36 22518 .files-0000_20220610072334049001.log.7_0-298-37469
2022-06-22 09:36:36 19557 .files-0000_20220610072334049001.log.80_0-1867-171224
2022-06-22 09:36:36 19557 .files-0000_20220610072334049001.log.81_0-1893-173673
2022-06-22 09:36:36 19384 .files-0000_20220610072334049001.log.82_0-1919-176124
2022-06-22 09:36:36 19286 .files-0000_20220610072334049001.log.83_0-1945-178573
2022-06-22 09:36:36 19765 .files-0000_20220610072334049001.log.84_0-1971-181030
2022-06-22 09:36:36 18787 .files-0000_20220610072334049001.log.85_0-1997-183481
2022-06-22 09:36:36 20506 .files-0000_20220610072334049001.log.86_0-2023-185952
2022-06-22 09:36:36 20287 .files-0000_20220610072334049001.log.87_0-2049-188407
2022-06-22 09:36:36 21197 .files-0000_20220610072334049001.log.88_0-2075-190878
2022-06-22 09:36:36 20361 .files-0000_20220610072334049001.log.89_0-2101-193345
2022-06-22 09:36:36 21946 .files-0000_20220610072334049001.log.8_0-333-41607
2022-06-22 09:36:36 20353 .files-0000_20220610072334049001.log.90_0-2127-195824
2022-06-22 09:36:36 21056 .files-0000_20220610072334049001.log.91_0-2153-198297
2022-06-22 09:36:36 20217 .files-0000_20220610072334049001.log.92_0-2179-200772
2022-06-22 09:36:36 19957 .files-0000_20220610072334049001.log.93_0-2205-203251
2022-06-22 09:36:36 19971 .files-0000_20220610072334049001.log.94_0-2231-205734
2022-06-22 09:36:36 20458 .files-0000_20220610072334049001.log.95_0-2257-208215
2022-06-22 09:36:36 19825 .files-0000_20220610072334049001.log.96_0-2283-210700
2022-06-22 09:36:36 19871 .files-0000_20220610072334049001.log.97_0-2309-213185
2022-06-22 09:36:36 20360 .files-0000_20220610072334049001.log.98_0-2335-215666
2022-06-22 09:36:36 20733 .files-0000_20220610072334049001.log.99_0-2361-218159
2022-06-22 09:36:36 21107 .files-0000_20220610072334049001.log.9_0-21-2705
2022-06-22 09:36:36 96 .hoodie_partition_metadata
2022-06-22 09:36:36 449928 files-0000_0-881-108328_20220610072334049001.hfile
```
This is the `table/.hoodie/metadata/.hoodie` folder, showing the last
compaction (15 days ago) and only delta-commits since:
```
2022-06-22 09:36:13 0 20220609223827922001.compaction.inflight
2022-06-22 09:36:13 1962 20220609223827922001.compaction.requested
2022-06-22 09:36:13 7566 20220609223827922001.commit
....
2022-07-06 16:34:22 59081 20220706142608254.deltacommit
2022-07-06 17:03:59 0 20220706145535896.deltacommit.requested
2022-07-06 17:04:03 3047 20220706145535896.deltacommit.inflight
2022-07-06 17:04:07 59189 20220706145535896.deltacommit
2022-07-06 17:42:50 0 20220706153354902.deltacommit.requested
2022-07-06 17:42:53 3047 20220706153354902.deltacommit.inflight
2022-07-06 17:42:57 59297 20220706153354902.deltacommit
2022-07-06 18:05:02 0 20220706155551822.deltacommit.requested
2022-07-06 18:05:05 3047 20220706155551822.deltacommit.inflight
2022-07-06 18:05:09 59405 20220706155551822.deltacommit
2022-07-06 18:35:19 0 20220706162607724.deltacommit.requested
2022-07-06 18:35:22 3047 20220706162607724.deltacommit.inflight
2022-07-06 18:35:26 59513 20220706162607724.deltacommit
2022-07-06 18:46:12 0 20220706164400460.deltacommit.requested
2022-07-06 18:46:15 3049 20220706164400460.deltacommit.inflight
2022-07-06 18:46:19 59625 20220706164400460.deltacommit
2022-07-06 18:47:02 0 20220706164655555.deltacommit.requested
2022-07-06 18:47:04 548 20220706164655555.deltacommit.inflight
2022-07-06 18:47:05 6042 20220706164655555.deltacommit
2022-07-06 19:34:41 0 20220706172531401.deltacommit.requested
2022-07-06 19:34:44 3047 20220706172531401.deltacommit.inflight
2022-07-06 19:34:48 59729 20220706172531401.deltacommit
2022-07-06 21:48:19 0 20220706193852908.deltacommit.requested
2022-07-06 21:48:22 3047 20220706193852908.deltacommit.inflight
2022-07-06 21:48:26 59837 20220706193852908.deltacommit
2022-07-06 22:57:42 0 20220706204842179.deltacommit.requested
2022-07-06 22:57:45 3047 20220706204842179.deltacommit.inflight
2022-07-06 22:57:49 59945 20220706204842179.deltacommit
```
The workaround is to delete the metadata table, which results in the next
commit creating a new one from scratch, with fewer log files.
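A minimal sketch of that workaround (the table path below is a hypothetical placeholder, and this only builds the destructive AWS CLI command rather than running it):

```python
# Sketch of the workaround: drop <table>/.hoodie/metadata so that the next
# commit rebuilds the metadata table. The path is a hypothetical placeholder.
def metadata_delete_command(table_path: str) -> str:
    """Build the `aws s3 rm` command that removes the metadata table."""
    return f"aws s3 rm --recursive {table_path.rstrip('/')}/.hoodie/metadata/"

print(metadata_delete_command("s3://my-bucket/warehouse/my_table"))
# prints: aws s3 rm --recursive s3://my-bucket/warehouse/my_table/.hoodie/metadata/
```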
When reading the table, I get almost 100k log lines like the following
(brand-new decompressor):
```
90836 [Driver] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader - Merging the final data blocks
90836 [Driver] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader - Number of remaining logblocks to merge 109
90935 [Driver] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
90935 [Driver] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
90935 [Driver] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
90935 [Driver] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
...
```
Here are the Hudi configs I use to insert the data:
```
{"hoodie.datasource.write.operation", "INSERT"},
{"hoodie.parquet.compression.codec", "zstd"},
{"hoodie.datasource.write.row.writer.enable", "false"},
{"hoodie.bulkinsert.sort.mode", "NONE"},
{"hoodie.embed.timeline.server", "false"},
{"hoodie.embed.timeline.server.async", "true"},
{"hoodie.client.heartbeat.interval_in_ms", "240000"},
{"hoodie.write.markers.type", "DIRECT"},
{"hoodie.avro.schema.validate", "false"},
{"hoodie.clean.async", "false"},
{"hoodie.clean.automatic", "false"},
{"hoodie.clean.allow.multiple", "false"},
{"hoodie.cleaner.parallelism", "200"},
{"hoodie.clean.max.commits", "1"},
{"hoodie.cleaner.commits.retained", "24"},
{"hoodie.keep.min.commits", "199"},
{"hoodie.keep.max.commits", "200"},
{"hoodie.archive.async", "false"},
{"hoodie.archive.automatic", "true"},
{"hoodie.archive.merge.enable", "true"},
{"compaction.schedule.enabled", "true"},
{"hoodie.compact.inline", "false"},
{"hoodie.compact.inline.max.delta.commits", "24"},
{"index.global.enabled", "true"},
{"hoodie.bloom.index.prune.by.ranges", "true"},
{"hoodie.bloom.index.use.caching", "true"},
{"hoodie.metadata.enable", "true"}
```
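For comparison, a minimal sketch of the knobs that, as far as I understand the Hudi docs, govern metadata-table compaction (the values below are illustrative assumptions, not my production settings):

```python
# Hedged sketch (assumption: these options behave as documented in Hudi 0.11).
metadata_compaction_overrides = {
    # Metadata-table services are driven by the data-table writer, so (my
    # understanding, may need confirmation) automatic cleaning should stay
    # enabled -- I currently set it to "false" above.
    "hoodie.clean.automatic": "true",
    # Number of delta commits on the metadata table after which its log
    # files are compacted into a new base (HFile) file; default is 10.
    "hoodie.metadata.compact.max.delta.commits": "10",
}
print(metadata_compaction_overrides)
```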
How can I restore regular compaction on the metadata table?