parisni opened a new issue, #6056:
URL: https://github.com/apache/hudi/issues/6056

   spark 3.2.1
   hudi 0.11.1
   ----------------
   
   I am populating Hudi tables on an hourly basis with the insert operation.
   The table had previously been backfilled with the bulk-insert operation.
   
   For at least the last ~50 commits, the metadata table has not been cleaned or compacted anymore.
   
   As a result:
   - reading the table gets very slow (> 100 merge files)
   - I need to increase `spark.hadoop.fs.s3a.connection.maximum`, otherwise it fails (default = 50)
   - cleaning the table fails (not metadata cleaning, but the Hudi table itself) because of too many concurrent AWS connections
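
   For the connection-pool failures, raising the pool above the default of 50 is the stopgap; a minimal spark-submit sketch (the value 200 is an arbitrary choice, not a recommendation):
   ```
   spark-submit \
     --conf spark.hadoop.fs.s3a.connection.maximum=200 \
     ...   # rest of the job arguments unchanged
   ```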
   
   
   As shown below, the commit timeline does not have any clean/compaction commits (only delta-commits).
   
   Almost 500 files like this in `table/.hoodie/metadata/files`:
   ```
   2022-06-22 09:36:36      22518 .files-0000_20220610072334049001.log.7_0-298-37469
   2022-06-22 09:36:36      19557 .files-0000_20220610072334049001.log.80_0-1867-171224
   2022-06-22 09:36:36      19557 .files-0000_20220610072334049001.log.81_0-1893-173673
   2022-06-22 09:36:36      19384 .files-0000_20220610072334049001.log.82_0-1919-176124
   2022-06-22 09:36:36      19286 .files-0000_20220610072334049001.log.83_0-1945-178573
   2022-06-22 09:36:36      19765 .files-0000_20220610072334049001.log.84_0-1971-181030
   2022-06-22 09:36:36      18787 .files-0000_20220610072334049001.log.85_0-1997-183481
   2022-06-22 09:36:36      20506 .files-0000_20220610072334049001.log.86_0-2023-185952
   2022-06-22 09:36:36      20287 .files-0000_20220610072334049001.log.87_0-2049-188407
   2022-06-22 09:36:36      21197 .files-0000_20220610072334049001.log.88_0-2075-190878
   2022-06-22 09:36:36      20361 .files-0000_20220610072334049001.log.89_0-2101-193345
   2022-06-22 09:36:36      21946 .files-0000_20220610072334049001.log.8_0-333-41607
   2022-06-22 09:36:36      20353 .files-0000_20220610072334049001.log.90_0-2127-195824
   2022-06-22 09:36:36      21056 .files-0000_20220610072334049001.log.91_0-2153-198297
   2022-06-22 09:36:36      20217 .files-0000_20220610072334049001.log.92_0-2179-200772
   2022-06-22 09:36:36      19957 .files-0000_20220610072334049001.log.93_0-2205-203251
   2022-06-22 09:36:36      19971 .files-0000_20220610072334049001.log.94_0-2231-205734
   2022-06-22 09:36:36      20458 .files-0000_20220610072334049001.log.95_0-2257-208215
   2022-06-22 09:36:36      19825 .files-0000_20220610072334049001.log.96_0-2283-210700
   2022-06-22 09:36:36      19871 .files-0000_20220610072334049001.log.97_0-2309-213185
   2022-06-22 09:36:36      20360 .files-0000_20220610072334049001.log.98_0-2335-215666
   2022-06-22 09:36:36      20733 .files-0000_20220610072334049001.log.99_0-2361-218159
   2022-06-22 09:36:36      21107 .files-0000_20220610072334049001.log.9_0-21-2705
   2022-06-22 09:36:36         96 .hoodie_partition_metadata
   2022-06-22 09:36:36     449928 files-0000_0-881-108328_20220610072334049001.hfile
   ```
   
   This is the `table/.hoodie/metadata/.hoodie` folder, showing the last compaction (15 days ago) and only delta-commits since:
   ```
   2022-06-22 09:36:13          0 20220609223827922001.compaction.inflight
   2022-06-22 09:36:13       1962 20220609223827922001.compaction.requested
   2022-06-22 09:36:13       7566 20220609223827922001.commit
   
   ....
   
   2022-07-06 16:34:22      59081 20220706142608254.deltacommit
   2022-07-06 17:03:59          0 20220706145535896.deltacommit.requested
   2022-07-06 17:04:03       3047 20220706145535896.deltacommit.inflight
   2022-07-06 17:04:07      59189 20220706145535896.deltacommit
   2022-07-06 17:42:50          0 20220706153354902.deltacommit.requested
   2022-07-06 17:42:53       3047 20220706153354902.deltacommit.inflight
   2022-07-06 17:42:57      59297 20220706153354902.deltacommit
   2022-07-06 18:05:02          0 20220706155551822.deltacommit.requested
   2022-07-06 18:05:05       3047 20220706155551822.deltacommit.inflight
   2022-07-06 18:05:09      59405 20220706155551822.deltacommit
   2022-07-06 18:35:19          0 20220706162607724.deltacommit.requested
   2022-07-06 18:35:22       3047 20220706162607724.deltacommit.inflight
   2022-07-06 18:35:26      59513 20220706162607724.deltacommit
   2022-07-06 18:46:12          0 20220706164400460.deltacommit.requested
   2022-07-06 18:46:15       3049 20220706164400460.deltacommit.inflight
   2022-07-06 18:46:19      59625 20220706164400460.deltacommit
   2022-07-06 18:47:02          0 20220706164655555.deltacommit.requested
   2022-07-06 18:47:04        548 20220706164655555.deltacommit.inflight
   2022-07-06 18:47:05       6042 20220706164655555.deltacommit
   2022-07-06 19:34:41          0 20220706172531401.deltacommit.requested
   2022-07-06 19:34:44       3047 20220706172531401.deltacommit.inflight
   2022-07-06 19:34:48      59729 20220706172531401.deltacommit
   2022-07-06 21:48:19          0 20220706193852908.deltacommit.requested
   2022-07-06 21:48:22       3047 20220706193852908.deltacommit.inflight
   2022-07-06 21:48:26      59837 20220706193852908.deltacommit
   2022-07-06 22:57:42          0 20220706204842179.deltacommit.requested
   2022-07-06 22:57:45       3047 20220706204842179.deltacommit.inflight
   2022-07-06 22:57:49      59945 20220706204842179.deltacommit
   
   ```
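
   As a quick sanity check, the accumulation can be counted from the listing above. This is a hypothetical standalone helper (not a Hudi API) that works on a plain directory listing:
   ```python
   # Hypothetical helper (not part of Hudi): given a listing of
   # table/.hoodie/metadata/.hoodie, count completed deltacommits that
   # landed after the last requested compaction. Hudi instant timestamps
   # sort lexicographically, so plain string comparison is enough.
   def deltacommits_since_last_compaction(filenames):
       compactions = sorted(
           f.split(".")[0] for f in filenames if f.endswith(".compaction.requested")
       )
       # completed deltacommits end in ".deltacommit" exactly;
       # ".requested"/".inflight" variants are still in progress
       deltas = [f.split(".")[0] for f in filenames if f.endswith(".deltacommit")]
       if not compactions:
           return len(deltas)
       return sum(1 for ts in deltas if ts > compactions[-1])
   ```
   Run against the listing above, this reports how many metadata deltacommits have piled up since the 2022-06-09 compaction without a new compaction being scheduled.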
   
   The workaround is to delete the metadata table, which results in the next commit creating a new one from scratch, with fewer log files.
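
   If deleting the folder by hand is undesirable, my understanding (an assumption on my side, please correct me) is that the same rebuild can be driven through config, since a write with the metadata table disabled drops it:
   ```
                     // assumption: one write with metadata disabled makes the
                     // writer drop table/.hoodie/metadata
                     {"hoodie.metadata.enable", "false"},  // write N: drops the metadata table
                     {"hoodie.metadata.enable", "true"}    // write N+1: rebuilds it from scratch
   ```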
   
   When reading the table, I get the following, with almost 100k log lines like the `brand-new decompressor` ones below:
   ```
   90836 [Driver] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader  - Merging the final data blocks
   90836 [Driver] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader  - Number of remaining logblocks to merge 109
   90935 [Driver] INFO  org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
   90935 [Driver] INFO  org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
   90935 [Driver] INFO  org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
   90935 [Driver] INFO  org.apache.hadoop.io.compress.CodecPool  - Got brand-new decompressor [.gz]
   ...
   ```
   
   Here are the Hudi configs I use to insert the data:
   
   ```
                     {"hoodie.datasource.write.operation", "INSERT"},
                     {"hoodie.parquet.compression.codec", "zstd"},
                     {"hoodie.datasource.write.row.writer.enable", "false"},
                     {"hoodie.bulkinsert.sort.mode", "NONE"},
                     {"hoodie.embed.timeline.server", "false"},
                     {"hoodie.embed.timeline.server.async", "true"},
                     {"hoodie.client.heartbeat.interval_in_ms", "240000"},
                     {"hoodie.write.markers.type", "DIRECT"},
                     {"hoodie.avro.schema.validate", "false"},
                     {"hoodie.clean.async", "false"},
                     {"hoodie.clean.automatic", "false"},
                     {"hoodie.clean.allow.multiple", "false"},
                     {"hoodie.cleaner.parallelism", "200"},
                     {"hoodie.clean.max.commits", "1"},
                     {"hoodie.cleaner.commits.retained", "24"},
                     {"hoodie.keep.min.commits", "199"},
                     {"hoodie.keep.max.commits", "200"},
                     {"hoodie.archive.async", "false"},
                     {"hoodie.archive.automatic", "true"},
                     {"hoodie.archive.merge.enable", "true"},
                     {"compaction.schedule.enabled", "true"},
                     {"hoodie.compact.inline", "false"},
                     {"hoodie.compact.inline.max.delta.commits", "24"},
                     {"index.global.enabled", "true"},
                     {"hoodie.bloom.index.prune.by.ranges", "true"},
                     {"hoodie.bloom.index.use.caching", "true"},
                     {"hoodie.metadata.enable", "true"}
   ```
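
   For reference (my reading of the 0.11 config docs, so an assumption): metadata-table compaction has its own knob and is not governed by `hoodie.compact.inline.max.delta.commits` above:
   ```
                     // assumed default is 10 in 0.11.x; not set in my config
                     {"hoodie.metadata.compact.max.delta.commits", "10"}
   ```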
   
   How can I restore smooth compaction on the metadata table?

