stackfun opened a new issue #2672:
URL: https://github.com/apache/hudi/issues/2672
**Describe the problem you faced**
After inserting a few hundred million records, upserts started to hang. Adding
the `"hoodie.simple.index.input.storage.level": "DISK_ONLY"` and
`"hoodie.write.status.storage.level": "DISK_ONLY"` properties made the hang go
away, but only temporarily: after inserting a little over a billion records,
upserts started to hang again.
Our record keys are random (SHA-256 hashes), so we chose the SIMPLE index
type. Many of the ingested records are duplicates or updates of existing
records, so the majority of our partitions are touched on each write. No
compaction was scheduled during this run.
**To Reproduce**
Here is our Hudi config:
```
{
"hoodie.table.name": table_name,
"hoodie.datasource.write.hive_style_partitioning": True,
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.partitionpath.field": "partitionpath",
"hoodie.datasource.write.recordkey.field": "sha256",
"hoodie.datasource.write.table.name": table_name,
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.precombine.field": "ts",
"hoodie.datasource.compaction.async.enable": True,
"hoodie.index.type": "SIMPLE",
"hoodie.compact.inline": True,
"hoodie.clean.async": True,
"hoodie.clean.automatic": True,
"hoodie.simple.index.parallelism": "100",
"hoodie.simple.index.input.storage.level": "DISK_ONLY",
"hoodie.write.status.storage.level": "DISK_ONLY",
"hoodie.cleaner.commits.retained": 6,
"hoodie.compact.inline.max.delta.commits": 8,
"hoodie.datasource.hive_sync.enable": True,
"hoodie.datasource.hive_sync.database": hive_db,
"hoodie.datasource.hive_sync.table": table_name,
"hoodie.datasource.hive_sync.username": "hive",
"hoodie.datasource.hive_sync.password": "hive",
"hoodie.datasource.hive_sync.jdbcurl": jdbc_url,
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.partition_fields": "partitionpath",
}
```
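For context, here is a minimal sketch of how a dict like this is typically handed to the Spark datasource writer. The `table_name` and `base_path` values below are placeholders I made up, not taken from our job; the dict is trimmed to a few representative keys. Since `DataFrameWriter.options()` expects string values, the Python booleans and ints are stringified first:

```python
# Sketch: passing a Hudi config dict to the Spark datasource writer.
# table_name and base_path are placeholder values (assumptions).
table_name = "my_table"
base_path = "gs://my_bucket/my_table"

hudi_config = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "SIMPLE",
    "hoodie.simple.index.input.storage.level": "DISK_ONLY",
    "hoodie.write.status.storage.level": "DISK_ONLY",
    "hoodie.datasource.compaction.async.enable": True,
    "hoodie.cleaner.commits.retained": 6,
}

# DataFrameWriter options must be strings, so stringify the Python
# booleans/ints before handing the dict to .options(**...):
options = {
    k: (str(v).lower() if isinstance(v, bool) else str(v))
    for k, v in hudi_config.items()
}

# With a live SparkSession and DataFrame df, the write would then be:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```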
Here is our Spark config:
```
{
"spark.executor.instances": "63",
"spark.executor.cores": "1",
"spark.executor.memory": "5g",
"spark.yarn.am.cores": "1",
"spark.driver.maxResultSize": "3g",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.hive.convertMetastoreParquet": "false",
"spark.task.cpus": "1",
"spark.task.maxFailures": "4",
"spark.driver.memoryOverhead": "2048",
"spark.executor.memoryOverhead": "3072",
"spark.yarn.max.executor.failures": "100",
"spark.memory.fraction": "0.2",
"spark.memory.storageFraction": "0.2",
}
```
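These settings are applied when the session is built. A sketch of that wiring is below; the app name is a placeholder, the dict is trimmed to a few representative keys, and the actual `SparkSession` calls are left commented since they need a live cluster:

```python
# Sketch: applying the Spark config above when constructing the session.
# All values are kept as strings, matching the config block above.
spark_conf = {
    "spark.executor.instances": "63",
    "spark.executor.memory": "5g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.memory.fraction": "0.2",
    "spark.memory.storageFraction": "0.2",
}

# With pyspark available, the session would be built roughly like this:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("hudi-upsert")  # name is an assumption
# for key, value in spark_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```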
**Expected behavior**
The upsert completes and commits successfully instead of hanging.
**Environment Description**
* Hudi version : Tested on 0.6.0 and 0.7.0
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : 2.9.2
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no
* GCP Dataproc: 1.4
**Additional context**
Logs (removed lots of duplicate lines)
```
21/03/10 21:27:17 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:27:17 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
...
21/03/10 21:27:21 INFO org.apache.hudi.HoodieSparkSqlWriter$:
Config.isInlineCompaction ? true
21/03/10 21:27:53 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:27:53 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, ...
21/03/10 21:27:53 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:27:54 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:27:54 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading Active commit timeline for gs://my_bucket/my_table
21/03/10 21:27:55 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210302001642__rollback__COMPLETED],
[20210309013900__deltacommit__COMPLETED], [20210309022132__clean__COMPLETED],
[20210309035721__deltacommit__COMPLETED], ...
21/03/10 21:27:55 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
(repeats)
...
21/03/10 21:28:08 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata table is disabled.
21/03/10 21:28:08 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata table is disabled.
21/03/10 21:28:08 INFO org.apache.hudi.client.SparkRDDWriteClient:
Successfully synced to metadata table
21/03/10 21:28:08 INFO org.apache.hudi.client.AsyncCleanerService: Auto
cleaning is enabled. Running cleaner async to write operation at instant time
20210310212808
21/03/10 21:28:08 INFO org.apache.hudi.async.HoodieAsyncService: Submitting
monitor thread !!
21/03/10 21:28:08 INFO org.apache.hudi.client.AbstractHoodieWriteClient:
Cleaner started
21/03/10 21:28:08 INFO org.apache.hudi.client.AbstractHoodieWriteClient:
Cleaned failed attempts if any
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:28:08 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, ...
21/03/10 21:28:08 INFO org.apache.hudi.async.HoodieAsyncService: Monitoring
thread(s) !!
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading Active commit timeline for gs://my_bucket/my_table
21/03/10 21:28:09 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210302001642__rollback__COMPLETED],
[20210309013900__deltacommit__COMPLETED], [20210309022132__clean__COMPLETED],
[20210309035721__deltacommit__COMPLETED], [20210309040049__commit__COMPLETED],
[20210309041929__deltacommit__COMPLETED], [20210309042254__commit__COMPLETED],
[20210309054850__clean__COMPLETED], ...
21/03/10 21:28:09 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:28:09 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
hdfs-site.xml, __spark_hadoop_conf__.xml,
file:/etc/hive/conf.dist/hive-site.xml], FileSystem:
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@185853f5]
21/03/10 21:28:09 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table/.hoodie/metadata
21/03/10 21:28:10 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
hdfs-site.xml, __spark_hadoop_conf__.xml,
file:/etc/hive/conf.dist/hive-site.xml], FileSystem:
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@185853f5]
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from
gs://my_bucket/my_table/.hoodie/metadata/.hoodie/hoodie.properties
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE)
from gs://my_bucket/my_table/.hoodie/metadata
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210309121838__deltacommit__COMPLETED],
[20210310182547__deltacommit__COMPLETED],
[20210310184634__deltacommit__COMPLETED],
[20210310190731__deltacommit__COMPLETED],
[20210310202059__deltacommit__COMPLETED],
[20210310204636__deltacommit__COMPLETED],
[20210310210716__deltacommit__COMPLETED],
[20210310210814__deltacommit__COMPLETED]]
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 1 ms to
read 0 instants, 0 replaced file groups
21/03/10 21:28:10 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file
system view for partition (files)
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (files) =9, Time taken =39
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups
for partition :files, #FileGroups=1
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView:
NumFiles=9, NumFileGroups=1, FileGroupsCreationTime=2, StoreTimeTaken=1
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (files) =77
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='files',
fileId='0b674bb4-097f-41b7-a3ea-82872cf6056d-0'},
baseCommitTime=20210309121838,
baseFile='HoodieBaseFile{fullPath=gs://my_bucket/my_table/.hoodie/metadata/files/0b674bb4-097f-41b7-a3ea-82872cf6056d-0_0-10-463_20210309121838.hfile,
...
...
21/03/10 21:28:35 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Opened metadata base file from
gs://my_bucket/my_table/.hoodie/metadata/files/0b674bb4-097f-41b7-a3ea-82872cf6056d-0_0-10-463_20210309121838.hfile
at instant 20210309121838
21/03/10 21:28:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table/.hoodie/metadata
...
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of log
files scanned => 7
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: MaxMemoryInBytes
allowed for compaction => 0
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of
entries in MemoryBasedMap in ExternalSpillableMap => 192
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Total size in
bytes of MemoryBasedMap in ExternalSpillableMap => 1493000
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of
entries in DiskBasedMap in ExternalSpillableMap => 0
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Size of file
spilled to disk => 0
21/03/10 21:28:37 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Opened metadata log files from
[gs://my_bucket/my_table/.hoodie/metadata/files/.0b674bb4-097f-41b7-a3ea-82872cf6056d-0_20210309121838.log.1_0-49-32789,
...] at instant 20210310210814(dataset instant=20210310210814, metadata
instant=20210310210814)
....
21/03/10 21:28:37 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata read for key 2018-02-01 took [open, baseFileRead, logMerge] [1413, 72,
13] ms
21/03/10 21:28:37 INFO org.apache.hudi.metadata.BaseTableMetadata: Listed
file in partition from metadata: partition=2018-12, #files=119
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (2018-12) =119, Time taken =1505
21/03/10 21:28:37 INFO org.apache.hudi.metadata.BaseTableMetadata: Listed
file in partition from metadata: partition=2021-09, #files=165
...
21/03/10 21:32:25 INFO
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor: Workload
profile :WorkloadProfile {globalStat=WorkloadStat {numInserts=3948199,
numUpdates=2751299}, partitionStat={...
21/03/10 21:32:25 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file
exists ?gs://my_bucket/my_table/.hoodie/20210310212717.deltacommit.requested
21/03/10 21:32:26 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for
toInstant ?gs://my_bucket/my_table/.hoodie/20210310212717.deltacommit.inflight
21/03/10 21:32:26 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: AvgRecordSize => 899
...
21/03/10 21:32:28 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: For partitionPath :
2018-02-01 Small Files => [SmallFile {location=HoodieRecordLocation
{instantTime=20210309070226, fileId=7c8774cf-f257-472e-9c8b-8b6d76410e0b-0},
sizeBytes=464424}]
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Assigning 1 inserts to
new update bucket 2852
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Total insert buckets for
partition path 2012-01 => [(InsertBucket {bucketNumber=2852, weight=1.0},1.0)]
(repeats several times)
...
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Total Buckets :2853,
buckets info => {0=BucketInfo {bucketType=UPDATE,
fileIdPrefix=b5d67fc9-5e56-4a86-bacd-e9d319e41131-0, partitionPath=2020-01-1},
...
Partition to insert buckets => {2018-03-01=[(InsertBucket
{bucketNumber=2755, weight=1.0},1.0)], ...
UpdateLocations mapped to buckets
=>{2de711c6-a290-42bd-b0d9-6ffb5561ee20-0=2611,
0e5f615a-3d52-4061-8203-5fb604cf6d28-1=2824, ...
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.BaseCommitActionExecutor: Auto commit
disabled for 20210310212717
21/03/10 21:32:34 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='2018-02-01',
fileId='f8f1bdf0-697c-4694-ad75-96c7c8a7e930-0'},
baseCommitTime=20210309100248, ...
21/03/10 21:32:34 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=3, Refresh=0, handle=2, Check=1], Success=true, Query=...
21/03/10 21:32:34 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='2018-02-02',
fileId='03f5ae32-46b9-4291-9b7c-5f918d18c9b2-0'},
baseCommitTime=20210309100248, ...
(repeats hundreds of times)
...
21/03/10 21:33:08 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=7, Refresh=0, handle=6, Check=1], Success=true, Query=...
21/03/10 21:33:08 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=4, Refresh=1, handle=3, Check=0], Success=true, Query=...
21/03/10 21:33:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=29, evicted=0, evictedPerRun=0.0
21/03/10 21:38:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
21/03/10 21:43:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0
21/03/10 21:48:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=119, evicted=0, evictedPerRun=0.0
21/03/10 21:53:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=149, evicted=0, evictedPerRun=0.0
(repeats every 5 minutes until the Spark job times out)
```
Unlike other Hudi writes that succeeded, this hung job does **not** produce a
line like
`INFO org.apache.hudi.HoodieSparkSqlWriter$: No errors. Proceeding to commit
the write.`