stackfun opened a new issue #2672:
URL: https://github.com/apache/hudi/issues/2672
**Describe the problem you faced**
After inserting a few hundred million records, upserts started to hang. Adding
the `"hoodie.simple.index.input.storage.level": "DISK_ONLY"` and
`"hoodie.write.status.storage.level": "DISK_ONLY"` properties made the hang go
away, but only temporarily: after inserting a little over a billion records,
upserts started to hang again.
Our record keys are random (SHA-256 hashes), so we chose the SIMPLE index
type. Many of the ingested records are duplicates or updates of existing
records, so the majority of our partitions are touched on each write. No
compaction was scheduled during this run.
**To Reproduce**
Here is our Hudi config:
```
{
"hoodie.table.name": table_name,
"hoodie.datasource.write.hive_style_partitioning": True,
"hoodie.datasource.write.keygenerator.class":
"org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.operation": "upsert",
"hoodie.datasource.write.partitionpath.field": "partitionpath",
"hoodie.datasource.write.recordkey.field": "sha256",
"hoodie.datasource.write.table.name": table_name,
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.precombine.field": "ts",
"hoodie.datasource.compaction.async.enable": True,
"hoodie.index.type": "SIMPLE",
"hoodie.compact.inline": True,
"hoodie.clean.async": True,
"hoodie.clean.automatic": True,
"hoodie.simple.index.parallelism": "100",
"hoodie.simple.index.input.storage.level": "DISK_ONLY",
"hoodie.write.status.storage.level": "DISK_ONLY",
"hoodie.cleaner.commits.retained": 6,
"hoodie.compact.inline.max.delta.commits": 8,
"hoodie.datasource.hive_sync.enable": True,
"hoodie.datasource.hive_sync.database": hive_db,
"hoodie.datasource.hive_sync.table": table_name,
"hoodie.datasource.hive_sync.username": "hive",
"hoodie.datasource.hive_sync.password": "hive",
"hoodie.datasource.hive_sync.jdbcurl": jdbc_url,
"hoodie.datasource.hive_sync.partition_extractor_class":
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.partition_fields": "partitionpath",
}
```
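For context, here is a minimal sketch of how a dict like this is typically handed to the Spark datasource writer. The `table_name` and `base_path` values below are placeholders I made up, not taken from our job; the dict is trimmed to a few representative keys. Since `DataFrameWriter.options()` expects string values, the Python booleans and ints are stringified first:

```python
# Sketch: passing a Hudi config dict to the Spark datasource writer.
# table_name and base_path are placeholder values (assumptions).
table_name = "my_table"
base_path = "gs://my_bucket/my_table"

hudi_config = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "SIMPLE",
    "hoodie.simple.index.input.storage.level": "DISK_ONLY",
    "hoodie.write.status.storage.level": "DISK_ONLY",
    "hoodie.datasource.compaction.async.enable": True,
    "hoodie.cleaner.commits.retained": 6,
}

# DataFrameWriter options must be strings, so stringify the Python
# booleans/ints before handing the dict to .options(**...):
options = {
    k: (str(v).lower() if isinstance(v, bool) else str(v))
    for k, v in hudi_config.items()
}

# With a live SparkSession and DataFrame df, the write would then be:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```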
Here is our Spark config:
```
{
"spark.executor.instances": "63",
"spark.executor.cores": "1",
"spark.executor.memory": "5g",
"spark.yarn.am.cores": "1",
"spark.driver.maxResultSize": "3g",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.hive.convertMetastoreParquet": "false",
"spark.task.cpus": "1",
"spark.task.maxFailures": "4",
"spark.driver.memoryOverhead": "2048",
"spark.executor.memoryOverhead": "3072",
"spark.yarn.max.executor.failures": "100",
"spark.memory.fraction": "0.2",
"spark.memory.storageFraction": "0.2",
}
```
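These settings are applied when the session is built. A sketch of that wiring is below; the app name is a placeholder, the dict is trimmed to a few representative keys, and the actual `SparkSession` calls are left commented since they need a live cluster:

```python
# Sketch: applying the Spark config above when constructing the session.
# All values are kept as strings, matching the config block above.
spark_conf = {
    "spark.executor.instances": "63",
    "spark.executor.memory": "5g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.memory.fraction": "0.2",
    "spark.memory.storageFraction": "0.2",
}

# With pyspark available, the session would be built roughly like this:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("hudi-upsert")  # name is an assumption
# for key, value in spark_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```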
**Expected behavior**
The upsert completes and commits successfully instead of hanging.
**Environment Description**
* Hudi version : Tested on 0.6.0 and 0.7.0
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : 2.9.2
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : no
* GCP Dataproc: 1.4
**Additional context**
Logs (removed lots of duplicate lines)
```
21/03/10 21:27:17 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:27:17 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
...
21/03/10 21:27:21 INFO org.apache.hudi.HoodieSparkSqlWriter$:
Config.isInlineCompaction ? true
21/03/10 21:27:53 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:27:53 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, ...
21/03/10 21:27:53 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:27:54 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:27:54 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading Active commit timeline for gs://my_bucket/my_table
21/03/10 21:27:55 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210302001642__rollback__COMPLETED],
[20210309013900__deltacommit__COMPLETED], [20210309022132__clean__COMPLETED],
[20210309035721__deltacommit__COMPLETED], ...
21/03/10 21:27:55 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
(repeats)
...
21/03/10 21:28:08 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata table is disabled.
21/03/10 21:28:08 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata table is disabled.
21/03/10 21:28:08 INFO org.apache.hudi.client.SparkRDDWriteClient:
Successfully synced to metadata table
21/03/10 21:28:08 INFO org.apache.hudi.client.AsyncCleanerService: Auto
cleaning is enabled. Running cleaner async to write operation at instant time
20210310212808
21/03/10 21:28:08 INFO org.apache.hudi.async.HoodieAsyncService: Submitting
monitor thread !!
21/03/10 21:28:08 INFO org.apache.hudi.client.AbstractHoodieWriteClient:
Cleaner started
21/03/10 21:28:08 INFO org.apache.hudi.client.AbstractHoodieWriteClient:
Cleaned failed attempts if any
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:28:08 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, ...
21/03/10 21:28:08 INFO org.apache.hudi.async.HoodieAsyncService: Monitoring
thread(s) !!
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:28:08 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading Active commit timeline for gs://my_bucket/my_table
21/03/10 21:28:09 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210302001642__rollback__COMPLETED],
[20210309013900__deltacommit__COMPLETED], [20210309022132__clean__COMPLETED],
[20210309035721__deltacommit__COMPLETED], [20210309040049__commit__COMPLETED],
[20210309041929__deltacommit__COMPLETED], [20210309042254__commit__COMPLETED],
[20210309054850__clean__COMPLETED], ...
21/03/10 21:28:09 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table
21/03/10 21:28:09 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
hdfs-site.xml, __spark_hadoop_conf__.xml,
file:/etc/hive/conf.dist/hive-site.xml], FileSystem:
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@185853f5]
21/03/10 21:28:09 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from gs://my_bucket/my_table/.hoodie/hoodie.properties
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET)
from gs://my_bucket/my_table
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table/.hoodie/metadata
21/03/10 21:28:10 INFO org.apache.hudi.common.fs.FSUtils: Hadoop
Configuration: fs.defaultFS: [hdfs://hudi-write-20210309-110000-m],
Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml,
mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml,
hdfs-site.xml, __spark_hadoop_conf__.xml,
file:/etc/hive/conf.dist/hive-site.xml], FileSystem:
[com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem@185853f5]
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableConfig:
Loading table properties from
gs://my_bucket/my_table/.hoodie/metadata/.hoodie/hoodie.properties
21/03/10 21:28:10 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE)
from gs://my_bucket/my_table/.hoodie/metadata
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants
[[20210309121838__deltacommit__COMPLETED],
[20210310182547__deltacommit__COMPLETED],
[20210310184634__deltacommit__COMPLETED],
[20210310190731__deltacommit__COMPLETED],
[20210310202059__deltacommit__COMPLETED],
[20210310204636__deltacommit__COMPLETED],
[20210310210716__deltacommit__COMPLETED],
[20210310210814__deltacommit__COMPLETED]]
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Took 1 ms to
read 0 instants, 0 replaced file groups
21/03/10 21:28:10 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Building file
system view for partition (files)
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (files) =9, Time taken =39
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.HoodieTableFileSystemView: Adding file-groups
for partition :files, #FileGroups=1
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: addFilesToView:
NumFiles=9, NumFileGroups=1, FileGroupsCreationTime=2, StoreTimeTaken=1
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Time to load
partition (files) =77
21/03/10 21:28:10 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='files',
fileId='0b674bb4-097f-41b7-a3ea-82872cf6056d-0'},
baseCommitTime=20210309121838,
baseFile='HoodieBaseFile{fullPath=gs://my_bucket/my_table/.hoodie/metadata/files/0b674bb4-097f-41b7-a3ea-82872cf6056d-0_0-10-463_20210309121838.hfile,
...
...
21/03/10 21:28:35 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Opened metadata base file from
gs://my_bucket/my_table/.hoodie/metadata/files/0b674bb4-097f-41b7-a3ea-82872cf6056d-0_0-10-463_20210309121838.hfile
at instant 20210309121838
21/03/10 21:28:35 INFO org.apache.hudi.common.table.HoodieTableMetaClient:
Loading HoodieTableMetaClient from gs://my_bucket/my_table/.hoodie/metadata
...
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of log
files scanned => 7
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: MaxMemoryInBytes
allowed for compaction => 0
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of
entries in MemoryBasedMap in ExternalSpillableMap => 192
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Total size in
bytes of MemoryBasedMap in ExternalSpillableMap => 1493000
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Number of
entries in DiskBasedMap in ExternalSpillableMap => 0
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner: Size of file
spilled to disk => 0
21/03/10 21:28:37 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Opened metadata log files from
[gs://my_bucket/my_table/.hoodie/metadata/files/.0b674bb4-097f-41b7-a3ea-82872cf6056d-0_20210309121838.log.1_0-49-32789,
...] at instant 20210310210814(dataset instant=20210310210814, metadata
instant=20210310210814)
....
21/03/10 21:28:37 INFO org.apache.hudi.metadata.HoodieBackedTableMetadata:
Metadata read for key 2018-02-01 took [open, baseFileRead, logMerge] [1413, 72,
13] ms
21/03/10 21:28:37 INFO org.apache.hudi.metadata.BaseTableMetadata: Listed
file in partition from metadata: partition=2018-12, #files=119
21/03/10 21:28:37 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: #files found in
partition (2018-12) =119, Time taken =1505
21/03/10 21:28:37 INFO org.apache.hudi.metadata.BaseTableMetadata: Listed
file in partition from metadata: partition=2021-09, #files=165
...
21/03/10 21:32:25 INFO
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor: Workload
profile :WorkloadProfile {globalStat=WorkloadStat {numInserts=3948199,
numUpdates=2751299}, partitionStat={...
21/03/10 21:32:25 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Checking for file
exists ?gs://my_bucket/my_table/.hoodie/20210310212717.deltacommit.requested
21/03/10 21:32:26 INFO
org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Create new file for
toInstant ?gs://my_bucket/my_table/.hoodie/20210310212717.deltacommit.inflight
21/03/10 21:32:26 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: AvgRecordSize => 899
...
21/03/10 21:32:28 INFO org.apache.hudi.common.util.ClusteringUtils: Found 0
files in pending clustering operations
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: For partitionPath :
2018-02-01 Small Files => [SmallFile {location=HoodieRecordLocation
{instantTime=20210309070226, fileId=7c8774cf-f257-472e-9c8b-8b6d76410e0b-0},
sizeBytes=464424}]
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Assigning 1 inserts to
new update bucket 2852
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Total insert buckets for
partition path 2012-01 => [(InsertBucket {bucketNumber=2852, weight=1.0},1.0)]
(repeats several times)
...
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.UpsertPartitioner: Total Buckets :2853,
buckets info => {0=BucketInfo {bucketType=UPDATE,
fileIdPrefix=b5d67fc9-5e56-4a86-bacd-e9d319e41131-0, partitionPath=2020-01-1},
...
Partition to insert buckets => {2018-03-01=[(InsertBucket
{bucketNumber=2755, weight=1.0},1.0)], ...
UpdateLocations mapped to buckets
=>{2de711c6-a290-42bd-b0d9-6ffb5561ee20-0=2611,
0e5f615a-3d52-4061-8203-5fb604cf6d28-1=2824, ...
21/03/10 21:32:28 INFO
org.apache.hudi.table.action.commit.BaseCommitActionExecutor: Auto commit
disabled for 20210310212717
21/03/10 21:32:34 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='2018-02-01',
fileId='f8f1bdf0-697c-4694-ad75-96c7c8a7e930-0'},
baseCommitTime=20210309100248, ...
21/03/10 21:32:34 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=3, Refresh=0, handle=2, Check=1], Success=true, Query=...
21/03/10 21:32:34 INFO
org.apache.hudi.common.table.view.AbstractTableFileSystemView: Pending
Compaction instant for (FileSlice
{fileGroupId=HoodieFileGroupId{partitionPath='2018-02-02',
fileId='03f5ae32-46b9-4291-9b7c-5f918d18c9b2-0'},
baseCommitTime=20210309100248, ...
(repeats hundreds of times)
...
21/03/10 21:33:08 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=7, Refresh=0, handle=6, Check=1], Success=true, Query=...
21/03/10 21:33:08 INFO org.apache.hudi.timeline.service.RequestHandler:
TimeTakenMillis[Total=4, Refresh=1, handle=3, Check=0], Success=true, Query=...
21/03/10 21:33:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=29, evicted=0, evictedPerRun=0.0
21/03/10 21:38:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=59, evicted=0, evictedPerRun=0.0
21/03/10 21:43:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=89, evicted=0, evictedPerRun=0.0
21/03/10 21:48:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=119, evicted=0, evictedPerRun=0.0
21/03/10 21:53:35 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache:
totalSize=1.21 MB, freeSize=1.15 GB, max=1.16 GB, blockCount=0, accesses=114,
hits=0, hitRatio=0, cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0,evictions=149, evicted=0, evictedPerRun=0.0
(repeats every 5 minutes until the Spark job times out)
```
Unlike other Hudi writes that succeeded, this hung job does **not** produce a
line like
`INFO org.apache.hudi.HoodieSparkSqlWriter$: No errors. Proceeding to commit
the write.`