lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619306738


   ## Spark log analysis brainstorming 
   
   Hi @vinothchandar, here is a preliminary analysis of this issue
   (based on the `release-0.5.0` branch: 
https://github.com/apache/incubator-hudi/tree/release-0.5.0)
   
   ### Simplified original Spark log
   
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: 
"spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: 
"spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/22 20:12:14 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object '<JARPATH>' in bucket '<storageBucket>'. Returning 
object without decryption.
   20/04/22 20:12:14 WARN HiveConf: HiveConf of name hive.server2.thrift.url 
does not exist
   20/04/22 20:12:14 INFO SparkContext: Running Spark version 2.4.4
   20/04/22 20:12:14 INFO SparkContext: Submitted application: HUDI incremental 
data loading
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing modify acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls groups to: 
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening 
'<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object '<storgaeLocation>/.hoodie/hoodie.properties' in bucket 
'<storageBucket>'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline 
for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants 
java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : 
[2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 
2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 
2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 
2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 
2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 
2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 
2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21,
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 
20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 
successful!
   
   
   // CreateRelation(sqlContext, parameters, df.schema)
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data 
source with options ...
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from <storgaeLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], 
FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:38 INFO S3NativeFileSystem: Opening 
'<storageLocation>2001/05/10/.hoodie_partition_metadata' for reading
   20/04/22 20:24:38 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object 
'<storgaeLocation>/2001/05/10/.hoodie_partition_metadata' in bucket 
'<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from <storgaeLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], 
FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening 
'<storageLocation>2001/05/17/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object 
'<storgaeLocation>/2001/05/17/.hoodie_partition_metadata' in bucket 
'<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:39 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from <storgaeLocation>
   20/04/22 20:24:39 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], 
FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening 
'<storageLocation>2001/05/18/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption 
information for object 
'<storgaeLocation>/2001/05/18/.hoodie_partition_metadata' in bucket 
'<storageBucket>'. Returning object without decryption.
   
   // Finish part
   20/04/22 21:06:25 INFO SparkContext: Successfully stopped SparkContext
   20/04/22 21:06:25 INFO ShutdownHookManager: Shutdown hook called
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-c24e90ad-7be5-4a04-b4f1-2575cf68bd5a
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory 
/mnt/tmp/spark-c18b2f4f-e525-43de-b896-75244fe63591
   ```
   
   #### Analysis
   
   
![image](https://user-images.githubusercontent.com/20113411/80269021-c7225880-86de-11ea-84fe-464d08fc4e9b.png)
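
To see where the wall-clock time goes in a log like the one above, one option is to compute the gaps between consecutive timestamped lines and look at the largest ones. This is only a rough sketch: `largest_gaps` and `SAMPLE` are names made up for illustration, the sample lines are shortened copies from the log above, and because that log is simplified, a gap only bounds the time spent between two surviving lines.

```python
import re
from datetime import datetime

# Spark/log4j lines in the log above start with "yy/MM/dd HH:mm:ss".
TS_PATTERN = re.compile(r"^\s*(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def largest_gaps(lines, top=3):
    """Return the `top` largest gaps as (seconds, line_before, line_after)."""
    stamped = []
    for line in lines:
        match = TS_PATTERN.match(line)
        if match:  # wrapped continuation lines carry no timestamp and are skipped
            ts = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            stamped.append((ts, line.strip()))
    gaps = [
        (int((b[0] - a[0]).total_seconds()), a[1], b[1])
        for a, b in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Shortened copies of lines from the log above.
SAMPLE = [
    "20/04/22 20:12:15 INFO SecurityManager: Changing modify acls to: hadoop",
    "20/04/22 20:15:31 INFO S3NativeFileSystem: Opening",
    "20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants",
    "20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up :",
    "20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250",
    "20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption",
    "20/04/22 21:06:25 INFO SparkContext: Successfully stopped SparkContext",
]

for seconds, before, after in largest_gaps(SAMPLE):
    print(f"{seconds:>5}s  between '{before[:45]}' and '{after[:45]}'")
```

On these sample lines the three largest gaps are the stretch after the commit (about 41 minutes), the stretch before "Partitions to clean up" (about 9 minutes), and the startup phase (about 3 minutes).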
   
   
   ### Analysis about step 2
   1. Call stack trace
   ```
   "main" Id=1 cpuUsage=94% RUNNABLE
       ...
       at org.apache.hudi.common.util.FSUtils.processFiles(FSUtils.java:227)
       at 
org.apache.hudi.common.util.FSUtils.getAllFoldersWithPartitionMetaFile(FSUtils.java:191)
       at org.apache.hudi.common.util.FSUtils.getAllPartitionPaths(FSUtils.java)
       at 
org.apache.hudi.table.HoodieCopyOnWriteTable.clean(HoodieCopyOnWriteTable.java:285)
       at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:950)
   ```
   2. IMO, `Step 2` is affected by the NameNode (NN) RPC performance.
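
The method names in the stack trace (`getAllPartitionPaths` -> `getAllFoldersWithPartitionMetaFile` -> `processFiles`) suggest that the clean step finds partitions by walking the table's directory tree and looking for the partition metafile. The sketch below imitates that idea on a local temporary directory to show how the number of listing calls grows with the folder count. It is not Hudi's code: `find_partitions` and `build_demo_table` are hypothetical helpers, and only the `.hoodie_partition_metadata` file name comes from the log above.

```python
import os
import tempfile

# File name taken from the log above; everything else here is illustrative.
PARTITION_METAFILE = ".hoodie_partition_metadata"


def find_partitions(base_path):
    """Walk base_path; return (sorted partition paths, number of directory listings)."""
    partitions = []
    listings = 0
    for dirpath, _dirnames, filenames in os.walk(base_path):
        listings += 1  # os.walk lists every directory it visits exactly once
        if PARTITION_METAFILE in filenames:
            relative = os.path.relpath(dirpath, base_path)
            partitions.append(relative.replace(os.sep, "/"))
    return sorted(partitions), listings


def build_demo_table(root, partitions):
    """Create empty partition folders, each holding a partition metafile."""
    for partition in partitions:
        folder = os.path.join(root, *partition.split("/"))
        os.makedirs(folder, exist_ok=True)
        open(os.path.join(folder, PARTITION_METAFILE), "w").close()


with tempfile.TemporaryDirectory() as root:
    build_demo_table(root, ["2001/05/10", "2001/05/17", "2001/05/18"])
    found, listings = find_partitions(root)
    print(f"partitions={found} listings={listings}")
```

Here three partitions need six listings (root, `2001`, `2001/05`, and the three day folders). On a remote file system each listing is a round trip, so a table with many `yyyy/mm/dd` folders would pay that cost once per folder on the driver.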
   
   ### Analysis about step 4
   When I set the log level to `WARN`, the clean operation runs faster. Could 
the performance loss be due to the interaction between driver and executor?
   
   
![image](https://user-images.githubusercontent.com/20113411/80269122-75c69900-86df-11ea-9a34-b805b694f9e9.png)
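
To judge how much output the `INFO` level contributes, a quick check is to count log lines per level and per logger. The sketch below does that for two of the per-partition groups from the log above (lines shortened). `count_by_level_and_logger` and `SAMPLE_LINES` are hypothetical names, and since the original log elides lines with `...`, the real counts are likely higher.

```python
import re
from collections import Counter

# Matches "yy/MM/dd HH:mm:ss LEVEL LoggerName:" as seen in the log above.
LINE_PATTERN = re.compile(
    r"^\s*\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} (\w+) ([\w$]+):"
)


def count_by_level_and_logger(lines):
    """Return (Counter of levels, Counter of (level, logger) pairs)."""
    levels, loggers = Counter(), Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:  # wrapped continuation lines do not match and are skipped
            level, logger = match.groups()
            levels[level] += 1
            loggers[(level, logger)] += 1
    return levels, loggers


# Shortened copies of two per-partition groups from the log above.
SAMPLE_LINES = [
    "20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient",
    "20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...],",
    "20/04/22 20:24:38 INFO S3NativeFileSystem: Opening",
    "20/04/22 20:24:38 WARN S3CryptoModuleAE: Unable to detect encryption",
    "20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient",
    "20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...],",
    "20/04/22 20:24:39 INFO S3NativeFileSystem: Opening",
    "20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption",
]

levels, loggers = count_by_level_and_logger(SAMPLE_LINES)
for (level, logger), count in loggers.most_common():
    print(f"{level:<5} {logger:<24} {count}")
```

In this excerpt each partition produces three `INFO` lines and one `WARN` line. Raising the level to `WARN` would drop the three `INFO` lines but keep the `S3CryptoModuleAE` warning.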
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

