tommss opened a new issue, #6038:
URL: https://github.com/apache/hudi/issues/6038

   I am doing a PoC of Hudi, and I noticed that when using 
HoodieJavaWriteClient.java, inserts into a MOR table take longer than inserts 
into a COW table.
   But when using DataFrameReader, MOR is faster than COW for the same dataset.
   In both cases I am doing an 'insert' operation.
   With the Java client I also notice a gradual decrease in throughput over 
time. There are 7 million rows in total, and I am writing them in batches of 
200k rows.
   The code is deployed on an Azure Databricks cluster, and the worker nodes 
execute the code snippet below.
   Code snippet using the Java client:
   
   ```java
   // Note: avroSchema (the Avro Schema), tableType (e.g. MERGE_ON_READ or
   // COPY_ON_WRITE), and the Avro GenericRecords are produced upstream.
   List<HoodieRecord<HoodieAvroPayload>> records = new ArrayList<>();
   for (GenericRecord rec : batch) {
     HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), "partitionPath");
     HoodieAvroPayload payload = new HoodieAvroPayload(Option.of(rec));
     records.add(new HoodieAvroRecord<>(key, payload));
   }

   String tableName = "tableName";
   String tablePath = "abfss://[email protected]/" + tableName;
   Configuration hadoopConf = new Configuration();
   hadoopConf.set("fs.azure.account.key", "xxx");
   Path path = new Path(tablePath);
   FileSystem fs = FSUtils.getFs(tablePath, hadoopConf);
   if (!fs.exists(path)) {
     HoodieTableMetaClient.withPropertyBuilder()
         .setTableType(tableType)
         .setTableName(tableName)
         .setPayloadClassName(HoodieAvroPayload.class.getName())
         .initTable(hadoopConf, tablePath);
   }

   HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
       .withPath(tablePath)
       .withSchema(avroSchema.toString())
       .withParallelism(2, 2)
       .withDeleteParallelism(2)
       .forTable(tableName)
       .withIndexConfig(HoodieIndexConfig.newBuilder()
           .withIndexType(HoodieIndex.IndexType.INMEMORY).build())
       .withCompactionConfig(HoodieCompactionConfig.newBuilder()
           .archiveCommitsWith(20, 30).build())
       .build();
   HoodieJavaWriteClient<HoodieAvroPayload> client =
       new HoodieJavaWriteClient<>(new HoodieJavaEngineContext(hadoopConf), cfg);

   String commitTime = client.startCommit();
   client.insert(records, commitTime);
   client.close();
   ```
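   For context on how the 7 million rows are fed to the snippet above: each 
200k-row chunk goes through the build-records/startCommit/insert cycle as its 
own batch. A minimal, Hudi-free sketch of just the batching step (scaled down, 
and with `partition` as a hypothetical helper that is not part of the code 
above):

   ```java
   import java.util.ArrayList;
   import java.util.List;

   public class Batcher {
     // Splits a record list into consecutive fixed-size batches
     // (the last batch may be smaller).
     public static <T> List<List<T>> partition(List<T> records, int batchSize) {
       List<List<T>> batches = new ArrayList<>();
       for (int i = 0; i < records.size(); i += batchSize) {
         batches.add(records.subList(i, Math.min(i + batchSize, records.size())));
       }
       return batches;
     }

     public static void main(String[] args) {
       // Scaled-down stand-in for 7M rows in 200k-row batches: 7000/200 = 35
       List<Integer> rows = new ArrayList<>();
       for (int i = 0; i < 7_000; i++) rows.add(i);
       List<List<Integer>> batches = partition(rows, 200);
       System.out.println(batches.size()); // prints 35
     }
   }
   ```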
   
-------------------------------------------------------------------------------------
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.2
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : Azure storage account - StorageV2 (general 
purpose v2)
   
   * Running on Docker? (yes/no) : no
   
   
   
   
   

