prashantwason commented on a change in pull request #2197:
URL: https://github.com/apache/hudi/pull/2197#discussion_r513059843



##########
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
##########
@@ -95,11 +110,19 @@ public DeltaGenerator(DeltaConfig deltaOutputConfig, JavaSparkContext jsc, Spark
     int numPartitions = operation.getNumInsertPartitions();
     long recordsPerPartition = operation.getNumRecordsInsert() / numPartitions;
     int minPayloadSize = operation.getRecordSize();
-    JavaRDD<GenericRecord> inputBatch = jsc.parallelize(Collections.EMPTY_LIST)
-        .repartition(numPartitions).mapPartitions(p -> {
+    int startPartition = operation.getStartPartition();

Review comment:
       Suppose you insert 5 partitions. Then the following 5 new 
LazyRecordGeneratorIterator instances will be created:
      new LazyRecordGeneratorIterator(..., 0)
      new LazyRecordGeneratorIterator(..., 1)
      new LazyRecordGeneratorIterator(..., 2)
      new LazyRecordGeneratorIterator(..., 3)
      new LazyRecordGeneratorIterator(..., 4)
   
   Within the LazyRecordGeneratorIterator code, the integer partition index 
(0, 1, ... above) is converted into a partition timestamp (a date offset from 
1970/01/01). So the first LazyRecordGeneratorIterator generates records for 
1970/01/01, the second LazyRecordGeneratorIterator generates records for 
1970/01/02, and so on.
   
   With this scheme, record generation always starts at offset 0. But what 
if you want to generate records for only a specific partition, or add a new 
partition? This is where the start_offset comes into play:
   
      new LazyRecordGeneratorIterator(..., 0 + start_offset)
      new LazyRecordGeneratorIterator(..., 1 + start_offset)
      new LazyRecordGeneratorIterator(..., 2 + start_offset)
      new LazyRecordGeneratorIterator(..., 3 + start_offset)
      new LazyRecordGeneratorIterator(..., 4 + start_offset)
   
   By using a start_offset you can control where the inserts take place, and 
new partitions can be created.
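   The index-to-date mapping above can be sketched in plain Java. This is an 
illustrative sketch, not the actual DeltaGenerator code: the helper name 
`partitionIndexToDate` and the use of `java.time.LocalDate` are assumptions; 
only the "index + start_offset = days since 1970/01/01" idea comes from the 
discussion above.

```java
import java.time.LocalDate;

public class PartitionOffsetSketch {
  // Hypothetical helper: maps a partition index (plus an optional
  // start offset) to a partition date, counted as days since 1970/01/01.
  static LocalDate partitionIndexToDate(int partitionIndex, int startOffset) {
    return LocalDate.ofEpochDay(partitionIndex + startOffset);
  }

  public static void main(String[] args) {
    // Without an offset, generation always starts at 1970/01/01.
    System.out.println(partitionIndexToDate(0, 0)); // 1970-01-01
    System.out.println(partitionIndexToDate(1, 0)); // 1970-01-02

    // With a start_offset of 100, the same indices target later
    // partitions, so inserts can land in a specific (or brand new) one.
    System.out.println(partitionIndexToDate(0, 100)); // 1970-04-11
    System.out.println(partitionIndexToDate(1, 100)); // 1970-04-12
  }
}
```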
   
   
   Spark retries can alter the partition numbers here. To guard against that, 
we can use a pre-formatted List of partition indices here.
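   One way to read that suggestion: instead of deriving the partition index 
from Spark's runtime partition number (which a task retry could change), build 
the list of indices up front and hand each generator an explicit, stable 
index. A minimal plain-Java sketch of that idea, with no Spark dependency; the 
helper name `partitionIndices` is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class PrecomputedPartitionIndices {
  // Hypothetical helper: precompute the partition indices so each generator
  // task receives a fixed index that is deterministic across Spark retries.
  static List<Integer> partitionIndices(int numPartitions, int startOffset) {
    List<Integer> indices = new ArrayList<>(numPartitions);
    for (int i = 0; i < numPartitions; i++) {
      indices.add(i + startOffset);
    }
    return indices;
  }

  public static void main(String[] args) {
    // e.g. jsc.parallelize(partitionIndices(5, 100), 5) would then give each
    // Spark partition one fixed index, rather than deriving it at run time.
    System.out.println(partitionIndices(5, 100)); // [100, 101, 102, 103, 104]
  }
}
```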




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
