prashantwason opened a new pull request #2197:
URL: https://github.com/apache/hudi/pull/2197


   ## What is the purpose of the pull request
   
   Please see HUDI-1351 for description of the issues that are being fixed here.
   
   ## Brief change log
   
   1. Added the --clean-input and --clean-output parameters to clean the input 
and output directories before starting the job
   2. Added the --delete-old-input parameter to delete older batches of data 
that have already been ingested. This keeps the number of redundant files low.
   3. Added the --input-parallelism parameter to restrict the parallelism when 
generating input data. This keeps the number of generated input files low.
   4. Added a start_offset option to DAG nodes. Without the ability to specify 
start offsets, data is generated into existing partitions. With a start 
offset, the DAG can control which partition the data is written to.
   5. Fixed record generation so that the correct number of partitions is 
created.
     - In the existing implementation, the partition is chosen as a random 
long. This does not guarantee that exactly the requested number of partitions 
is created.
   6. Changed the variable blacklistedFields to a Set, as that is faster than 
a List for membership checks.
   7. Fixed integer division inside Math.ceil. When two integers are divided, 
the result is not a double unless one of them is cast to double.
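
To illustrate items 5 and 7 above, here is a minimal, self-contained sketch. It is not the code from this pull request: the class name `ChangeLogSketch`, the method names, and the round-robin assignment are all hypothetical. Treating the start offset as an index added to the partition number is likewise an assumption made only for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from the Hudi code base.
final class ChangeLogSketch {

    // Item 7, buggy form: records / perBatch is integer division, so the
    // fraction is already discarded before Math.ceil sees the value.
    static int batchesTruncated(int records, int perBatch) {
        return (int) Math.ceil(records / perBatch);
    }

    // Item 7, fixed form: casting one operand to double forces
    // floating-point division, so Math.ceil can round up.
    static int batchesFixed(int records, int perBatch) {
        return (int) Math.ceil((double) records / perBatch);
    }

    // Item 5, one possible approach (an assumption, not the PR's method):
    // assign partition indices round-robin so that exactly numPartitions
    // distinct partitions are used whenever numRecords >= numPartitions.
    // startOffset shifts the indices so new data lands in different partitions.
    static List<Integer> assignPartitions(int numRecords, int numPartitions, int startOffset) {
        List<Integer> partitions = new ArrayList<>(numRecords);
        for (int i = 0; i < numRecords; i++) {
            partitions.add(startOffset + (i % numPartitions));
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println(batchesTruncated(10, 3)); // prints 3
        System.out.println(batchesFixed(10, 3));     // prints 4
        Set<Integer> distinct = new HashSet<>(assignPartitions(100, 4, 2));
        System.out.println(distinct.size());         // prints 4
    }
}
```

With a random long per record, the number of distinct partitions depends on chance. A deterministic assignment such as the one sketched here makes the count exact.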
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



