nsivabalan edited a comment on pull request #2168:
URL: https://github.com/apache/hudi/pull/2168#issuecomment-736665788


   @n3nash : the patch is ready for review. I would appreciate it if you could 
take some time to close this out.
   
   Also, a question on DFSHoodieDatasetInputReader. I see that when we try to 
read fewer records than the total records in one file slice, the reader returns 
the entire contents of that file slice. Is that intentional, or is it a bug? 
   
   For example: if I insert 1500 records 
   and the next node is an upsert of 200, this will actually update all 1500 
records and not 200 (assuming all 1500 are in one file slice). 
   
   I ran into this issue while testing out deletes with this patch. I thought 
there was some issue with deletes, since I saw all records were getting deleted 
;) and after investigation, found it's the reader that's doing this. 
   
   ```
    if (!numFiles.isPresent() || numFiles.get() == 0) {
      // If num files are not passed, find the number of files to update based on
      // total records to update and records per file
      numFilesToUpdate = (int) Math.ceil((double) numRecordsToUpdate.get() / recordsInSingleFile);
      // recordsInSingleFile is not an average, so we still need to account for bias
      // in the records distribution across files. Limit to the maximum number of files available.
      int totalExistingFilesCount = partitionToFileIdCountMap.values().stream().reduce((a, b) -> a + b).get();
      numFilesToUpdate = Math.min(numFilesToUpdate, totalExistingFilesCount);
      log.warn("Files to update {}, numRecords toUpdate {}, records in single file {}", numFilesToUpdate, numRecordsToUpdate, recordsInSingleFile);
      numRecordsToUpdatePerFile = recordsInSingleFile;
    }
   ```
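
   To make the behavior concrete, here is a minimal arithmetic sketch (standalone, with hypothetical hard-coded values matching the 1500/200 scenario above, not the actual reader class) of why requesting 200 updates against a single 1500-record file slice ends up touching all 1500 records: `numFilesToUpdate` is rounded up to 1, and `numRecordsToUpdatePerFile` is set to the slice size rather than the requested count.

   ```java
   public class UpdateCountSketch {
     public static void main(String[] args) {
       long numRecordsToUpdate = 200;    // requested updates (hypothetical)
       long recordsInSingleFile = 1500;  // records in the only file slice
       int totalExistingFilesCount = 1;  // one file slice in the dataset

       // Round up: ceil(200 / 1500) = 1 file selected for update.
       int numFilesToUpdate = (int) Math.ceil((double) numRecordsToUpdate / recordsInSingleFile);
       numFilesToUpdate = Math.min(numFilesToUpdate, totalExistingFilesCount);

       // The reader then updates the whole slice, not the requested count.
       long numRecordsToUpdatePerFile = recordsInSingleFile;
       long effectiveUpdates = (long) numFilesToUpdate * numRecordsToUpdatePerFile;

       System.out.println(numFilesToUpdate);  // 1
       System.out.println(effectiveUpdates);  // 1500, not the requested 200
     }
   }
   ```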


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
