nsivabalan edited a comment on pull request #2168:
URL: https://github.com/apache/hudi/pull/2168#issuecomment-736665788
@n3nash : the patch is ready for review. I would appreciate it if you could take
some time to close this out.
And a question on DFSHoodieDatasetInputReader: I see that when we try to read
fewer records than the total records in one file slice, the reader returns the
entire contents of that file slice. Is that intentional or is it a bug?
For example, if I insert 1500 records
and the next node is an upsert of 200, this will actually update all 1500 records
and not 200 (assuming all 1500 are in one file slice).
I ran into this issue while testing out deletes with this patch. I thought
there was some issue with deletes, since I saw all records were getting deleted
;) and after investigation, found it's the reader that's doing this.
```java
if (!numFiles.isPresent() || numFiles.get() == 0) {
  // If num files are not passed, find the number of files to update
  // based on total records to update and records per file
  numFilesToUpdate = (int) Math.ceil((double) numRecordsToUpdate.get() / recordsInSingleFile);
  // recordsInSingleFile is not average so we still need to account for bias in records distribution
  // in the files. Limit to the maximum number of files available.
  int totalExistingFilesCount = partitionToFileIdCountMap.values().stream().reduce((a, b) -> a + b).get();
  numFilesToUpdate = Math.min(numFilesToUpdate, totalExistingFilesCount);
  log.warn("aaa Files to update {}, numRecords toUpdate {}, records in Single file {} ",
      numFilesToUpdate, numRecordsToUpdate, recordsInSingleFile);
  numRecordsToUpdatePerFile = recordsInSingleFile;
}
```
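To make the behavior concrete, here is a minimal standalone sketch of the arithmetic in the snippet above, plugging in the 1500-insert / 200-upsert numbers from my test (variable names mirror the snippet; the 1500/200 values are just my example, not anything hardcoded in the reader):

```java
public class FileSliceUpdateMath {
  public static void main(String[] args) {
    int numRecordsToUpdate = 200;   // records requested by the upsert node
    int recordsInSingleFile = 1500; // all inserts landed in one file slice

    // Mirrors the snippet: ceil(200 / 1500) = 1 file selected for update
    int numFilesToUpdate = (int) Math.ceil((double) numRecordsToUpdate / recordsInSingleFile);

    // But records-per-file is set to the full slice size, not the requested count
    int numRecordsToUpdatePerFile = recordsInSingleFile;

    // So the effective number of updated records is the whole slice
    int effectiveRecordsUpdated = numFilesToUpdate * numRecordsToUpdatePerFile;
    System.out.println(numFilesToUpdate);        // prints 1
    System.out.println(effectiveRecordsUpdated); // prints 1500, not 200
  }
}
```

This is why the delete test appeared to remove everything: once a file slice is picked, every record in it is returned, regardless of the requested count.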
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]