Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/11520 )
Change subject: IMPALA-7543: Enhance scan ranges to support sub-ranges ...................................................................... Patch Set 4: (4 comments) http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/hdfs-file-reader.cc File be/src/runtime/io/hdfs-file-reader.cc: http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/hdfs-file-reader.cc@155 PS3, Line 155: *eof = true; > I renamed eosr to eof. I'm thinking that eof is the right way to go. The FileReader is a new abstraction, but I think it is right to avoid having it track end of scan range. FileReader doesn't really have a reason to know that. http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/request-ranges.h File be/src/runtime/io/request-ranges.h: http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/request-ranges.h@251 PS4, Line 251: /// Add sub-ranges to this ScanRange. If sub_ranges is not empty, then ScanRange won't : /// read everything from its range, but will only read these sub-ranges. : /// Sub-ranges need to be ordered by 'offset' and cannot overlap with each other. : /// Doesn't need to merge continuous sub-ranges in advance, this method will do. : void AddSubRanges(std::vector<SubRange>&& sub_ranges); Will we know the SubRanges when we call Reset()? (BaseScalarColumnReader::Reset()->HdfsScanNodeBase::AllocateScanRange()) I'm wondering whether it makes sense for Reset() to take in the SubRanges as an argument rather than having an AddSubRanges() function. I like Reset() behaving somewhat like a constructor and taking everything in one go. http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/scan-range.cc@268 PS3, Line 268: "There should be no partial reads for sub-ranges." > My idea is to only issue such read operations (that go beyond the end of th I think it makes sense for us to view end-of-file or partial read as errors, but I also see it as something that is beyond our control. We don't have complete control on what the file is going to look like. It can be overwritten or modified in place. So, I would rather return an error Status (maybe SCANNER_INCOMPLETE_READ?) than have a DCHECK here. http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/scan-range.cc File be/src/runtime/io/scan-range.cc: http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/scan-range.cc@236 PS4, Line 236: buffer_desc->eosr_ |= bytes_read_ == bytes_to_read_; One thing we should try to do is make it clear what this calculation is doing (and why we are passing *eosr in as the eof arg). I'll think on this some more, but at the least, a comment explaining that eosr can be because we hit eof or because there are no more bytes to read (or for the subranges case, there are no more subranges to read). -- To view, visit http://gerrit.cloudera.org:8080/11520 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iea26ba386713990f7671aab5a372cf449b8d51e4 Gerrit-Change-Number: 11520 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 10 Oct 2018 05:18:27 +0000 Gerrit-HasComments: Yes
