Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11520 )

Change subject: IMPALA-7543: Enhance scan ranges to support sub-ranges
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/hdfs-file-reader.cc@155
PS3, Line 155:         *eof = true;
> I renamed eosr to eof.
I'm thinking that eof is the right way to go. The FileReader is a new 
abstraction, but I think it is right to avoid having it track end of scan 
range. FileReader doesn't really have a reason to know that.


http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/request-ranges.h
File be/src/runtime/io/request-ranges.h:

http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/request-ranges.h@251
PS4, Line 251:   /// Add sub-ranges to this ScanRange. If sub_ranges is not 
empty, then ScanRange won't
             :   /// read everything from its range, but will only read these 
sub-ranges.
             :   /// Sub-ranges need to be ordered by 'offset' and cannot 
overlap with each other.
             :   /// Doesn't need to merge continuous sub-ranges in advance, 
this method will do.
             :   void AddSubRanges(std::vector<SubRange>&& sub_ranges);
Will we know the SubRanges when we call Reset()? 
(BaseScalarColumnReader::Reset()->HdfsScanNodeBase::AllocateScanRange()) I'm 
wondering whether it makes sense for Reset() to take in the SubRanges as an 
argument rather than having an AddSubRanges() function. I like Reset() behaving 
somewhat like a constructor and taking everything in one go.


http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/11520/3/be/src/runtime/io/scan-range.cc@268
PS3, Line 268: "There should be no partial reads for sub-ranges."
> My idea is to only issue such read operations (that go beyond the end of th
I think it makes sense for us to view end-of-file or partial read as errors, 
but I also see it as something that is beyond our control. We don't have 
complete control on what the file is going to look like. It can be overwritten 
or modified in place. So, I would rather return an error Status (maybe 
SCANNER_INCOMPLETE_READ?) than have a DCHECK here.


http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/11520/4/be/src/runtime/io/scan-range.cc@236
PS4, Line 236: buffer_desc->eosr_ |= bytes_read_ == bytes_to_read_;
One thing we should try to do is make it clear what this calculation is doing 
(and why we are passing *eosr in as the eof arg). I'll think on this some more, 
but at the least, a comment explaining that eosr can be because we hit eof or 
because there are no more bytes to read (or for the subranges case, there are 
no more subranges to read).



--
To view, visit http://gerrit.cloudera.org:8080/11520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iea26ba386713990f7671aab5a372cf449b8d51e4
Gerrit-Change-Number: 11520
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 10 Oct 2018 05:18:27 +0000
Gerrit-HasComments: Yes

Reply via email to