Skye Wanderman-Milne has posted comments on this change.

Change subject: IMPALA-1578: fix text scanner to handle "\r\n" delimiters split 
across blocks
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/2803/4/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 256: with
> on the '\r' of a "\r\n" delimiter
Done


Line 604:         if (!eosr) continue;
> this would be easier to read if it were:
I moved these cases to the while condition instead, I think this is most clear 
but lemme know if you prefer something else.


Line 615:       DCHECK(*tuple_found);
> can't we get here with !*tuple_found if eosr is true?  a missing test case?
I think we can only get here without tuple_found if FillByteBuffer() retrieves 
0 bytes, which would mean the scan range has 0 bytes (otherwise we would break 
on eosr before calling FillByteBuffer() past the end of the scan range).

This might be possible with bad metadata, but I don't think so. I just tried 
manually creating a one-file text table, then deleting the file, and setting 
the scan range size to small enough to get two scan ranges. Querying results in 
"Error seeking to 300 in file" error. We do have testing for deleting a parquet 
file (test_stale_metadata.py), although not this specific case.

I can add "byte_buffer_read_size_ > 0" to the check to be safe, but I think 
better to hit the DCHECK and understand how we hit this case then to mask it.


http://gerrit.cloudera.org:8080/#/c/2803/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 446:   def test_text_split_across_buffers_delimiter(self, vector, 
unique_database):
> add brief comment explaining this test case
Done. Also added coverage for a split-across-buffers delimiter in the main 
algorithm (right now it only tests the FindFirstTuple logic)


Line 470:       check_call(['hadoop', 'fs', '-copyFromLocal', f.name, location])
> does this overwrite if the file already exists?
No. We're using unique_database so it shouldn't already exist.


-- 
To view, visit http://gerrit.cloudera.org:8080/2803
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id42b441674bb21517ad2788b99942a4b5dc55420
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to