Skye Wanderman-Milne has posted comments on this change. Change subject: IMPALA-1578: fix text scanner to handle "\r\n" delimiters split across blocks ......................................................................
Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/2803/4/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 256: with > on the '\r' of a "\r\n" delimiter Done Line 604: if (!eosr) continue; > this would be easier to read if it were: I moved these cases to the while condition instead, I think this is most clear but lemme know if you prefer something else. Line 615: DCHECK(*tuple_found); > can't we get here with !*tuple_found if eosr is true? a missing test case? I think we can only get here without tuple_found if FillByteBuffer() retrieves 0 bytes, which would mean the scan range has 0 bytes (otherwise we would break on eosr before calling FillByteBuffer() past the end of the scan range). This might be possible with bad metadata, but I don't think so. I just tried manually creating a one-file text table, then deleting the file, and setting the scan range size to small enough to get two scan ranges. Querying results in "Error seeking to 300 in file" error. We do have testing for deleting a parquet file (test_stale_metadata.py), although not this specific case. I can add "byte_buffer_read_size_ > 0" to the check to be safe, but I think better to hit the DCHECK and understand how we hit this case then to mask it. http://gerrit.cloudera.org:8080/#/c/2803/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: Line 446: def test_text_split_across_buffers_delimiter(self, vector, unique_database): > add brief comment explaining this test case Done. Also added coverage for a split-across-buffers delimiter in the main algorithm (right now it only tests the FindFirstTuple logic) Line 470: check_call(['hadoop', 'fs', '-copyFromLocal', f.name, location]) > does this overwrite if the file already exists? No. We're using unique_database so it shouldn't already exist. -- To view, visit http://gerrit.cloudera.org:8080/2803 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id42b441674bb21517ad2788b99942a4b5dc55420 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
