Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/8747 )
Change subject: IMPALA-5993: Fix the file offset in value parsing error ...................................................................... Patch Set 1: (12 comments) I looked through the change. I can't say that I completely understand the design of the C++ part, so my comments for that part are simple local ones. http://gerrit.cloudera.org:8080/#/c/8747/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8747/7//COMMIT_MSG@11 PS7, Line 11: hit an error, it always prints the end of the file as the nit: hits http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@365 PS7, Line 365: typo: calculated http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@373 PS7, Line 373: ers to ex typo: difference http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@373 PS7, Line 373: aths in Parquet scanner. Returns the status Do you mean "taking read_bytes as argument instead of calculating it"? http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.h@375 PS7, Line 375: Status ScannerDebugAction() WARN_UNUSED_RESULT { nit: remove extra space http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc@497 PS7, Line 497: // In this branch, we've just materialized slots not referenced by any conjunct. Checking LogHasSpace() is not necessary here - LogError() checks if there is space for the log anyway, so the check is only useful to avoid the runtime cost of creating the error message if it would be dropped anyway. This is important if the logic is called a lot of time, for example for millions of rows. This part will be called only once (per fragment), as the function returns on the first error. http://gerrit.cloudera.org:8080/#/c/8747/7/be/src/exec/hdfs-scanner.cc@675 PS7, Line 675: stream_->filename(), line, pos, offset); : state_->LogError(ErrorMsg(TErrorCode::GENERAL, s)); : } The Substitute should be moved inside the if block - as I mentioned in line 497, the reason to call LogHasSpace() is to avoid the runtime cost of creating the error message. http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@847 PS7, Line 847: : : : : : : : : Can you explain why this function is needed? Other scanner test classes do not seem to do similar initialization. http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@875 PS7, Line 875: This step could be skipped, if the the table was created first (which creates an HDFS directory) and the data was copied afterwards. http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@890 PS7, Line 890: Why not use the default location ( /test-warehouse/$db_name.db/$table_name )? This line could be removed in that case. Note that assuming that the databases are in /test-warehouse/ makes the tests less flexible, but other tests already rely on that, and the tests are expected to run on the Impala minicluster - many tests would fail on a different cluster, for example planner tests are sensitive to many cluster parameters. So I think that it is not a goal at the moment to make the tests runnable on other clusters. If it will be a goal, then a general solution should be created (probably in ImpalaTestSuite), not custom solutions for different tests. http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@893 PS7, Line 893: : nit: indentation and spaces around '=' In Impala, new blocks are indented with 2 spaces, while continuations of broken lines are intended with 4 spaces (most of the times...). This is a difference compared to PEP8, but makes the python code more consistent with C++ and Java code in Impala. See the python part of https://cwiki.apache.org/confluence/display/IMPALA/Impala+Style+Guide for the list differences from PEP8. http://gerrit.cloudera.org:8080/#/c/8747/7/tests/query_test/test_scanners.py@899 PS7, Line 899: I would prefer to use self.execute_query(). The result it returns has a .log member, which can be split by "\n" to get the list of warnings. Another way to do this is to create a .test file in testdata/workloads, list the warnings in the "---- ERRORS" section and run it with self.run_test_case(). See https://github.com/apache/impala/blob/89b29d3a89cca041c541476e25435dc0d34344c1/tests/common/test_result_verifier.py#L314 for the way it processes the logs. -- To view, visit http://gerrit.cloudera.org:8080/8747 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4a4ac3199ffe12fcce0bf792b3e6ce529b9af61f Gerrit-Change-Number: 8747 Gerrit-PatchSet: 1 Gerrit-Owner: Kim Jin Chul <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Kim Jin Chul <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Comment-Date: Tue, 03 Apr 2018 15:28:43 +0000 Gerrit-HasComments: Yes
