Tim Armstrong has posted comments on this change. Change subject: IMPALA-5412 Scan returns wrong partition-column values when scanning multiple partitions pointing to the same filesystem location. ......................................................................
Patch Set 1: (12 comments) http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/base-sequence-scanner.cc File be/src/exec/base-sequence-scanner.cc: Line 90: context->partition_descriptor()->id(),stream_->filename())); nit:space after , http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: PS1, Line 266: string Do we need to call the string constructor explicitly? Seems a bit weird. PS1, Line 634: name) { Nit: long line > 90 chars Line 648: void* HdfsScanNodeBase::GetFileMetadata( Nit: I think this line fits in 90 characters. http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: Line 197: /// Allocate a new scan range object, stored in the runtime state's object pool. For This comment needs updating since partition_id is now required. Line 360: struct pair_hash { We have some utilities like this already in util/container-util.h, let's move it there. Line 363: return std::hash<T2>{}(p.second); Let's hash both values in the pair and combine them with boost::hash_combine. The current approach probably works ok for this use case but it seems better to implement a general-purpose pair hash that can be reused, given it's not much more work. http://gerrit.cloudera.org:8080/#/c/7625/1/tests/metadata/test_partition_metadata.py File tests/metadata/test_partition_metadata.py: Line 43: """Regression test for IMPALA-597. Verifies Impala is able to properly read Can we run this test for Parquet too? TestInsertQueries has text and parquet in its test matrix and uses the "STORED AS" clause too, so we can copy its approach. Line 71: data = self.execute_scalar("select sum(i), sum(j) from %s" % FQ_TBL_NAME) Can we run this query with num_nodes=1 to verify that the same bug doesn't exist with text file? Line 80: def test_multiple_partitions_same_location_avro(self, vector, unique_database): Test looks good, I think we should just make sure we address the full test coverage gap for the other file formats too. Let's run this test for the other affected formats with an unsupported writers - SequenceFile. If we change the test matrix for this class as I suggest above, we probably want to pull out this test into a separate class with a different test matrix. PS1, Line 81: imapala Impala Line 110: # (note, that shortcoming on avro writer would result the inserted value as NULL, How about we only query the second column to avoid this complication? -- To view, visit http://gerrit.cloudera.org:8080/7625 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ie74b305377248045c0d87b911943e1cabb7223e9 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Attila Jeges <[email protected]> Gerrit-Reviewer: Laszlo Gaal <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
