Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5412 Scan returns wrong partition-column values when 
scanning multiple partitions pointing to the same filesystem location.
......................................................................


Patch Set 1:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 90:           context->partition_descriptor()->id(),stream_->filename()));
nit:space after ,


http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

PS1, Line 266: string
Do we need to call the string constructor explicitly? Seems a bit weird.


PS1, Line 634: name) {
Nit: long line > 90 chars


Line 648: void* HdfsScanNodeBase::GetFileMetadata(
Nit: I think this line fits in 90 characters.


http://gerrit.cloudera.org:8080/#/c/7625/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

Line 197:   /// Allocate a new scan range object, stored in the runtime state's 
object pool. For
This comment needs updating since partition_id is now required.


Line 360:   struct pair_hash {
We have some utilities like this already in util/container-util.h, let's move 
it there.


Line 363:         return std::hash<T2>{}(p.second);
Let's hash both values in the pair and combine them with boost::hash_combine. 
The current approach probably works ok for this use case but it seems better to 
implement a general-purpose pair hash that can be reused, given it's not much 
more work.


http://gerrit.cloudera.org:8080/#/c/7625/1/tests/metadata/test_partition_metadata.py
File tests/metadata/test_partition_metadata.py:

Line 43:     """Regression test for IMPALA-597. Verifies Impala is able to 
properly read
Can we run this test for Parquet too?

TestInsertQueries has text and parquet in its test matrix and uses the "STORED 
AS" clause too, so we can copy its approach.


Line 71:     data = self.execute_scalar("select sum(i), sum(j) from %s" % 
FQ_TBL_NAME)
Can we run this query with num_nodes=1 to verify that the same bug doesn't 
exist with text file?


Line 80:   def test_multiple_partitions_same_location_avro(self, vector, 
unique_database):
Test looks good, I think we should just make sure we address the full test 
coverage gap for the other file formats too.

Let's run this test for the other affected formats with an unsupported writers 
- SequenceFile.

If we change the test matrix for this class as I suggest above, we probably 
want to pull out this test into a separate class with a different test matrix.


PS1, Line 81: imapala
Impala


Line 110:     # (note, that shortcoming on avro writer would result the 
inserted value as NULL,
How about we only query the second column to avoid this complication?


-- 
To view, visit http://gerrit.cloudera.org:8080/7625
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie74b305377248045c0d87b911943e1cabb7223e9
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Laszlo Gaal <laszlo.g...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to