Dimitris Tsirogiannis has posted comments on this change.
Change subject: IMPALA-1670,IMPALA-4141: Support multiple partitions in ALTER
TABLE ADD PARTITION
Patch Set 14:
Very nice! Minor nits in the code, some comments in testing.
PS14, Line 1629: .
" or null if the table is not altered because all the partitions already exist
and IF NOT EXISTS is specified".
PS14, Line 1635: If all partitions exist in
: * catalog cache, return null.
PS14, Line 1714: It is assumed that all Partition objects in 'aList' and
'bList' belong to the same
: * table.
I don't think this function is making use of or enforcing this assumption
anywhere, so you may want to remove this sentence.
PS14, Line 1753: lists of partition values
nit: maybe it's better to say "maps partitions (identified by their partition
values) to their corresponding HDFS caching ops."
PS14, Line 220: AnalysisError("alter table functional.alltypes add " + cl +
: " partition(year=2050, month=10)" +
: " partition(year=2050, month=11)" +
: " partition(Month=10, YEAR=2050) location" +
: "Duplicate partition spec: (month=10, year=2050)");
I think this case simply enforces the uniqueness of partition specs (similar to
the one in L215), so I am not sure LOCATION adds anything more in terms of test
coverage here. Maybe just keep the previous case?
Line 246: }
Also you may want to add a test case where one of the partitions is cached in a
pool.
PS14, Line 172: def impala_partitions(self, table_name, *include_fields):
A few thoughts on this function:
1. The field positions seem to be hardcoded based on the current output of SHOW
PARTITIONS. Any change in the output will break the tests that rely on this
function.
2. "impala_partitions" doesn't really indicate what to expect from this
function. Maybe "get_impala_partitions" or "get_impala_partition_info" is more
descriptive.
3. I find it awkward that by default it returns the partition values and all
other fields are optional. Personally, I'd change it so that, if include_fields
is not specified, all fields are returned. Otherwise, return only the fields in
include_fields.
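To illustrate suggestion 3, a minimal sketch of what such a helper could look
like (the name get_impala_partition_info, tab-separated SHOW PARTITIONS output,
and a header row are assumptions on my part, not the actual harness API):

```python
def get_impala_partition_info(show_partitions_output, *include_fields):
    # Parse SHOW PARTITIONS output into rows keyed by the header line
    # instead of hardcoded field positions, so a change in the output
    # format does not silently break callers.
    lines = show_partitions_output.strip().split('\n')
    header = [f.strip() for f in lines[0].split('\t')]
    result = []
    for line in lines[1:]:
        row = dict(zip(header, (v.strip() for v in line.split('\t'))))
        if include_fields:
            # Return only the requested fields, in the requested order.
            result.append(tuple(row[f] for f in include_fields))
        else:
            # Default: return every field.
            result.append(tuple(row[f] for f in header))
    return result
```

This way a caller asking for specific columns keeps working even if new columns
are added to the SHOW PARTITIONS output.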
PS14, Line 174: Impala sees them
maybe: "returned by a SHOW PARTITIONS statement."
Line 269: Passes 'command' to 'engine' callable and makes sure that it
raises an exception.
Thanks for adding the comment. Maybe an example of an 'engine' callable (in
terms of an existing Python class that can be used) would also help future
readers understand how to use this.
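For example, something along these lines could go into the comment (the helper
name and the self.client.execute reference are illustrative, not the actual
code under review):

```python
def assert_statement_fails(engine, command):
    """Passes 'command' to the 'engine' callable and makes sure it raises.

    'engine' is any callable that executes a statement and raises on
    failure, e.g. self.client.execute in an Impala test, or a bound
    method of a Hive connection wrapper.
    """
    try:
        engine(command)
    except Exception:
        return  # the expected failure occurred
    raise AssertionError("Expected '%s' to raise an exception" % command)
```

Naming a concrete class or method that satisfies the 'engine' contract would
make the intended usage obvious to future readers.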
Line 701: "Partition already exists")
You may want to add an assert here to verify that no partitions were added to
the table.
PS14, Line 713: was left alone
Even partitions need to be left alone from time to time :). Maybe "was not
modified"?
PS14, Line 714: # Sometimes metadata load is triggered here, so compare only
the first three
: # returned partitions.
Not sure I follow this comment. Can you explain the non-deterministic behavior
wrt metadata load?
PS14, Line 770: def test_partitions_exist_after_refresh(self, vector):
I don't think this is an integration test. You may want to move it to
test_partition_metadata.py and also cover the 'invalidate metadata' case. Also,
another integration test case that is currently not covered is adding
partitions using Impala and checking that they are accessible through Hive.
Line 800: table_name).get_data().split('\n')
Also try to add some new and existing partitions using Impala (maybe with
location and cached properties) and verify that new partitions are added
correctly and existing partitions are not modified. You may want to test with
and without IF NOT EXISTS.
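Roughly, the coverage I have in mind (table name, partition values, and the
'execute' callable are placeholders, not the real test fixtures):

```python
def check_add_partition_if_not_exists(execute):
    # With IF NOT EXISTS, a mix of new and pre-existing partitions should
    # succeed, adding only the new ones and leaving the rest untouched.
    execute("ALTER TABLE t ADD IF NOT EXISTS "
            "PARTITION (year=2050, month=1) "   # new
            "PARTITION (year=2010, month=1)")   # already exists
    # Without IF NOT EXISTS, the same mix should fail because one of the
    # partitions already exists.
    failed = False
    try:
        execute("ALTER TABLE t ADD "
                "PARTITION (year=2050, month=2) "
                "PARTITION (year=2010, month=1)")
    except Exception:
        failed = True
    assert failed, "expected 'Partition already exists' error"
```

The same pattern with LOCATION and CACHED IN clauses on the new partitions
would cover the remaining combinations.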
To view, visit http://gerrit.cloudera.org:8080/4144
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bhara...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogian...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>