Joe McDonnell has uploaded this change for review.
( http://gerrit.cloudera.org:8080/14060 )


Change subject: [WIP] IMPALA-8821: Use RECOVER PARTITIONS in dataload to get 
partition metadata
......................................................................

[WIP] IMPALA-8821: Use RECOVER PARTITIONS in dataload to get partition metadata

When using a data snapshot without a metadata snapshot (e.g. when loading to
a remote cluster), the data is already in place, and dataload needs to perform
all the appropriate DDLs to create the metadata. Currently, for dynamically
partitioned tables, dataload gives up and reloads those tables from scratch
in this circumstance. However, this is unnecessary, because ALTER TABLE ...
RECOVER PARTITIONS can get the partition metadata by looking at the filesystem.
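
As a rough illustration of the statement involved, the sketch below builds
the DDL for one table. The helper name recover_partitions_stmt is
hypothetical and is not taken from generate-schema-statements.py; only the
ALTER TABLE ... RECOVER PARTITIONS syntax comes from Impala itself.

```python
def recover_partitions_stmt(db_name, table_name):
    """Build the DDL that asks Impala to discover partitions on the filesystem.

    Hypothetical helper for illustration; the real script may assemble its
    statements differently.
    """
    return "ALTER TABLE {0}.{1} RECOVER PARTITIONS".format(db_name, table_name)


if __name__ == "__main__":
    # Example using the TPC-DS table mentioned in this change description.
    print(recover_partitions_stmt("tpcds", "store_sales"))
```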

This changes dataload to use RECOVER PARTITIONS for dynamically partitioned
tables rather than forcing a reload of the table. Dataload from scratch is
not impacted, because there is no existing data and everything needs to be
inserted anyway. Dataload with both a data snapshot and a metadata snapshot
also is not impacted, because testdata/bin/create-load-data.sh skips most
of the bin/load-data.py calls for that codepath. So, this is limited to
dataload with a data snapshot and without a metadata snapshot. The biggest
impact of this is that the TPC-DS store_sales table does not have to be
reloaded from scratch in this case.

Impala dataload overrides the default location for table directories with
its own nonstandard location. These locations reside outside the
database *.db directories. The current table existence check is tuned to
handle tables that reside in directories with this naming scheme. It does
not handle tables that use the default location (i.e. the location used when
LOCATION is not specified). This change detects tables using the standard
directory naming and uses a different table existence check for those tables.
This eliminates the need to reload these tables.
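
A minimal sketch of the idea, assuming the usual Hive-style default layout of
<warehouse>/<db>.db/<table>. The function name, its parameters, and the
example warehouse root are hypothetical and only illustrate picking the
directory to check; they do not reproduce the actual check in the patch.

```python
import posixpath


def expected_table_dir(warehouse_root, db_name, table_name, custom_location=None):
    """Return the directory an existence check would need to look at.

    Assumption: a table created without LOCATION lives under
    <warehouse_root>/<db_name>.db/<table_name>. A table created with an
    explicit LOCATION lives wherever that clause points.
    """
    if custom_location is not None:
        # Dataload-style override: the directory is outside the *.db directory.
        return custom_location
    # Standard naming: the directory is inside the database's *.db directory.
    return posixpath.join(warehouse_root, db_name + ".db", table_name)


if __name__ == "__main__":
    print(expected_table_dir("/test-warehouse", "tpcds", "store_sales"))
```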

Callers of bin/load-data.py always have the option of forcing a reload
via the --force_reload flag.
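
The overall decision described in this message can be summarized roughly as
follows. This is a sketch under stated assumptions: the function name, the
argument names, and the returned labels are made up for illustration and do
not correspond to identifiers in bin/load-data.py or
generate-schema-statements.py.

```python
def choose_load_action(data_exists, dynamically_partitioned, force_reload=False):
    """Pick how a table's data and metadata get set up during dataload.

    Returned labels are illustrative only:
      "reload"             - insert the data from scratch
      "recover_partitions" - keep the data, recover partition metadata
      "ddl_only"           - keep the data, run only the remaining DDLs
    """
    if force_reload or not data_exists:
        # Nothing usable on disk, or the caller explicitly asked for a reload.
        return "reload"
    if dynamically_partitioned:
        # Data is in place; partition metadata can come from the filesystem.
        return "recover_partitions"
    return "ddl_only"


if __name__ == "__main__":
    print(choose_load_action(data_exists=True, dynamically_partitioned=True))
```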

Testing:
 - Ran normal dataload (no snapshots)
 - Ran dataload with just a data snapshot (no metadata snapshot)
 - Ran dataload with a data snapshot and a metadata snapshot

Change-Id: I2622cd3655cf4521d5ac945759fd35c9abe670ef
---
M testdata/bin/generate-schema-statements.py
1 file changed, 47 insertions(+), 16 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/14060/1
--
To view, visit http://gerrit.cloudera.org:8080/14060
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2622cd3655cf4521d5ac945759fd35c9abe670ef
Gerrit-Change-Number: 14060
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <joemcdonn...@cloudera.com>
