Hello Jason Fehr, Michael Smith, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/20386

to look at the new patch set (#11).

Change subject: IMPALA-12389: Use -skipTrash to avoid accumulating trash
......................................................................

IMPALA-12389: Use -skipTrash to avoid accumulating trash

The default behavior for deleting files on Hadoop is to
move them to a trash folder. The trash folder can be
aged out, but Impala's developer environment configures
a long trash retention period. As a result, the trash
contents keep accumulating and consuming disk space.

This combines multiple changes to avoid accumulating trash:
1. This changes HadoopFsCommandLineClient's delete_file_dir
   to use -skipTrash so that deleted files bypass the trash.
   It also modifies DelegatingHdfsClient to use
   HadoopFsCommandLineClient for delete_file_dir, because the
   WebHDFS client doesn't have the option to skip trash. The
   delegating client first does a quick existence check using
   WebHDFS to avoid the overhead of invoking the commandline
   for a location that doesn't exist.
2. This changes the unique_database fixture to delete the
   database directory before dropping the database. Non-external
   tables deleted as part of DROP DATABASE .. CASCADE are
   moved to the trash. Deleting the database directory ourselves
   avoids sending these files to the trash.
3. "hdfs dfs -expunge -immediate" can recover the disk space, but
   it is very slow. This increases the dfs.block.invalidate.limit
   to allow HDFS to delete more blocks in a single heartbeat.
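
As a rough illustration of item 1 (not the code in this patch),
the delete path could look like the sketch below. The function
names and the injectable "exists" and "run" callables are
hypothetical and exist only to keep the example self-contained.
The "hadoop fs -rm -r -skipTrash" flags are the standard Hadoop
commandline options.

```python
import subprocess


def build_rm_command(path, skip_trash=True):
    """Build the 'hadoop fs -rm' argument list for a recursive delete.

    Hypothetical helper for illustration; not taken from the patch.
    """
    cmd = ["hadoop", "fs", "-rm", "-r"]
    if skip_trash:
        # -skipTrash deletes immediately instead of moving to the trash.
        cmd.append("-skipTrash")
    cmd.append(path)
    return cmd


def delete_file_dir(path, exists, run=subprocess.call):
    """Delete 'path' while bypassing the trash.

    exists: callable doing the cheap existence check (assumed to be
            backed by WebHDFS in the real client).
    run:    callable that executes the command and returns its exit code.
    Returns True only if the path existed and the delete succeeded.
    """
    if not exists(path):
        # Skip the comparatively slow commandline invocation entirely.
        return False
    return run(build_rm_command(path)) == 0
```

Passing the existence check in as a callable keeps the slow
commandline invocation off the path for locations that are
already gone, which is the trade-off item 1 describes.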

To support this change, there were other test-only changes:
 - TestHdfsEncryption and TestHdfsPermissions used WebHDFS-style
   paths without the leading slash. This is incompatible with
   using the HDFS commandline for delete_file_dir, so this change
   switches those tests to normal paths. This should be safe,
   because the tests always go through the delegating client,
   which removes the leading slash when it uses the WebHDFS
   client.
 - This relaxes the timing for TestRecursiveListing, because
   deletes via the Hadoop commandline are slower than deletes
   through the WebHDFS client.
 - This updates a few tests that placed tables outside of the
   unique_database. In particular, Iceberg tests using
   create_iceberg_table_from_directory() were putting tables
   outside the database.
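
The path handling described in the first bullet can be sketched
as follows. This is an assumption about the shape of the
normalization, not the delegating client's actual code, and
to_webhdfs_path is a hypothetical name.

```python
def to_webhdfs_path(path):
    """Convert a normal absolute path to the WebHDFS-style form.

    Hypothetical helper: WebHDFS-style paths in these tests omit the
    leading slash, while the HDFS commandline expects it.
    """
    return path.lstrip("/")


# Tests can pass normal paths everywhere; only the WebHDFS calls
# need the slash-free form.
normal_path = "/test-warehouse/some_dir"
webhdfs_path = to_webhdfs_path(normal_path)
```

With a conversion like this inside the client, a path that is
already in WebHDFS style passes through unchanged.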

Testing:
 - Ran tests locally and examined the trash directory

Change-Id: I2d304113596aaf70a122202a33276fc7c3d599e8
---
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
M tests/common/file_utils.py
M tests/conftest.py
M tests/metadata/test_hdfs_encryption.py
M tests/metadata/test_hdfs_permissions.py
M tests/metadata/test_recursive_listing.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
M tests/util/hdfs_util.py
M tests/util/iceberg_metadata_util.py
10 files changed, 48 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/86/20386/11
--
To view, visit http://gerrit.cloudera.org:8080/20386
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2d304113596aaf70a122202a33276fc7c3d599e8
Gerrit-Change-Number: 20386
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Jason Fehr <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
