GitHub user manishgupta88 opened a pull request:
https://github.com/apache/carbondata/pull/2868
[WIP] Improve drop table performance by reducing the namenode RPC calls
during physical deletion of files
**Problem**
The current drop table command takes more than 1 minute to delete 3000 files
from HDFS.
**Analysis**
Even though we are using the HDFS file system, we explicitly iterate
recursively through the table folders and delete each file individually. Each
file deletion and each file listing makes one RPC call to the namenode. Deleting
3000 files therefore takes 3000 RPC calls for file deletion, plus a few more
RPC calls for file listing in each folder.
**Solution**
HDFS provides an API for deleting all folders and files recursively for a
given path in a single RPC call. Use that API to improve the drop table
operation performance.
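To make the RPC arithmetic concrete, here is a minimal toy model in Java. It does not talk to HDFS and is not the code in this PR: `ToyNameNode`, `Node`, and the layout of 3 segment folders with 1000 files each are hypothetical, chosen only to match the 3000-file case. In this model directory deletions are also counted as RPCs, so the file-by-file total comes out slightly above 3000.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: counts simulated namenode round trips. It does not talk to HDFS.
class DropTableRpcSketch {

    // Hypothetical in-memory file tree node (not a CarbonData or Hadoop class).
    static final class Node {
        final boolean isDir;
        final List<Node> children = new ArrayList<>();
        Node(boolean isDir) { this.isDir = isDir; }
    }

    // Hypothetical stand-in for the namenode: every method call counts as one RPC.
    static final class ToyNameNode {
        int rpcCalls = 0;

        List<Node> list(Node dir) {
            rpcCalls++;
            return new ArrayList<>(dir.children); // copy, so callers can delete while iterating
        }

        // Non-recursive delete: only files and empty directories are accepted.
        void deleteOne(Node parent, Node target) {
            rpcCalls++;
            if (!target.children.isEmpty()) {
                throw new IllegalStateException("directory is not empty");
            }
            parent.children.remove(target);
        }

        // Recursive delete: the whole subtree is removed in one call.
        void deleteRecursive(Node parent, Node target) {
            rpcCalls++;
            parent.children.remove(target);
        }
    }

    // Client-side walk: list each folder, delete every entry one by one, then the folder.
    static void deleteFileByFile(ToyNameNode nn, Node parent, Node target) {
        if (target.isDir) {
            for (Node child : nn.list(target)) {
                deleteFileByFile(nn, target, child);
            }
        }
        nn.deleteOne(parent, target);
    }

    // Builds store -> table -> segments -> files (assumed layout, for illustration).
    static Node buildStoreWithTable(int segments, int filesPerSegment) {
        Node store = new Node(true);
        Node table = new Node(true);
        store.children.add(table);
        for (int s = 0; s < segments; s++) {
            Node segment = new Node(true);
            table.children.add(segment);
            for (int f = 0; f < filesPerSegment; f++) {
                segment.children.add(new Node(false));
            }
        }
        return store;
    }

    static int rpcsFileByFile(int segments, int filesPerSegment) {
        Node store = buildStoreWithTable(segments, filesPerSegment);
        ToyNameNode nn = new ToyNameNode();
        deleteFileByFile(nn, store, store.children.get(0));
        return nn.rpcCalls;
    }

    static int rpcsRecursive(int segments, int filesPerSegment) {
        Node store = buildStoreWithTable(segments, filesPerSegment);
        ToyNameNode nn = new ToyNameNode();
        nn.deleteRecursive(store, store.children.get(0));
        return nn.rpcCalls;
    }

    public static void main(String[] args) {
        System.out.println("file by file: " + rpcsFileByFile(3, 1000) + " RPCs"); // 3008
        System.out.println("recursive:    " + rpcsRecursive(3, 1000) + " RPCs");  // 1
    }
}
```

In Hadoop itself, the single-call deletion is presumably `FileSystem.delete(Path f, boolean recursive)` with `recursive` set to `true`; the exact call site changed by this PR is not shown in this message.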
**Result:** After these code changes, the time the drop table operation takes
to delete 3000 files from HDFS has been reduced from 1 minute to ~2 sec.
- [ ] Any interfaces changed?
No
- [ ] Any backward compatibility impacted?
No
- [ ] Document update required?
No
- [ ] Testing done
Verified on cluster
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.
NA
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/manishgupta88/carbondata drop_table_slow
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/carbondata/pull/2868.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2868
----
commit f79f0fa351ed76cb74fe441f7d13cf756d49cb4c
Author: manishgupta88 <tomanishgupta18@...>
Date: 2018-10-29T06:09:09Z
Modified code to improve the drop table command performance
----