cshannon commented on issue #2820:
URL: https://github.com/apache/accumulo/issues/2820#issuecomment-1214172137

   @EdColeman -
   
   Yesterday/today I spent a good amount of time diving into the Scan API and its
implementation between the client and server to get a better feel for how that
works, and then I also started working on this a bit. I have a branch with a
rough prototype/proof of concept that is a work in progress here:
https://github.com/cshannon/accumulo/commits/accumulo-2820
   
   It's not ready for a real review yet, as there's more work to be done, but you
can take a look if you get a chance and see the direction I'm going. I had a
couple of questions/comments and wanted to get your thoughts.
   
   1. The metadata table scan could technically be done by the client
without an RPC call, but I kept the current approach of sending an RPC request and
letting the server handle it inside the
[TableDiskUsage](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/server/base/src/main/java/org/apache/accumulo/server/util/TableDiskUsage.java)
 class. I think this is much better: it keeps the current design intact and
is a simpler update, and this utility already scans metadata for the file names
to use for the HDFS iterator. So it can simply be updated to read the sizes
from metadata instead, and the client/shell code can work more or less the same
without many modifications.
   2. I created a new disk usage RPC call, which is the same as the old one but with
a new method parameter. This will allow passing any options we want to customize
the du command when it is sent to the server for processing. The main thing now is a
[Mode](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/core/src/main/thrift-gen-java/org/apache/accumulo/core/clientImpl/thrift/TDiskUsageMode.java)
 enum, which currently has just FILE, DIRECTORY, and METADATA. The idea is that the user
running the command could specify how they want the size computed, and the
documentation will describe the benefits/drawbacks of each mode. FILE is just
the current default way of scanning the HDFS files, DIRECTORY would use
the hdfs -dus command (not implemented yet in my prototype), and METADATA
would just scan the metadata table. Having the options parameter and an
enum for the mode will let us easily expand in the future with any flags or
settings we want for computing usage.
   3. I still need to update things to handle scanning the root table if 
someone wants to know the metadata table size itself.
   4. I haven't looked at the bulk import stuff yet, but that could be another mode
or just be included automatically; I'm not sure yet.
   5. Tests, of course, will still need to be updated and written.
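   To make item 1 concrete, here is a rough Java sketch of the metadata-based approach: instead of stat-ing each file in HDFS, sum the sizes already recorded in the metadata table's file entries. The class and method names here are illustrative, not the actual TableDiskUsage API, and I'm assuming DataFileValue-style "size,numEntries" encoded values.

```java
import java.util.Map;

// Hypothetical sketch (not the real TableDiskUsage code): sum file sizes
// straight from metadata entries rather than querying HDFS per file.
public class MetadataUsageSketch {

  // fileEntries: file path -> "size,numEntries"-style encoded value
  static long totalSize(Map<String, String> fileEntries) {
    long total = 0;
    for (String value : fileEntries.values()) {
      // the first comma-separated field is the file size in bytes
      total += Long.parseLong(value.split(",")[0]);
    }
    return total;
  }
}
```

   The appeal is that this avoids a namenode round trip per file, since the sizes are already sitting in the metadata table the utility scans anyway.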
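   And for item 2, the server-side dispatch on the mode parameter might look something like the sketch below. The enum mirrors the three TDiskUsageMode values mentioned above; the supplier parameters and method names are made up for illustration.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of dispatching on the requested disk usage mode;
// names are illustrative, not the actual server-side API.
public class DiskUsageModeSketch {

  // mirrors the TDiskUsageMode thrift enum
  enum Mode { FILE, DIRECTORY, METADATA }

  static long usage(Mode mode, LongSupplier hdfsFileScan,
                    LongSupplier hdfsDus, LongSupplier metadataScan) {
    switch (mode) {
      case FILE:      return hdfsFileScan.getAsLong(); // current default: scan files in HDFS
      case DIRECTORY: return hdfsDus.getAsLong();      // hdfs -dus style directory summary
      case METADATA:  return metadataScan.getAsLong(); // read sizes from the metadata table
      default: throw new IllegalArgumentException("unknown mode: " + mode);
    }
  }
}
```

   New modes or flags would then just be new enum values (plus handling), which is the extensibility point I was after with the options parameter.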


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
