cshannon commented on issue #2820:
URL: https://github.com/apache/accumulo/issues/2820#issuecomment-1214172137

   @EdColeman -
   
   Yesterday/today I spent a good amount of time diving into the Scan API and its
implementation between the client and server to get a better feel for how that
works, and then I also started working on this a bit. I have a branch with a
rough prototype/proof of concept that is a work in progress here:
https://github.com/cshannon/accumulo/commits/accumulo-2820
   
   It's not ready for a real review yet, as there's more work to be done, but you
can take a look if you get a chance and see the direction I'm going. I had a
couple of questions/comments and wanted to get your thoughts.
   
   1. The metadata table scan could technically be done by the client
without an RPC call, but I kept the current approach of sending an RPC request and
letting the server handle it inside the
[TableDiskUsage](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/server/base/src/main/java/org/apache/accumulo/server/util/TableDiskUsage.java)
 class. I think this is much better: it keeps the current design intact and
is a simpler update, and this utility already scans metadata for the file names
to use for the HDFS iterator. So it can simply be updated to read the sizes
from metadata instead, and the client/shell code can work more or less the same
without many modifications.
   2. I created a new disk usage RPC call, which is the same as the old one but with
a new method parameter. This will allow passing any options we want to customize
the du command when it is sent to the server for processing. The main thing now is a
[Mode](https://github.com/cshannon/accumulo/blob/9b6ae27be4ba50513cf41c58281324211f8e75d3/core/src/main/thrift-gen-java/org/apache/accumulo/core/clientImpl/thrift/TDiskUsageMode.java)
 enum, which currently has just FILE, DIRECTORY, and METADATA. The idea is that the user
running the command could specify how they want the size computed, and the
documentation will describe the benefits/drawbacks of each mode. FILE is just
the current default way of scanning the HDFS files, DIRECTORY would use
the hdfs -dus command (not implemented yet in my prototype), and METADATA
would just scan the metadata table. Having the options parameter and an
enum for the mode will let us easily expand in the future with any flags or
settings we want for computing usage.
   3. I still need to update things to handle scanning the root table if 
someone wants to know the metadata table size itself.
   4. I haven't looked at the bulk import stuff yet, but that could be another mode
or just be included automatically; I'm not sure yet.
   5. Tests, of course, will still need to be updated and written.
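   To make item 1 concrete, here is a rough Java sketch of the metadata-based approach: instead of stat-ing each file in HDFS, sum the sizes already recorded in the metadata table's file entries. The class and method names here are illustrative, not the actual TableDiskUsage API, and I'm assuming DataFileValue-style "size,numEntries" encoded values.

```java
import java.util.Map;

// Hypothetical sketch (not the real TableDiskUsage code): sum file sizes
// straight from metadata entries rather than querying HDFS per file.
public class MetadataUsageSketch {

  // fileEntries: file path -> "size,numEntries"-style encoded value
  static long totalSize(Map<String, String> fileEntries) {
    long total = 0;
    for (String value : fileEntries.values()) {
      // the first comma-separated field is the file size in bytes
      total += Long.parseLong(value.split(",")[0]);
    }
    return total;
  }
}
```

   The appeal is that this avoids a namenode round trip per file, since the sizes are already sitting in the metadata table the utility scans anyway.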
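   And for item 2, the server-side dispatch on the mode parameter might look something like the sketch below. The enum mirrors the three TDiskUsageMode values mentioned above; the supplier parameters and method names are made up for illustration.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of dispatching on the requested disk usage mode;
// names are illustrative, not the actual server-side API.
public class DiskUsageModeSketch {

  // mirrors the TDiskUsageMode thrift enum
  enum Mode { FILE, DIRECTORY, METADATA }

  static long usage(Mode mode, LongSupplier hdfsFileScan,
                    LongSupplier hdfsDus, LongSupplier metadataScan) {
    switch (mode) {
      case FILE:      return hdfsFileScan.getAsLong(); // current default: scan files in HDFS
      case DIRECTORY: return hdfsDus.getAsLong();      // hdfs -dus style directory summary
      case METADATA:  return metadataScan.getAsLong(); // read sizes from the metadata table
      default: throw new IllegalArgumentException("unknown mode: " + mode);
    }
  }
}
```

   New modes or flags would then just be new enum values (plus handling), which is the extensibility point I was after with the options parameter.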


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
