EdColeman commented on PR #2809:
URL: https://github.com/apache/accumulo/pull/2809#issuecomment-1187715715

   The du command (and I assume this one) can perform poorly with tables that 
have a large number of files.  There are subtleties involved in trying to 
determine "size":
   
   - This command and the shell du command both seem to require an Accumulo 
instance (at least a metadata table). Theoretically the information could be 
read directly from ZooKeeper - sort of a virtual metadata table - but I'm not sure 
it is worth it. @milleruntime's comment that this is redundant holds if both 
require an active instance.
   
   For the du command in general:
   
   - The metadata stores estimated entry counts and file sizes that are updated 
on compaction. For bulk imports, the information may not be available.
   - These commands are trying to account for shared files - if a table is 
cloned, the clone is a metadata operation and the file references in the metadata are 
"copied" - so you cannot simply use the file sizes in hdfs under the table directory, 
because the table *could be* sharing files that live under a different table's 
directory (see the metadata-scan sketch after this list).
   - When files are *NOT* shared, could something like the hdfs dfs -du -s option 
be used?
   - There may be a conflict between how much space an Accumulo table needs 
vs. how much space it is occupying on disk. Compactions in progress can 
increase the hdfs usage because the new files are written to disk but are not yet 
part of the table's "used" files.  Files that are eligible for gc but not yet 
deleted also inflate the hdfs space.  For a large table, does it matter - you 
have a big number vs. another, bigger number.  In most cases the physical hdfs 
usage seems the most relevant.  If the Accumulo size is a driving factor and an 
accurate number is needed, then you probably should run a compaction first so that 
old data can be aged off, shared files consolidated, etc.
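   
   As an illustration of the estimate and shared-file points above, here is a minimal sketch of the kind of metadata scan involved. It assumes the 2.x AccumuloClient API, a client properties file passed as the first argument, and that the file column qualifier is the file path; the class name and output are hypothetical, not what this PR implements:
   
   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   
   import org.apache.accumulo.core.client.Accumulo;
   import org.apache.accumulo.core.client.AccumuloClient;
   import org.apache.accumulo.core.client.Scanner;
   import org.apache.accumulo.core.data.Key;
   import org.apache.accumulo.core.data.Value;
   import org.apache.accumulo.core.security.Authorizations;
   import org.apache.hadoop.io.Text;
   
   public class MetadataDuSketch {
     public static void main(String[] args) throws Exception {
       try (AccumuloClient client = Accumulo.newClient().from(args[0]).build();
           Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
   
         // One metadata entry per file reference: the qualifier is the file path,
         // the value is the estimated "size,entries" pair that is updated on compaction.
         scanner.fetchColumnFamily(new Text("file"));
   
         Map<String,Set<String>> tablesPerFile = new HashMap<>();
         long estimatedBytes = 0;
   
         for (Map.Entry<Key,Value> e : scanner) {
           // Row is "<tableId>;<endRow>", or "<tableId><" for the default tablet.
           String tableId = e.getKey().getRow().toString().split("[;<]", 2)[0];
           String file = e.getKey().getColumnQualifier().toString();
   
           tablesPerFile.computeIfAbsent(file, f -> new HashSet<>()).add(tableId);
           estimatedBytes += Long.parseLong(e.getValue().toString().split(",")[0]);
         }
   
         long shared = tablesPerFile.values().stream().filter(t -> t.size() > 1).count();
         System.out.println("file refs: " + tablesPerFile.size()
             + ", shared across tables: " + shared
             + ", sum of per-reference size estimates: " + estimatedBytes);
       }
     }
   }
   ```
   
   A real implementation would restrict the scan to the metadata ranges of the tables of interest rather than the whole metadata table - which is part of why du gets slow for tables with many files - and a file shared by a clone is counted once per table here, so the total over-counts the physical usage.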
   
   There may be some optimizations for the du command in general (likely a 
separate issue as a follow-on to https://github.com/apache/accumulo/pull/1259):
   
   - if no files are shared, just use the hdfs directory size of the table (a 
minimal FileSystem API sketch follows this list).
   - Maybe use the per-file entry / size estimates and then add the file sizes of 
bulk import files?
   - provide options to get just the entry / size estimates from the metadata, 
or just the hdfs directory size.
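   
   For the "just use the hdfs directory size" path, something like Hadoop's FileSystem#getContentSummary gives the same numbers as hdfs dfs -du -s. A minimal sketch, assuming the usual /accumulo/tables/<tableId> layout is passed in (the class name is hypothetical):
   
   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.ContentSummary;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   
   public class TableDirSizeSketch {
     public static void main(String[] args) throws Exception {
       // e.g. hdfs://namenode/accumulo/tables/<tableId>
       Path tableDir = new Path(args[0]);
       try (FileSystem fs = tableDir.getFileSystem(new Configuration())) {
         ContentSummary summary = fs.getContentSummary(tableDir);
         System.out.println("length: " + summary.getLength() + " bytes"
             + ", space consumed (with replication): " + summary.getSpaceConsumed());
       }
     }
   }
   ```
   
   The caveats above still apply: this counts files that are only awaiting gc, and it misses shared files that live under a different table's directory.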

