EdColeman opened a new issue, #2820:
URL: https://github.com/apache/accumulo/issues/2820

   **Is your feature request related to a problem? Please describe.**
   As a follow-on to https://github.com/apache/accumulo/pull/1259, there may be ways to improve the du command's performance, possibly at the cost of some accuracy.
   
   **Describe the solution you'd like**
   Ideally the du command would return nearly as fast as the Hadoop directory usage command.
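
   For reference, a minimal sketch of what that fast path looks like through the Hadoop API (the table directory path below is hypothetical; the real location depends on the instance volumes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableDirUsage {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical table directory - the actual location depends on the instance volumes.
    Path tableDir = new Path("/accumulo/tables/3a");
    // getContentSummary() computes the subtree size in a single NameNode-side
    // operation rather than a client-side walk of every file, which is why it
    // returns quickly even for large directory trees.
    ContentSummary summary = fs.getContentSummary(tableDir);
    System.out.println("bytes on disk under " + tableDir + ": " + summary.getLength());
  }
}
```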
   
   **Additional context**
   There may be a conflict between how much space an Accumulo table needs vs. how much space it occupies on disk. Compactions in progress can increase the hdfs usage because the new files are created and stored on disk but are not yet part of the table's "used" files. Files that are eligible for gc but not yet deleted also inflate the hdfs space. 
   
   The metadata stores estimated entity counts and file sizes that are updated on compaction. For bulk imports, that information may not be available.
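
   As a rough sketch of what pulling those estimates might look like (the connection details, the table id, and the exact value encoding here are assumptions on my part), the metadata `file` entries carry an estimated size and entry count that could simply be summed per table:

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class MetadataSizeEstimate {
  public static void main(String[] args) throws Exception {
    // Connection details and the table id ("3a") are placeholders.
    try (AccumuloClient client = Accumulo.newClient()
        .to("instance", "zk1:2181").as("user", "secret").build();
        Scanner scanner = client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
      // Metadata rows for a table start with its table id
      // (a plain prefix scan is a simplification of the real row layout).
      scanner.setRange(Range.prefix("3a"));
      // File references live in the "file" column family; the value encodes an
      // estimated size and entry count, e.g. "12345,678", refreshed on compaction.
      scanner.fetchColumnFamily(new Text("file"));
      long estimatedBytes = 0;
      for (Entry<Key,Value> entry : scanner) {
        estimatedBytes += Long.parseLong(entry.getValue().toString().split(",")[0]);
      }
      System.out.println("estimated bytes from metadata: " + estimatedBytes);
    }
  }
}
```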
   
   For a large table, does it matter? You have a big number vs. another bigger number. In most cases, the physical hdfs usage seems the most relevant. If the Accumulo size is a driving factor and an accurate number is needed, then you should probably run a compaction so that old data can be aged off, shared files consolidated, etc. At that point (barring additional bulk imports) the metadata values should be accurate and may be good enough. 
   
   The du command accounts for shared files - if a table is cloned, it is a metadata operation and the file references in the metadata are "copied" - so you cannot use the file size in hdfs under the table directory, because the table could be sharing files that live in a different table's directory tree.
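
   That sharing is why a correct computation has to deduplicate on the referenced file rather than just walking a directory. A toy sketch of the idea (not the actual du implementation):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SharedFileUsage {
  /**
   * Sum disk usage across tables, counting each referenced file once even
   * when a clone causes two tables to reference the same file.
   *
   * @param tableFiles table name -> file paths referenced in its metadata
   * @param fileSizes  file path -> size in bytes
   */
  static long totalUsage(Map<String,Set<String>> tableFiles, Map<String,Long> fileSizes) {
    Set<String> seen = new HashSet<>();
    long total = 0;
    for (Set<String> files : tableFiles.values()) {
      for (String f : files) {
        if (seen.add(f)) {          // only count a shared file the first time it is seen
          total += fileSizes.getOrDefault(f, 0L);
        }
      }
    }
    return total;
  }
}
```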
   
   Possible changes:
   
   - If no files are shared, just use the hdfs directory size of the table (the hdfs `-du -s` option).
   - Maybe use the file entity / size estimates and then add the file sizes of bulk import files?
   - Provide options to the command to get just the entity / size estimates from the metadata, or just the hdfs directory size, and allow the user to figure out what they need in the first place (or show both and wish them luck...). A rough sketch of these modes follows the list.
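
   A rough sketch of what those command options might look like (the mode names and helper methods are purely illustrative, not existing code):

```java
public class DuOptions {
  // Hypothetical modes for a faster du; names are illustrative only.
  enum Mode { HDFS_DIR_SIZE, METADATA_ESTIMATE, EXACT_SHARED_AWARE }

  static long diskUsage(String tableId, Mode mode) {
    switch (mode) {
      case HDFS_DIR_SIZE:
        // fast: one directory-size call on the table directory, but misleading
        // when files are shared with (or by) other tables
        return hdfsDirSize(tableId);
      case METADATA_ESTIMATE:
        // fast: sum the size estimates stored in the metadata table, plus the
        // actual sizes of bulk-imported files that lack estimates
        return metadataEstimate(tableId) + bulkImportFileSizes(tableId);
      default:
        // current behavior: scan metadata and deduplicate shared files
        return exactSharedAwareUsage(tableId);
    }
  }

  // Stubs standing in for the real lookups described above.
  static long hdfsDirSize(String tableId) { return 0L; }
  static long metadataEstimate(String tableId) { return 0L; }
  static long bulkImportFileSizes(String tableId) { return 0L; }
  static long exactSharedAwareUsage(String tableId) { return 0L; }
}
```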
   

