EdColeman opened a new issue, #2820: URL: https://github.com/apache/accumulo/issues/2820
**Is your feature request related to a problem? Please describe.**

As a follow-on to https://github.com/apache/accumulo/pull/1259, there may be ways to improve the `du` command's performance, possibly at the sacrifice of some accuracy.

**Describe the solution you'd like**

Ideally the `du` command could return nearly as fast as the Hadoop directory usage command.

**Additional context**

There may be a conflict between how much space an Accumulo table needs vs. how much space it is occupying on disk. Compactions in progress can increase the hdfs usage because their output files are created and stored on disk but are not yet part of the table's "used" files. Files that are eligible for gc but not yet deleted also inflate the hdfs space.

The metadata stores estimated entry counts and file sizes that are updated on a compaction. For bulk imports, that information may not be available. For a large table, does it matter? You have a big number vs. another bigger number. In most cases, the physical hdfs usage seems the most relevant. If the Accumulo size is a driving factor and an accurate number is needed, then you should probably run a compaction so that old data can be aged off, shared files consolidated, etc. At that point (barring additional bulk imports) the metadata values should be accurate and may be good enough.

The `du` command accounts for shared files - if a table is cloned, it is a metadata operation and the file references in the metadata are "copied" - so you cannot use the file size in hdfs under the table directory, because the table could be sharing files that live in a different table's directory tree.

Possible changes:
- if no files are shared, just use the hdfs (`hdfs -dus` option) directory size of the table.
- maybe use the file entry / size estimates and then add the file sizes of bulk import files?
- provide options to the command to get just the entry / size estimates from the metadata, or just the hdfs directory size, and allow the user to figure out what they needed in the first place. (or show both and wish them luck...)
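As a rough illustration of the fast-path / fallback idea above, here is a minimal sketch (not Accumulo code; every name and parameter is hypothetical) of how a hybrid `du` might decide between the cheap hdfs directory size and the metadata-based estimate:

```python
# Hypothetical sketch of a hybrid `du` estimate. None of these names are
# real Accumulo APIs; the inputs stand in for data the command could gather.

def estimate_table_usage(table_files, file_ref_counts, hdfs_dir_size,
                         metadata_size_estimates, bulk_import_sizes):
    """Return an estimated on-disk size for one table.

    table_files:             set of file paths referenced by this table
    file_ref_counts:         file path -> number of tables referencing it
    hdfs_dir_size:           cheap directory-usage result for the table dir
    metadata_size_estimates: per-file sizes recorded in the metadata table
    bulk_import_sizes:       sizes of bulk-imported files that may not yet
                             have metadata size estimates
    """
    shared = any(file_ref_counts.get(f, 1) > 1 for f in table_files)
    if not shared:
        # Fast path: no clone shares these files, so the plain hdfs
        # directory size of the table is accurate.
        return hdfs_dir_size
    # Fallback: sum the metadata size estimates and add the sizes of
    # bulk import files, which may have no metadata estimate yet.
    return (sum(metadata_size_estimates.get(f, 0) for f in table_files)
            + sum(bulk_import_sizes.get(f, 0) for f in table_files))
```

This is only meant to show the shape of the trade-off: the shared-file check is the piece that forces the slower metadata-driven path, and the accuracy caveats above (in-progress compactions, gc-eligible files) apply to whichever branch is taken.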
