EdColeman commented on PR #2809: URL: https://github.com/apache/accumulo/pull/2809#issuecomment-1187715715
The du command (and I assume this one) can perform poorly with tables that have a large number of files, and there are subtleties involved in trying to determine "size".

This command and the shell du command both seem to require an Accumulo instance (at least a metadata table). Theoretically the information could be read directly from ZooKeeper - sort of a virtual metadata table - but I'm not sure it is worth it. @milleruntime's comment that this is redundant holds if both require an active instance.

For the du command in general:

- The metadata stores estimated entry counts and file sizes that are updated on compaction. For bulk imports, that information may not be available.
- These commands try to account for shared files - if a table is cloned, it is a metadata operation and the file references in the metadata are "copied" - so you cannot use the file sizes in hdfs under the table directory, because the table *could be* sharing files that live in a different directory.
- When files are *NOT* shared, could something like the hdfs -dus option be used? (See the first sketch at the end of this comment.)
- There may be a conflict between how much space an Accumulo table needs vs. how much space it occupies on disk. Compactions in progress can increase the hdfs usage because their output files are created and stored on disk but are not yet part of the table's "used" files. Files that are eligible for gc but not yet deleted also inflate the hdfs space. For a large table, does it matter - you have a big number vs. another, bigger number. In most cases the physical hdfs usage seems the most relevant. If the Accumulo size is the driving factor and an accurate number is needed, then you should probably run a compaction so that old data can be aged off, shared files consolidated, etc.

There may be some optimizations for the du command in general (likely a separate issue as a follow-on to https://github.com/apache/accumulo/pull/1259):

- If no files are shared, just use the hdfs directory size of the table.
- Maybe use the entry / size estimates from the metadata and then add the file sizes of bulk import files? (See the metadata scan sketch at the end of this comment.)
- Provide options to get just the entry / size estimates from the metadata, or just the hdfs directory size.
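As a rough illustration of the "just use the hdfs directory size" idea, here is a minimal sketch using Hadoop's FileSystem.getContentSummary(), which is the programmatic equivalent of `hdfs dfs -du -s` (the old -dus). The table directory path below is a made-up example - real code would resolve it from the instance volumes and the table id - and the number is only meaningful when no files are shared with other tables.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableDirSize {
  public static void main(String[] args) throws Exception {
    // Hypothetical table directory; real code would resolve this from the
    // instance volume configuration and the table id rather than hard-coding it.
    Path tableDir = new Path("hdfs://namenode:8020/accumulo/tables/1");

    FileSystem fs = tableDir.getFileSystem(new Configuration());
    ContentSummary summary = fs.getContentSummary(tableDir);

    // Logical bytes under the directory (what `hdfs dfs -du -s` reports first).
    System.out.println("length: " + summary.getLength());
    // Raw bytes consumed on disk, including replication.
    System.out.println("space consumed: " + summary.getSpaceConsumed());
  }
}
```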
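And for the "use the entry / size estimates from the metadata" option, a sketch of what that could look like with the 2.x client API, assuming the usual layout where per-tablet file entries live in the "file" column family of accumulo.metadata and the value is encoded as "size,entries". The table name and properties file path are placeholders. Note this counts a shared file once per referencing tablet, so cloned tables get overcounted - which is exactly the shared-file problem described above.

```java
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Accumulo;
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class MetadataSizeEstimate {
  public static void main(String[] args) throws Exception {
    String tableName = "mytable"; // placeholder table name
    try (AccumuloClient client =
        Accumulo.newClient().from("/path/to/accumulo-client.properties").build()) {
      String tableId = client.tableOperations().tableIdMap().get(tableName);

      long estimatedBytes = 0;
      try (Scanner scanner =
          client.createScanner("accumulo.metadata", Authorizations.EMPTY)) {
        // Tablet rows for a table run from "<tableId>;..." through the default tablet "<tableId><".
        scanner.setRange(new Range(tableId + ";", true, tableId + "<", true));
        // Per-tablet file entries; the value holds the stored size/entry estimate.
        scanner.fetchColumnFamily(new Text("file"));
        for (Entry<Key,Value> e : scanner) {
          estimatedBytes += Long.parseLong(e.getValue().toString().split(",")[0]);
        }
      }
      System.out.println("estimated bytes (shared files counted per reference): " + estimatedBytes);
    }
  }
}
```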
