[ https://issues.apache.org/jira/browse/HDFS-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615665#comment-13615665 ]
Suresh Srinivas commented on HDFS-4489:
---------------------------------------

bq. I think you're also adding an extra 8 bytes on the arrays – the array length as I understand it is a field within the 16 byte object header (occupying the second half of the klassId field).

If you have an authoritative source, please send me that. I cannot understand how a 16 byte object header could have, say, 8 spare bytes to track the array length. Some of my previous instrumentation had led me to conclude that the array length is 4 bytes on a 32-bit JVM and 8 bytes on a 64-bit JVM. See the discussion here - http://www.javamex.com/tutorials/memory/object_memory_usage.shtml.

bq. a typical image with ~50M files will only need ~5M unique name byte[] objects, so I think it's unfair to count the above against the inode.

That is a fair point. But my claim that inodes occupy 1/3rd of the java heap is also an approximation, and in practice I would expect inodes to occupy less than that. I would like to run an experiment on a large production image, but I do not have ready access to one and will have to spend time getting to it. Do you have any?

bq. but I'm afraid it may look closer to 10+% in practice.

I do not think it will be close to 10%, but let's say it is. I do not see much of an issue with it. When we did some of the optimizations earlier, we were not sure how the JVM would behave as the heap got close to 64G, and hence wanted to keep the heap size down. But since then many large installations have successfully gone beyond that size without any issues. Smaller installations should be able to spare, say, 10% extra heap. But if that is not acceptable, here are the alternatives I see:
# Add a configuration option to turn this feature off. Not instantiating the GSet will reduce the overhead by 1/3rd. This is simple to do.
# Make more optimizations at the expense of code complexity. I would like to avoid this. But if it is deemed very important, with some optimizations we can get it close to 0%.
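To make the disagreement above concrete, here is a back-of-the-envelope sketch of the two per-array overhead models being debated: a 16-byte header with a separate 8-byte length field, versus the length packed inside a 16-byte header. The layout numbers and the 8-byte alignment are assumptions about a 64-bit HotSpot JVM, not authoritative figures, and the class and method names are illustrative only.

```java
// Back-of-the-envelope sizing of byte[] objects under the two layout
// models discussed above. All layout constants here are ASSUMPTIONS
// about a 64-bit HotSpot JVM, not measured or authoritative values.
public class ArraySizeEstimate {

    // HotSpot rounds object sizes up to 8-byte alignment.
    static long align8(long bytes) {
        return (bytes + 7) & ~7L;
    }

    // Model A: 16-byte object header plus a separate 8-byte length field.
    static long sizeWithSeparateLength(long payloadBytes) {
        return align8(16 + 8 + payloadBytes);
    }

    // Model B: array length packed into the header, 16 bytes total overhead.
    static long sizeWithPackedLength(long payloadBytes) {
        return align8(16 + payloadBytes);
    }

    public static void main(String[] args) {
        long avgNameLen = 16;        // assumed average file-name length in bytes
        long names = 5_000_000L;     // ~5M unique name byte[] objects (from the thread)
        System.out.println("Model A total: " + names * sizeWithSeparateLength(avgNameLen));
        System.out.println("Model B total: " + names * sizeWithPackedLength(avgNameLen));
    }
}
```

Under these assumed numbers the two models differ by 8 bytes per array, which is exactly the disputed overhead; multiplied across millions of name arrays, that is where the heap-size estimates diverge.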
> Use InodeID as an identifier of a file in HDFS protocols and APIs
> -----------------------------------------------------------------
>
>                 Key: HDFS-4489
>                 URL: https://issues.apache.org/jira/browse/HDFS-4489
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Brandon Li
>            Assignee: Brandon Li
>
> The benefit of using InodeID to uniquely identify a file can be multiple
> fold. Here are a few of them:
> 1. uniquely identify a file across renames; related JIRAs include HDFS-4258,
> HDFS-4437.
> 2. modification checks in tools like distcp. Since a file could have been
> replaced or renamed over, the file name and size combination is not reliable,
> but the combination of file id and size is unique.
> 3. id based protocol support (e.g., NFS)
> 4. to make the pluggable block placement policy use fileid instead of
> filename (HDFS-385).
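Point 2 of the description above can be sketched as a change detector that keys on (fileId, length) instead of (path, length), so a file that was replaced or renamed over is still flagged for re-copy. This is a hypothetical illustration, not actual distcp code; all names here are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of inode-id-based modification checks (point 2 above).
// Not distcp's real implementation; names and structure are illustrative.
public class ChangeDetector {

    /** Snapshot of a file at copy time: its inode id plus its length. */
    static final class FileState {
        final long fileId;
        final long length;
        FileState(long fileId, long length) {
            this.fileId = fileId;
            this.length = length;
        }
    }

    private final Map<String, FileState> lastCopied = new HashMap<>();

    /** Remember what was copied for this path. */
    void recordCopied(String path, long fileId, long length) {
        lastCopied.put(path, new FileState(fileId, length));
    }

    // Re-copy if the path was never copied, or if the inode id changed
    // (the file was replaced or renamed over), or if the length changed.
    // Keying on (fileId, length) catches replacements that a
    // (path, length) comparison would miss.
    boolean needsCopy(String path, long fileId, long length) {
        FileState prev = lastCopied.get(path);
        return prev == null || prev.fileId != fileId || prev.length != length;
    }
}
```

For example, if `/a` was copied with fileId 1001 and length 42, a later `/a` with the same length but fileId 2002 (a replacement) is still detected as changed, which a name-plus-size check would miss.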