[
https://issues.apache.org/jira/browse/HDFS-14355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800514#comment-16800514
]
Feilong He commented on HDFS-14355:
-----------------------------------
[~daryn], thanks so much for your valuable comments.
{quote}I quickly skimmed the patch. At a high level, a plugin design should not
leak details. This patch breaks the abstraction and litters the code with
references to "pmem" and explicit conditionals. The details of pmem should be
pushed down and hidden in the pmem specific classes.
{quote}
We tried to separate our pmem cache implementation from the existing HDFS code
by introducing pmem-specific classes such as PmemMappableBlockLoader,
PmemMappedBlock and PmemVolumeManager. FsDatasetCache did, however, still keep
some pmem references, for example pmemUsedBytesCount, which may be one of the
issues you pointed out implicitly. In our new patch, pmemUsedBytesCount and the
pmem reserve/release methods will be moved out of FsDatasetCache into the new
class PmemVolumeManager, so that such pmem details are no longer unnecessarily
exposed, as you suggested.
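To make the intent concrete, here is a minimal sketch of how the pmem byte
accounting could be encapsulated. The class name matches the patch discussion,
but the method signatures and fields are illustrative assumptions, not the
exact code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: pmem usage accounting kept inside a pmem-specific
// manager, so FsDatasetCache and other callers never see pmem details.
public class PmemVolumeManager {
  private final AtomicLong usedBytesCount = new AtomicLong(0);
  private final long maxBytes;

  public PmemVolumeManager(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  /** Try to reserve space for a block; return the new used total, or -1 on failure. */
  public long reserve(long bytes) {
    while (true) {
      long cur = usedBytesCount.get();
      long next = cur + bytes;
      if (next > maxBytes) {
        return -1; // not enough pmem capacity left
      }
      if (usedBytesCount.compareAndSet(cur, next)) {
        return next;
      }
    }
  }

  /** Release previously reserved space and return the new used total. */
  public long release(long bytes) {
    return usedBytesCount.addAndGet(-bytes);
  }

  public long getUsedBytes() {
    return usedBytesCount.get();
  }
}
```

The compare-and-set loop keeps reservation thread-safe without a lock, mirroring how DRAM cache accounting is done.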
{quote}Adding another reference (cacheFilePath) to {{ReplicaInfo}} is less than
desirable from a memory perspective. For those not using the feature, it's not
nearly as bad as adding one to inodes but may be an issue for very dense DNs.
More importantly, those that DO use the feature will incur a substantial memory
hit to store the paths. Why does a SCM block need to track a completely
different path? Isn't the SCM just another storage dir?
{quote}
The cacheFilePath in ReplicaInfo is used only for the pmem cache. Since the
user can configure multiple pmem volumes, such information is needed to locate
a cached block's file on pmem. We also took seriously your concern that adding
cacheFilePath to ReplicaInfo may cause issues for very dense DNs; thanks for
pointing that out. In the new patch we will remove cacheFilePath from
ReplicaInfo and instead let the new PmemVolumeManager keep that information
(precisely, just a pmem volume index, which is enough to infer the cache file
path in our new implementation). A TB-sized HDFS pmem cache can hold around
10k blocks at most (at the default 128 MB block size), and in our evaluation
maintaining which pmem volume each of those blocks is cached on does not
consume substantial DRAM. On the other hand, enabling the pmem cache relieves
the pressure of competing for DRAM.
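A rough sketch of the volume-index idea described above: store one small index
per cached block and rebuild the full path on demand. The class and method
names here are hypothetical illustrations, not the patch's actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: instead of a full cache-file path per ReplicaInfo,
// keep only a one-byte volume index per cached block and infer the path.
public class PmemCacheIndex {
  private final String[] pmemVolumes; // configured pmem mount directories
  private final Map<Long, Byte> blockToVolume = new ConcurrentHashMap<>();

  public PmemCacheIndex(String[] pmemVolumes) {
    this.pmemVolumes = pmemVolumes;
  }

  /** Remember which pmem volume a block was cached on. */
  public void recordCache(long blockId, byte volumeIndex) {
    blockToVolume.put(blockId, volumeIndex);
  }

  /** Infer the cache file path from the tiny per-block volume index. */
  public String getCachePath(long blockId) {
    Byte idx = blockToVolume.get(blockId);
    if (idx == null) {
      return null; // block is not cached on pmem
    }
    return pmemVolumes[idx] + "/" + blockId;
  }
}
```

At ~10k cached blocks, a long-to-byte map like this stays tiny compared to storing a full path string per replica.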
{quote}{{getCacheLoaderClaZZ}} should be {{getCacheLoaderClaSS}} and the
returned {{Class}} must be instantiated. It's wrong to compare the class simple
name against {{MemoryMappableBlockLoader}}. Consider if a user configures with
{{my.custom.MemoryMappableBlockLoader}} and it instantiates
{{org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.MemoryMappableBlockLoader}}
anyway because of the matching simple name? Catching
{{ReflectiveOperationException}} and just logging it masks a serious
misconfiguration or error condition.
{quote}
{{getCacheLoaderClaZZ}} has been renamed to {{getCacheLoaderClass}} as you
suggested. Your comment made us aware that comparing the class simple name is
inconsiderate; we have fixed this issue in our new patch. Since the cache
loader's constructor requires an {{FsDatasetCache}} instance as its parameter,
we still instantiate it there as you noted: {{getCacheLoaderClass}} comes from
DNConf, which should not depend on {{FsDatasetCache}} to return an
instantiated cache loader. As you pointed out, it is not reasonable to catch
{{ReflectiveOperationException}} and merely log it; we now throw a
RuntimeException inside the catch block to abort DataNode startup.
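The fail-fast reflective instantiation could look roughly like the sketch
below. The factory class and its signature are assumptions for illustration;
only the pattern (fully qualified class lookup, no simple-name comparison,
rethrow on ReflectiveOperationException) reflects the discussion:

```java
// Illustrative sketch: resolve the configured loader by its fully qualified
// class name, instantiate it reflectively, and fail fast on misconfiguration
// instead of swallowing the exception.
public class CacheLoaderFactory {
  public static Object newLoader(String className) {
    try {
      Class<?> clazz = Class.forName(className);
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // A bad loader class is a fatal misconfiguration: abort startup
      // with the original exception preserved as the cause.
      throw new RuntimeException("Cannot instantiate cache loader " + className, e);
    }
  }
}
```

Keeping the original exception as the cause is what avoids the "bland exception" debugging problem raised in the next quote.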
{quote}There's quite a few other exceptions being swallowed or just logged when
dubious "bad things" happen. Or discarding of exceptions and rethrowing generic
"something went wrong" exceptions w/o the original exception as a cause. That
complicates debugging.
{quote}
We are checking the code carefully and doing our best to fix the issues you
mentioned so that debugging is not complicated. When loading pmem volumes, we
only log an error if a single volume is unusable, since we think one bad pmem
volume in the user's configuration is tolerable. But if all pmem volumes are
invalid, an exception is thrown to abort DataNode startup.
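The volume-loading policy described above can be sketched as follows. The
class name, the usability check, and the logging call are simplified
assumptions; the point is the tolerate-some, fail-on-all behavior:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: log and skip an unusable pmem volume, but abort
// DataNode startup if no configured volume is usable at all.
public class PmemVolumeLoader {
  public static List<String> loadVolumes(String[] configured) throws IOException {
    List<String> usable = new ArrayList<>();
    for (String dir : configured) {
      File f = new File(dir);
      if (f.isDirectory() && f.canWrite()) {
        usable.add(dir);
      } else {
        // One bad volume is tolerable: log and continue.
        System.err.println("Skipping unusable pmem volume: " + dir);
      }
    }
    if (usable.isEmpty()) {
      throw new IOException("No usable pmem volume among " + configured.length
          + " configured volumes; aborting DataNode startup");
    }
    return usable;
  }
}
```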
{quote}New methods in {{FsDatasetUtil}} are not ok. Invalid arguments are not
"ignorable". That's how bugs creep in which are insanely hard to debug. Don't
check if offset is valid, just seek, let it throw if invalid. Don't ignore null
filenames, just let it throw. Catching {{Throwable}} is a bad practice; let
alone catching, discarding, and throwing a bland exception.
{quote}
Good suggestion! We noted the issues in {{FsDatasetUtil}}. As you pointed out,
the invalid offset and null filenames shouldn't be ignored. We are also
checking and fixing such issue in other piece of code.
{quote}Why throw a {{RuntimeException}} in "getOneLocation"? Is the intent to
take down the DN?
{quote}
Indeed, it is not reasonable to throw a RuntimeException in #getOneLocation.
It has been replaced with an IOException in our new patch, so the exception
just results in a normal cache failure and does not impact the DN's operation.
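A minimal sketch of the changed error contract, with hypothetical class and
argument names (the real #getOneLocation takes different parameters):

```java
import java.io.IOException;

// Illustrative sketch: an invalid cache location surfaces as an IOException,
// which the caching path treats as a per-block cache failure rather than a
// fatal error that would take down the DataNode.
public class PmemLocationResolver {
  public static String getOneLocation(String[] volumes, int index) throws IOException {
    if (volumes == null || volumes.length == 0) {
      throw new IOException("No pmem volume configured for caching");
    }
    if (index < 0 || index >= volumes.length) {
      throw new IOException("Invalid pmem volume index: " + index);
    }
    return volumes[index];
  }
}
```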
{quote}Why do a costly compute of the checksum for a mmap file? Isn't the
client going to compute the checksum anyway?
{quote}
As you may know, the current HDFS DRAM cache implementation verifies the
checksum when caching data to DRAM. Our pmem cache likewise verifies the
checksum to guarantee the reliability of cached data, which aligns with the
DRAM cache implementation.
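The verification step amounts to reading the mapped bytes once and comparing
a checksum before the block is published as cached. The sketch below is a
simplification: HDFS actually validates per-chunk CRC32C checksums from the
block's .meta file, while a single whole-file CRC32 here just illustrates the
mechanics of checksumming a MappedByteBuffer:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

// Illustrative sketch of checksum-on-cache: map the replica file read-only,
// stream the mapped bytes through a CRC, and return the value for comparison
// against the expected checksum before the block counts as cached.
public class MappedBlockVerifier {
  public static long checksumOf(String path) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r");
         FileChannel ch = raf.getChannel()) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      byte[] chunk = new byte[8192];
      CRC32 crc = new CRC32();
      while (buf.hasRemaining()) {
        int n = Math.min(chunk.length, buf.remaining());
        buf.get(chunk, 0, n);
        crc.update(chunk, 0, n);
      }
      return crc.getValue();
    }
  }
}
```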
Thank you again for your valuable comments. We are still refining the code
based on your comments. After that, a new patch will be uploaded.
> Implement HDFS cache on SCM by using pure java mapped byte buffer
> -----------------------------------------------------------------
>
> Key: HDFS-14355
> URL: https://issues.apache.org/jira/browse/HDFS-14355
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: caching, datanode
> Reporter: Feilong He
> Assignee: Feilong He
> Priority: Major
> Attachments: HDFS-14355.000.patch, HDFS-14355.001.patch,
> HDFS-14355.002.patch, HDFS-14355.003.patch
>
>
> This task is to implement the caching to persistent memory using pure
> {{java.nio.MappedByteBuffer}}, which could be useful in case native support
> isn't available or convenient in some environments or platforms.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)