[
https://issues.apache.org/jira/browse/HDFS-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmytro Molkov resolved HDFS-1065.
---------------------------------
Resolution: Duplicate
This issue is being worked on in HDFS-1481, so closing this one as a duplicate.
> Secondary Namenode fails to fetch image and edits files
> -------------------------------------------------------
>
> Key: HDFS-1065
> URL: https://issues.apache.org/jira/browse/HDFS-1065
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: Dmytro Molkov
>
> We recently started experiencing problems where the Secondary NameNode fails to
> fetch the image from the NameNode. The basic problem is described in
> HDFS-1024, but that JIRA only dealt with the possible data corruption. Since
> then we have reached a point where we cannot compact the fsimage at all,
> because the fetch fails 100% of the time.
> Here is what we have found out:
> The fetch still fails with the same exception as in HDFS-1024 (Jetty closes
> the connection before the file is fully sent).
> We suspect the underlying reason to be excessive garbage collection on the
> NameNode (1/5 of all time is spent in garbage collection). The reason for
> that, in turn, might be the bug fixed by HADOOP-6577: we have a lot of large
> RPC requests, which means we allocate and free a lot of memory all the time.
> Because of GC, the speed of the transfer drops to 700Kb/s.
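> (As a side note, a simple way to confirm this is to enable standard HotSpot
> GC logging on the NameNode JVM, e.g. -verbose:gc -XX:+PrintGCDetails
> -Xloggc:<file>; the long pauses should then line up with the moments when
> Jetty drops the transfer. These are generic JVM flags, not Hadoop-specific
> settings.)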
> Having said all of that, the current mechanism of fetching the image is still
> potentially flawed. When dealing with large images, the NameNode is under the
> stress of sending multi-gigabyte files over the wire to the client while
> still serving requests.
> This JIRA is to discuss possible ways of decoupling the NameNode from the
> image fetching done by the Secondary NameNode.
> One thought we had was to fetch the image using SCP rather than an HTTP
> download from the NameNode. This way the NameNode would be under less
> pressure; on the other hand, it would introduce new components that are not
> exactly under Hadoop's control (the ssh client and server). A rough sketch of
> the copy step follows below.
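> A minimal sketch of what the copy step might look like on the Secondary
> NameNode side, assuming passwordless ssh is configured between the two hosts.
> The class name, host name and paths are made-up placeholders, not existing
> Hadoop code:
> 
>     import java.io.IOException;
> 
>     // Hypothetical sketch only: fetch the fsimage via scp instead of HTTP.
>     public class ScpImageFetch {
>       public static void fetchImage() throws IOException, InterruptedException {
>         ProcessBuilder pb = new ProcessBuilder(
>             "scp",
>             "namenode-host:/data/dfs/name/current/fsimage", // placeholder source
>             "/data/dfs/namesecondary/current/fsimage");     // placeholder destination
>         pb.redirectErrorStream(true); // merge stderr into stdout
>         Process p = pb.start();
>         int rc = p.waitFor();         // scp produces little output when not on a
>                                       // tty, so not draining the stream is fine here
>         if (rc != 0) {
>           throw new IOException("scp of fsimage failed, exit code " + rc);
>         }
>       }
>     }
> 
> Shelling out like this keeps the NameNode's HTTP server out of the transfer
> entirely, at the cost of the external ssh dependency mentioned above.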
> To deal with possible data corruption during the SCP copy, we would also want
> to extend CheckpointSignature to carry a checksum of the file, so it can be
> verified on the client side.
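> For that verification step, here is a minimal sketch of the client-side
> check, assuming the (hypothetical) new CheckpointSignature field arrives as a
> raw MD5 digest; the class and method names are made up for illustration:
> 
>     import java.io.FileInputStream;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.security.MessageDigest;
>     import java.security.NoSuchAlgorithmException;
>     import java.util.Arrays;
> 
>     // Hypothetical sketch only: verify the copied image against the checksum
>     // carried in the extended CheckpointSignature.
>     public class ImageChecksum {
>       public static void verify(String path, byte[] expectedMd5)
>           throws IOException, NoSuchAlgorithmException {
>         MessageDigest md = MessageDigest.getInstance("MD5");
>         InputStream in = new FileInputStream(path);
>         try {
>           byte[] buf = new byte[64 * 1024];
>           int n;
>           while ((n = in.read(buf)) > 0) {
>             md.update(buf, 0, n); // hash the file as we stream it
>           }
>         } finally {
>           in.close();
>         }
>         if (!Arrays.equals(md.digest(), expectedMd5)) {
>           throw new IOException("fsimage checksum mismatch at " + path);
>         }
>       }
>     }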
> Please let me know what you think.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.