[
https://issues.apache.org/jira/browse/HADOOP-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Shvachko updated HADOOP-3069:
----------------------------------------
Attachment: TruncatePrimaryImageBug.patch
I am splitting my patch attached to HADOOP-2585 into 2 parts because people
want this bug fixed in 0.16 and 0.17 but don't want the import checkpoint
feature in these versions.
The patch also contains a test that fails on the old version and passes on
the new one.
> A failure on SecondaryNameNode truncates the primary NameNode image.
> --------------------------------------------------------------------
>
> Key: HADOOP-3069
> URL: https://issues.apache.org/jira/browse/HADOOP-3069
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.13.0
> Reporter: Konstantin Shvachko
> Assignee: Konstantin Shvachko
> Priority: Blocker
> Fix For: 0.17.0
>
> Attachments: TruncatePrimaryImageBug.patch
>
>
> When the primary name-node pulls the new image from the secondary and the
> transfer fails for some reason, the primary considers the new image,
> which may not be completely transferred yet or may not be transferred at all,
> as a valid one and will roll it into the new file system image, which will
> be either corrupted or empty.
> The problem here is that the error message from the secondary node does not
> reach the primary.
> This happens because TransferFsImage.getFileServer() closes the connection
> output stream in its finally block. The secondary later sends the error
> reply, which cannot be received by the primary and causes the following
> exception on the secondary:
> {code}
> 08/03/21 12:16:52 ERROR NameNode.Secondary: java.io.FileNotFoundException: \hadoop-data\hdfs\namesecondary\destimage.tmp (The system cannot find the file specified)
> 08/03/21 12:16:56 WARN /: /getimage?getimage=1: java.lang.IllegalStateException: Committed
> at org.mortbay.jetty.servlet.ServletHttpResponse.resetBuffer(ServletHttpResponse.java:212)
> at org.mortbay.jetty.servlet.ServletHttpResponse.sendError(ServletHttpResponse.java:375)
> at org.apache.hadoop.dfs.SecondaryNameNode$GetImageServlet.doGet(SecondaryNameNode.java:485)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
> at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
> at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
> at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
> at org.mortbay.http.HttpServer.service(HttpServer.java:954)
> at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
> at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
> at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
> at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
> at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
> at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
> {code}
> But the exception does not affect the behavior of the primary node. Since the
> stream is closed, the primary thinks the file transfer finished successfully
> and acts accordingly.
> There are 2 bugs that need to be fixed here.
> # The error message should be delivered to the primary, and the primary
> should not corrupt its image in case of an error.
> # The doGet() method of both HttpServlet-s should catch not only
> IOException-s but any exception.
> If we miss an NPE or SecurityException the main image will be truncated.
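
To illustrate the two fixes listed above, here is a minimal sketch. It is not the code in the attached patch. Response, serveImage(), receiveImage() and the length check are hypothetical stand-ins invented for this example and do not correspond to Hadoop or servlet classes. The idea is that the sender catches any Throwable and records an error instead of silently closing the stream, and the receiver rejects a transfer that reports an error or delivers fewer bytes than announced.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;

public class ImageTransferSketch {

    /** Hypothetical stand-in for an HTTP response; not a Hadoop or servlet class. */
    static class Response {
        final ByteArrayOutputStream body = new ByteArrayOutputStream();
        long declaredLength = -1; // length announced before any bytes are sent
        String error;             // stays null unless the sender reports a failure
    }

    /** Sender side: catch any Throwable (not only IOException) and report it. */
    static void serveImage(Callable<InputStream> source, long length, Response resp) {
        resp.declaredLength = length;
        try (InputStream in = source.call()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                resp.body.write(buf, 0, n);
            }
        } catch (Throwable t) {
            // The response body is deliberately not closed here, so the
            // error can still be delivered to the receiver.
            resp.error = "GetImage failed: " + t;
        }
    }

    /** Receiver side: accept the image only if it arrived complete and error-free. */
    static byte[] receiveImage(Response resp) throws IOException {
        if (resp.error != null) {
            throw new IOException(resp.error);
        }
        byte[] data = resp.body.toByteArray();
        if (data.length != resp.declaredLength) {
            throw new IOException("Incomplete transfer: expected "
                + resp.declaredLength + " bytes, got " + data.length);
        }
        return data;
    }

    /** Prints whether the receiver accepts or rejects the given response. */
    static void report(String label, Response resp) {
        try {
            System.out.println(label + ": accepted " + receiveImage(resp).length + " bytes");
        } catch (IOException e) {
            System.out.println(label + ": rejected (" + e.getMessage() + ")");
        }
    }

    public static void main(String[] args) {
        byte[] image = {1, 2, 3, 4};

        Response good = new Response();
        serveImage(() -> new ByteArrayInputStream(image), image.length, good);
        report("complete transfer", good);

        // Simulated unchecked failure on the sender, such as the NPE case above.
        Response failed = new Response();
        serveImage(() -> { throw new NullPointerException("simulated"); }, image.length, failed);
        report("sender failure", failed);

        // Simulated stream that ended early without any error being reported.
        Response truncated = new Response();
        truncated.declaredLength = image.length;
        truncated.body.write(image, 0, 2);
        report("truncated transfer", truncated);
    }
}
```

In this sketch the receiver would keep its existing image whenever receiveImage() throws, so neither a sender-side exception nor a short read can be rolled into the file system image.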
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.