[ 
https://issues.apache.org/jira/browse/HDFS-14075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695357#comment-16695357
 ] 

Ayush Saxena commented on HDFS-14075:
-------------------------------------

{quote}Do you know of any other part of the code that does this exception 
creation thing?
{quote}
FSEditLog ---> logSync(..). I have seen it only there as of now; I will try to 
find any other occurrences.

[~trjianjianjiao] Thanks for taking a look.
{quote}state in incorrect IN_SEGMENT.
{quote}
Not in IN_SEGMENT but in *BETWEEN_LOG_SEGMENTS*. The endLogSegment() call 
changed the state.
{quote}I prefer to return false when editLogStream is null, meaning that it is 
not ready to sync.
{quote}
If we do so, firstly the state will be BETWEEN_LOG_SEGMENTS (write operations 
are accepted and queued in this state, on the assumption that they can be 
pushed as soon as rollEdits releases the lock; that immediate push after the 
exception is what got us the NPE, since the stream was null). The rollEdits 
will be triggered again and again, but the precondition check in 
endLogSegment() will always throw us out. If we somehow change the state too 
and allow a retry of endLogSegment(), then when endLogSegment() is called its 
logSync() will actually lead to terminating the NN.
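
To make the argument above concrete, here is a minimal, hypothetical sketch 
(not the actual FSEditLog code; class, field, and method names are simplified 
stand-ins) of the state machine being discussed: after the failed 
startLogSegment() the log is stuck in BETWEEN_LOG_SEGMENTS with a null stream, 
so returning false from the transaction path only queues work that can never 
be pushed, and every retried roll is rejected by the endLogSegment() 
precondition.

```java
// Hypothetical sketch of the edit-log state machine discussed above.
// Not actual HDFS code; names are illustrative only.
public class EditLogStateSketch {
    enum State { BETWEEN_LOG_SEGMENTS, IN_SEGMENT }

    State state = State.BETWEEN_LOG_SEGMENTS; // left here after the failed roll
    Object editLogStream = null;              // stream was never opened

    // The proposed change: report "not ready to sync" instead of hitting
    // an NPE when the stream is null. Queued transactions stay queued.
    boolean doEditTransaction() {
        if (editLogStream == null) {
            return false; // not ready to sync
        }
        return true;
    }

    // endLogSegment()'s precondition requires IN_SEGMENT, so every retried
    // rollEdits is thrown out: the state never returns to IN_SEGMENT.
    void endLogSegment() {
        if (state != State.IN_SEGMENT) {
            throw new IllegalStateException("Bad state: " + state);
        }
        state = State.BETWEEN_LOG_SEGMENTS;
    }

    public static void main(String[] args) {
        EditLogStateSketch log = new EditLogStateSketch();
        // Queued write operations cannot be pushed: stream is null.
        System.out.println("sync ready: " + log.doEditTransaction());
        // A retried rollEdits fails the endLogSegment() precondition.
        try {
            log.endLogSegment();
        } catch (IllegalStateException e) {
            System.out.println("rollEdits retry rejected: " + e.getMessage());
        }
    }
}
```

So with a plain "return false" the NameNode neither makes progress nor fails 
fast, which is why the discussion leans toward terminating instead.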
{quote}or should it continues without syncing until some time the JNs are back?
{quote}
Mostly this happening is fatal. The admin configured it that way: say X 
redundant journals should always be available. If that is not the case, a 
basic requirement got violated, so we should terminate.

If we think of retrying, we can't allow transactions either, because 
editStream() is null. We would be stuck in a loop, only retrying and denying 
operations. (And what if the JNs don't come up automatically?) Usually such JN 
failures require admin intervention; coming up automatically doesn't seem to 
happen. The admin would usually need to handle all such failures and restart 
the cluster.

> NPE while Edit Logging
> ----------------------
>
>                 Key: HDFS-14075
>                 URL: https://issues.apache.org/jira/browse/HDFS-14075
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Critical
>         Attachments: HDFS-14075-01.patch, HDFS-14075-02.patch, 
> HDFS-14075-03.patch, HDFS-14075-04.patch, HDFS-14075-04.patch, 
> HDFS-14075-04.patch, HDFS-14075-05.patch, HDFS-14075-06.patch
>
>
> {noformat}
> 2018-11-10 18:59:38,427 FATAL 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Exception while edit 
> logging: null
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.doEditTransaction(FSEditLog.java:481)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$Edit.logEdit(FSEditLogAsync.java:288)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:232)
>  at java.lang.Thread.run(Thread.java:745)
> 2018-11-10 18:59:38,532 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Exception while edit logging: null
> 2018-11-10 18:59:38,552 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> SHUTDOWN_MSG:
> {noformat}
> Before NPE Received the following Exception
> {noformat}
> INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 65110, call 
> Call#23241 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 
> XXXXXXXX
> java.io.IOException: Unable to start log segment 7964819: too few journals 
> successfully started.
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1385)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegmentAndWriteHeaderTxn(FSEditLog.java:1395)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1319)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1352)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4669)
>       at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:1293)
>       at 
> org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:146)
>       at 
> org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12974)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>       at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2684)
> Caused by: java.io.IOException: starting log segment 7964819 failed for too 
> many journals
>       at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:412)
>       at 
> org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:207)
>       at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1383)
>       ... 15 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
