[jira] [Created] (HDFS-14646) Standby NameNode should terminate the FsImage put process as soon as possible if the peer NN is not in the appropriate state to receive an image.

2019-07-14 Thread Xudong Cao (JIRA)
Xudong Cao created HDFS-14646:
-

 Summary: Standby NameNode should terminate the FsImage put process 
as soon as possible if the peer NN is not in the appropriate state to receive 
an image.
 Key: HDFS-14646
 URL: https://issues.apache.org/jira/browse/HDFS-14646
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 3.1.2
Reporter: Xudong Cao
Assignee: Xudong Cao
 Attachments: blockWriiting.png, get1.png, get2.png, largeSendQ.png

*Problem Description:*
In a multi-NameNode scenario, when an SNN uploads an FsImage, it puts the 
image to all other NNs (whether the peer NN is the ANN or not). Even if the 
peer NN immediately replies with an error (such as 
TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, 
TransferResult.OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN does not 
terminate the put process immediately; it puts the complete FsImage to the 
peer NN and only reads the peer NN's reply after the put has finished.

In a relatively large HDFS cluster, the FsImage can often reach about 30 GB. 
In this case, such an invalid put causes two problems:
1. It wastes time and bandwidth.
2. Since the ImageServlet of the peer NN no longer reads the FsImage from the 
socket, the socket Send-Q of the local SNN grows very large, the ImageUpload 
thread blocks writing to the socket for a long time, and eventually the local 
StandbyCheckpointer thread is often blocked for several hours.
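
In outline, the current upload follows the pattern below (a simplified sketch, 
not the exact TransferFsImage code; copyFsImage is a stand-in for the real 
stream-copy helper):

{code:java}
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Simplified outline of the current behavior: the whole image is written
// before the response is ever read.
static void putImage(URL putImageUrl, File imageFile) throws IOException {
  HttpURLConnection conn = (HttpURLConnection) putImageUrl.openConnection();
  conn.setRequestMethod("PUT");
  conn.setDoOutput(true);
  try (OutputStream out = conn.getOutputStream()) {
    copyFsImage(imageFile, out); // blocks here once the peer stops reading
  }
  int code = conn.getResponseCode(); // an early error reply is seen only now
}
{code}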

*An example is as follows:*
In the figures below, the local NN 100.76.3.234 is an SNN, the peer NN 
100.76.3.170 is another SNN, and 8080 is the NN HTTP port. When the local SNN 
starts to put the FsImage, 170 immediately replies with a 
NOT_ACTIVE_NAMENODE_FAILURE error. The local SNN should terminate the put 
immediately, but in fact it has to wait until the image has been completely 
put to the peer NN before it can read the response.
1. At this time, since the ImageServlet of the peer NN no longer reads the 
FsImage from the socket, the socket Send-Q of the local SNN is very large:

!largeSendQ.png!

2. Moreover, the local SNN's ImageUpload thread is blocked writing to the 
socket for a long time:

!blockWriiting.png!

3. Eventually, the StandbyCheckpointer thread of the local SNN waits for the 
result of the ImageUpload thread, blocking in Future.get(), and the blocking 
time can be as long as several hours:

!get1.png!

!get2.png!

 

*Solution:*
When the local SNN is ready to put an FsImage to the peer NN, it should first 
test whether it really needs to do the put at this time. The test process is:
1. Establish an HTTP connection with the peer NN, send a put request, and then 
immediately read the response (this is the key point). If the peer NN replies 
with any of the state errors (TransferResult.AUTHENTICATION_FAILURE, 
TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, 
TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN should skip the put 
entirely.
2. If the peer NN is truly the ANN and can receive the FsImage normally, it 
will reply to the local SNN with HTTP response 410 
(HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE). At 
this point, the local SNN can really begin to put the image. A minimal sketch 
of the probe is below.
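
A minimal sketch of the probe, assuming plain java.net.HttpURLConnection; the 
helper name and the empty-body trick are illustrative, not necessarily the 
actual patch:

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: returns true only if the peer NN passed the state
// checks (a ready ANN then fails reading our empty body, replying with
// 410/UNEXPECTED_FAILURE); any state error means the put should be skipped.
static boolean peerReadyToReceiveImage(URL putImageUrl) throws IOException {
  HttpURLConnection conn = (HttpURLConnection) putImageUrl.openConnection();
  try {
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    conn.setFixedLengthStreamingMode(0); // send no image bytes, just probe
    conn.getOutputStream().close();      // complete the empty request
    int code = conn.getResponseCode();   // read the reply immediately
    // 410 == HttpServletResponse.SC_GONE == TransferResult.UNEXPECTED_FAILURE
    return code == 410;
  } finally {
    conn.disconnect();
  }
}
{code}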

*Note:*
This problem needs to be reproduced on a large cluster (the FsImage in our 
cluster is about 30 GB), so a unit test is difficult to write. On our real 
cluster, after the modification, the problem has been solved: there is no 
longer a large backlog in Send-Q.






Re: complete MiniDFSCluster.shutdown()

2019-07-14 Thread Mikhail Khludnev
Hello, Steve.
I tried to sweep up all the remnants of MiniDFSCluster in
https://jira.apache.org/jira/browse/SOLR-13630.
Once again, I need this for tests, just to make sure that everything stops
before starting a new test; that's not absolutely necessary for real-life
usage. Here's the list of what keeps running after shutdown:
 - o.a.h.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.addReplicaThreadPool
   is supposed to be shut down via a JVM shutdown hook
 - o.a.h.fs.FileSystem.Statistics.STATS_DATA_CLEANER never stops
 - o.a.h.hdfs.DFSClient.getLeaseRenewer().interruptAndJoin() is never called
 - o.a.h.hdfs.ClientContext.getPeerCache().close() isn't called either
 - o.a.h.ipc.ProtobufRpcEngine.CLIENTS aren't closed
 - o.a.h.ipc.Client.clientExcecutorFactory keeps
   Client.ClientExecutorServiceFactory.clientExecutor alive. It's expected to
   be stopped via ref counting, but when I checked it with a debugger after
   MiniDFSCluster.shutdown(), the ref counter was beyond a hundred. Seems like
   there's no way around it.
I think it's far from being fully shut down.
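
For reference, a minimal sketch of the kind of post-shutdown leak check such
tests can run (illustrative only, pure JDK):

    import java.util.Map;

    // After MiniDFSCluster.shutdown(), report threads that are still alive.
    static void dumpLeftoverThreads() {
      for (Map.Entry<Thread, StackTraceElement[]> e
          : Thread.getAllStackTraces().entrySet()) {
        Thread t = e.getKey();
        if (t.isAlive() && !t.isDaemon()) {
          System.err.println("Leaked thread: " + t.getName());
        }
      }
    }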

On Wed, Jul 3, 2019 at 5:27 PM Steve Loughran 
wrote:

> if it's not a UGI token renewer, it's probably a bug
>
> On Wed, Jul 3, 2019 at 4:08 PM Mikhail Khludnev  wrote:
>
> > Hello,
> > There are integration tests between Solr and HDFS. It turns out that even
> > calling MiniDFSCluster.shutdown() leaves some HDFS threads running. That's
> > probably ok, but it upsets Solr's tests, since they assert that nothing is
> > still running after a test is over, to guard against leaks.
> > Is there any total-shutdown helper for tests? Does it make sense to raise
> > a jira for it?
> >
> > Thank you.
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>


-- 
Sincerely yours
Mikhail Khludnev


[jira] [Created] (HDDS-1796) SCMClientProtocolServer#getContainerWithPipeline should check for admin access

2019-07-14 Thread Mukul Kumar Singh (JIRA)
Mukul Kumar Singh created HDDS-1796:
---

 Summary: SCMClientProtocolServer#getContainerWithPipeline should 
check for admin access
 Key: HDDS-1796
 URL: https://issues.apache.org/jira/browse/HDDS-1796
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: SCM
Affects Versions: 0.4.0
Reporter: Mukul Kumar Singh


SCMClientProtocolServer#getContainerWithPipeline currently calls 
checkAdminAccess with a null user, so the admin check is not applied to the 
actual caller.
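
A hedged sketch of the fix direction; Server.getRemoteUser() is Hadoop's real 
IPC helper, while the checkAdminAccess(String) signature is assumed from the 
issue text:

{code:java}
import org.apache.hadoop.ipc.Server;
import org.apache.hadoop.security.UserGroupInformation;

// Resolve the RPC caller instead of passing null to the admin check.
UserGroupInformation caller = Server.getRemoteUser();
checkAdminAccess(caller == null ? null : caller.getUserName());
{code}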






[jira] [Created] (HDFS-14645) ViewFileSystem should close the child FileSystems in close()

2019-07-14 Thread Jihyun Cho (JIRA)
Jihyun Cho created HDFS-14645:
-

 Summary: ViewFileSystem should close the child FileSystems in 
close()
 Key: HDFS-14645
 URL: https://issues.apache.org/jira/browse/HDFS-14645
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 2.7.8, 3.3.0
Reporter: Jihyun Cho
 Attachments: HDFS-14645.001.patch

In the current implementation, {{ViewFileSystem}} uses the superclass's 
{{close}}, which removes the entry from {{FileSystem.CACHE}} without closing 
the child FileSystems. To close properly, a FileSystem should also close its 
child FileSystems when it is closed.
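
A minimal sketch of the idea (not necessarily the attached patch; the 
{{childFileSystems()}} accessor below is illustrative):

{code:java}
@Override
public void close() throws IOException {
  super.close(); // evicts this ViewFileSystem from FileSystem.CACHE
  for (FileSystem child : childFileSystems()) { // hypothetical accessor
    child.close();
  }
}
{code}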







Apache Hadoop qbt Report: branch2+JDK7 on Linux/x86

2019-07-14 Thread Apache Jenkins Server
For more details, see 
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/

No changes




-1 overall


The following subsystems voted -1:
asflicense compile findbugs hadolint mvninstall mvnsite pathlen unit xml


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck shelldocs whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

XML :

   Parsing Error(s): 
   
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/conf/empty-configuration.xml
 
   hadoop-tools/hadoop-azure/src/config/checkstyle-suppressions.xml 
   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/public/crossdomain.xml 
   
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml
 

FindBugs :

   module:hadoop-common-project/hadoop-common 
   Class org.apache.hadoop.fs.GlobalStorageStatistics defines non-transient 
non-serializable instance field map (GlobalStorageStatistics.java) 

FindBugs :

   
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice-hbase/hadoop-yarn-server-timelineservice-hbase-client
 
   Boxed value is unboxed and then immediately reboxed in 
org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.readResultsWithTimestamps(Result,
 byte[], byte[], KeyConverter, ValueConverter, boolean) At 
ColumnRWHelper.java:[line 335] 

Failed junit tests :

   hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys 
   hadoop.hdfs.web.TestWebHdfsTimeouts 
   hadoop.hdfs.server.datanode.TestDirectoryScanner 
   hadoop.hdfs.server.balancer.TestBalancerRPCDelay 
   hadoop.registry.secure.TestSecureLogins 
   hadoop.yarn.server.timelineservice.security.TestTimelineAuthFilterForV2 
  

   mvninstall:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-mvninstall-root.txt
  [820K]

   compile:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.7.0_95.txt
  [60K]

   cc:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.7.0_95.txt
  [60K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.7.0_95.txt
  [60K]

   compile:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.8.0_212.txt
  [60K]

   cc:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.8.0_212.txt
  [60K]

   javac:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-compile-root-jdk1.8.0_212.txt
  [60K]

   checkstyle:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out//testptch/patchprocess/maven-patch-checkstyle-root.txt
  []

   hadolint:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/diff-patch-hadolint.txt
  [4.0K]

   mvnsite:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/patch-mvnsite-root.txt
  [76K]

   pathlen:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/pathlen.txt
  [12K]

   pylint:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/diff-patch-pylint.txt
  [24K]

   shellcheck:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/diff-patch-shellcheck.txt
  [72K]

   shelldocs:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/diff-patch-shelldocs.txt
  [8.0K]

   whitespace:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/whitespace-eol.txt
  [12M]
   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/whitespace-tabs.txt
  [1.2M]

   xml:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/xml.txt
  [12K]

   findbugs:

   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/branch-findbugs-hadoop-common-project_hadoop-common-warnings.html
  [8.0K]
   
https://builds.apache.org/job/hadoop-qbt-branch2-java7-linux-x86/382/artifact/out/branch-findbugs-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-app.txt
  [8.0K]