[ https://issues.apache.org/jira/browse/HADOOP-17975?focusedWorklogId=682322&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-682322 ]

ASF GitHub Bot logged work on HADOOP-17975:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Nov/21 23:36
            Start Date: 16/Nov/21 23:36
    Worklog Time Spent: 10m 
      Work Description: fapifta commented on a change in pull request #3658:
URL: https://github.com/apache/hadoop/pull/3658#discussion_r750770358



##########
File path: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
##########
@@ -1679,11 +1679,13 @@ private Connection getConnection(ConnectionId remoteId,
     private final boolean doPing; //do we need to send ping message
     private final int pingInterval; // how often sends ping to the server in msecs
     private String saslQop; // here for testing
+    private final AtomicBoolean fallbackToSimpleAuth;
     private final Configuration conf; // used to get the expected kerberos principal name
     
     ConnectionId(InetSocketAddress address, Class<?> protocol, 
                  UserGroupInformation ticket, int rpcTimeout,
-                 RetryPolicy connectionRetryPolicy, Configuration conf) {
+                 RetryPolicy connectionRetryPolicy, Configuration conf,
+                 AtomicBoolean fallbackToSimpleAuth) {

Review comment:
   Yes, they are not equal, but let me reiterate... they should not be equal.
   Three things to note here:
   - every DfsClient creates a new AtomicBoolean initialized to false
   - these AtomicBooleans are set to true by the ipc Client's Connection's
   setupIOStreams method, which can be called only after we have a connection
   based on the ConnectionId, which we would then possibly change in
   setupIOStreams.
   - in the ipc Client layer, a connection to a NameNode is no different from
   a connection to a DataNode, as it is a connection to a host-port
   combination with a given network-related setup.
   
   Comparing the boolean values instead of the AtomicBoolean references would
   bring us back to the initial problem. We would have a false value when we
   first try to get the connection based on the ConnectionId, and then, after
   we get the connection, we would change the value to true. That is even
   worse: in this case, if fallback is on, we would be guaranteed to have two
   connections even for a single DfsClient, because when we get the connection
   the second time, the value would be true, so we would create another
   connection for the new ConnectionId, and we would possibly leak the
   previous one, as that client would never again create a ConnectionId with a
   false value. On the other hand, a second DfsClient would get the connection
   with the ID containing the false value, and we would have the problem we
   are trying to fix (the unset AtomicBoolean prevents fallback for the second
   DfsClient). The sketch below shows the reference-based comparison.
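   
   To make the reference-vs-value distinction concrete, here is a minimal
   sketch (hypothetical class and field names modeled on the diff above, not
   the actual ConnectionId code) of a cache key that holds the AtomicBoolean
   by identity, so the key stays stable even after the value is flipped:
   
```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of a connection-cache key that compares the AtomicBoolean by
// reference, not by its current boolean value.
final class ConnectionKeySketch {
  private final String hostPort; // stands in for the address/protocol fields
  private final AtomicBoolean fallbackToSimpleAuth;

  ConnectionKeySketch(String hostPort, AtomicBoolean fallbackToSimpleAuth) {
    this.hostPort = hostPort;
    this.fallbackToSimpleAuth = fallbackToSimpleAuth;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ConnectionKeySketch)) {
      return false;
    }
    ConnectionKeySketch other = (ConnectionKeySketch) o;
    // Identity comparison: flipping the boolean inside does not change the
    // key, so the cached connection keeps being found. Comparing
    // fallbackToSimpleAuth.get() values here instead would make the key
    // mutate under the cache, causing the extra connection and the possible
    // leak described above.
    return hostPort.equals(other.hostPort)
        && fallbackToSimpleAuth == other.fallbackToSimpleAuth;
  }

  @Override
  public int hashCode() {
    // identityHashCode keeps hashCode consistent with the identity-based
    // equals and stable across value flips.
    return 31 * hostPort.hashCode()
        + System.identityHashCode(fallbackToSimpleAuth);
  }
}
```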
   
   
   
   About the number of connections: we discussed this in the previous PR,
   though at that point the NN was out of the picture; but yes, this change
   affects the NN connections as well if there are multiple active DfsClients.
   
   > Your suggestion to add the client's ID to the ConnectionID would help, but there is a tradeoff there, as in this case client B has to set up a new socket connection to DataNode D.
   > That would cause two things:
   >
   > 1. is an overhead of creating a new connection instead of reusing what we already have.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 682322)
    Time Spent: 7.5h  (was: 7h 20m)

> Fallback to simple auth does not work for a secondary DistributedFileSystem instance
> -------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17975
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> The following code snippet demonstrates what is necessary to cause a failure
> when connecting, with fallback to SIMPLE auth allowed, from a secure cluster
> to a non-secure cluster.
> {code:java}
>     Configuration conf = new Configuration();
>     conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
>     URI fsUri = new URI("hdfs://<nn_uri>");
>     conf.setBoolean("fs.hdfs.impl.disable.cache", true);
>     FileSystem fs = FileSystem.get(fsUri, conf);
>     FSDataInputStream src = fs.open(new Path("/path/to/a/file"));
>     FileOutputStream dst = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src, dst, 1024);
>     // The issue happens even if we re-enable cache at this point
>     //conf.setBoolean("fs.hdfs.impl.disable.cache", false);
>     // The issue does not happen when we close the first FileSystem object
>     // before creating the second.
>     //fs.close();
>     FileSystem fs2 = FileSystem.get(fsUri, conf);
>     FSDataInputStream src2 = fs2.open(new Path("/path/to/a/file"));
>     FileOutputStream dst2 = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src2, dst2, 1024);
> {code}
> The problem is that when the DfsClient is created, it creates an instance of
> AtomicBoolean, which is propagated down into the IPC layer, where the
> Client.Connection instance sets its value in setupIOStreams. This connection
> object is cached and re-used to multiplex requests against the same DataNode.
> When a second DfsClient is created, the AtomicBoolean reference in that
> client is a new AtomicBoolean, but the Client.Connection instance is the
> same, and as it already has a socket open to the DataNode, it returns
> immediately from setupIOStreams, leaving the fallbackToSimpleAuth
> AtomicBoolean false, as it was created in the DfsClient.
> This AtomicBoolean, on the other hand, controls how the SaslDataTransferClient
> handles the connection at the level above, and with this value left at the
> default false, the SaslDataTransferClient of the second DfsClient will not
> fall back to SIMPLE authentication but will try to send a SASL handshake when
> connecting to the DataNode.
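>
> The following minimal sketch models this caching behaviour (all names here
> are hypothetical stand-ins, not the real ipc.Client code) and shows how the
> second client's AtomicBoolean is never set:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> public class FallbackSharingSketch {
>   // Toy connection cache keyed by address only, mirroring how the ipc
>   // Client caches and re-uses Connection objects.
>   static final Map<String, Connection> CACHE = new HashMap<>();
> 
>   static class Connection {
>     boolean streamsSetUp = false;
> 
>     // Stand-in for Client.Connection#setupIOStreams: the fallback flag
>     // is only set when the socket is first established.
>     void setupIOStreams(AtomicBoolean fallbackToSimpleAuth) {
>       if (streamsSetUp) {
>         return; // socket already open: returns immediately, so the
>                 // caller's AtomicBoolean is never touched
>       }
>       streamsSetUp = true;
>       fallbackToSimpleAuth.set(true); // server negotiated SIMPLE auth
>     }
>   }
> 
>   static Connection getConnection(String addr, AtomicBoolean fallback) {
>     Connection c = CACHE.computeIfAbsent(addr, a -> new Connection());
>     c.setupIOStreams(fallback);
>     return c;
>   }
> 
>   public static void main(String[] args) {
>     AtomicBoolean fallback1 = new AtomicBoolean(false); // first DfsClient
>     AtomicBoolean fallback2 = new AtomicBoolean(false); // second DfsClient
>     getConnection("dn1:9866", fallback1);
>     getConnection("dn1:9866", fallback2); // cached connection is reused
>     System.out.println(fallback1.get()); // true  -> fallback works
>     System.out.println(fallback2.get()); // false -> SASL handshake is sent
>   }
> }
> {code}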
>  
> Access to the FileSystem via the second DfsClient fails with exceptions like
> the following one, and the read then fails with a BlockMissingException as
> shown below:
> {code}
> WARN hdfs.DFSClient: Failed to connect to /<dn_ip>:<dn_port> for file <file> for block BP-531773307-<nn_ip>-1634685133591:blk_1073741826_1002, add to deadNodes and continue.
> java.io.EOFException: Unexpected EOF while trying to read response from server
>       at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:215)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:455)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>       at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:648)
>       at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2980)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
>       at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:658)
>       at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:589)
>       at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:771)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
>       at java.io.DataInputStream.read(DataInputStream.java:100)
>       at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
>       at DfsClientTest3.main(DfsClientTest3.java:30)
> {code}
> {code}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-813026743-<nn_ip>-1495248833293:blk_1139767762_66027405 file=/path/to/file
> {code}
>  
> The DataNode in the meantime logs the following:
> {code}
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: <dn_host>:<dn_port>:DataXceiver error processing unknown operation  src: /<client_ip>:<client_port> dst: /<dn_ip>:<dn_port>
> java.io.IOException: Version Mismatch (Expected: 28, Received: -8531 )
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:70)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:222)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This happens only if the second client connects to the same DataNode as the
> first one did, so it might seem intermittent when the clients are reading
> different files, but it always happens if the two clients read the same file
> with replication factor 1.
> We ran into this issue while running the HBase ExportSnapshot tool to move a
> snapshot from a non-secure to a secure cluster. The issue is loosely related
> to HBASE-12819, HBASE-20433, and similar problems; I am linking these so
> that the HBase team will see how this is relevant to them.


