[jira] [Work logged] (HADOOP-17975) Fallback to simple auth does not work for a secondary DistributedFileSystem instance

ASF GitHub Bot (Jira) Fri, 05 Nov 2021 19:33:17 -0700


     [ https://issues.apache.org/jira/browse/HADOOP-17975?focusedWorklogId=678004&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-678004 ]


ASF GitHub Bot logged work on HADOOP-17975:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Nov/21 02:32
            Start Date: 06/Nov/21 02:32
    Worklog Time Spent: 10m 
      Work Description: fapifta commented on pull request #3579:
URL: https://github.com/apache/hadoop/pull/3579#issuecomment-962377604


   @symious I would say yes, and no :)
   
   We have three levels in the communication: the DfsClient, which connects to 
HDFS and is one of many users of the SASL protocol layer (SaslRpcClient), which 
in turn uses the basic network communication layer, the ipc client. (At least 
as I understand the system so far, so correct me if I am wrong.)
   
   In our problem scenario, we have two DfsClient instances that use two 
separate SaslRpcClient instances, but under the hood they use the same ipc 
client, because ipc clients are cached in the RpcEngine regardless of whether 
we cache the DfsClient in a higher layer.
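
   A minimal sketch of this sharing, under stated assumptions: the class 
names below (Connection, IpcClient, HighLevelClient) and the address string 
are hypothetical stand-ins, not the real Hadoop classes. It only models the 
behaviour the issue description reports, where a cached connection sets the 
fallback flag during its first setup and returns early afterwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-ins for illustration; these are NOT the real Hadoop classes.
public class SharedConnectionSketch {

  /** Models a cached connection that negotiates auth only on its first setup. */
  static final class Connection {
    private boolean streamsOpen = false;

    synchronized void setupIOStreams(AtomicBoolean fallbackToSimpleAuth) {
      if (streamsOpen) {
        // Socket already open: return immediately, the caller's flag is untouched.
        return;
      }
      streamsOpen = true;
      // Assumption: the remote side is non-secure, so the client falls back.
      fallbackToSimpleAuth.set(true);
    }
  }

  /** Models the connection cache shared by every user of the ipc layer. */
  static final class IpcClient {
    private final Map<String, Connection> connections = new ConcurrentHashMap<>();

    void call(String remoteAddress, AtomicBoolean fallbackToSimpleAuth) {
      Connection c = connections.computeIfAbsent(remoteAddress, a -> new Connection());
      c.setupIOStreams(fallbackToSimpleAuth);
    }
  }

  /** Models a higher-level client that owns its own fallback flag. */
  static final class HighLevelClient {
    private final AtomicBoolean fallbackToSimpleAuth = new AtomicBoolean(false);
    private final IpcClient ipc;

    HighLevelClient(IpcClient ipc) {
      this.ipc = ipc;
    }

    /** Makes a call and reports whether this client learned about the fallback. */
    boolean connect(String remoteAddress) {
      ipc.call(remoteAddress, fallbackToSimpleAuth);
      return fallbackToSimpleAuth.get();
    }
  }

  public static void main(String[] args) {
    IpcClient shared = new IpcClient();
    HighLevelClient first = new HighLevelClient(shared);
    HighLevelClient second = new HighLevelClient(shared);

    System.out.println("first fallback:  " + first.connect("remote.example:8020"));  // true
    System.out.println("second fallback: " + second.connect("remote.example:8020")); // false
  }
}
```

   In this model the second client never sees the flag flip, because the 
shared connection was already set up by the first client.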
   
   Your suggestion, as I understand it, is this:
   In order to distinguish between two DfsClient instances, we would need to 
add a client id to the ConnectionId in the ipc layer, which would distinguish 
between the users of the ipc client class inside the ipc client layer.
   This would require adding a bunch of new methods to the RpcEngine layer to 
get the protobuf protocol proxies with a proper ConnectionId set, and it may 
affect not just DfsClient but every other client of the ipc client layer.
   
   I believe it is already unfortunate that the ipc layer decides and controls 
whether the SASL layer uses SASL or falls back to simple auth based 
communication. If we also add client ids from one more layer up, then the 
behaviour of the SASL layer would be defined by the layer above SASL via the 
layer under SASL, and that does not seem better at all.
   
   In my view, the high level client id does not, and should not, distinguish 
between connections. The connection itself is a socket with one end on a port 
on the local machine and the other end on a port on a remote machine. It does 
not matter, and should not matter, who uses the connection, as long as the 
user wants to use the connection between the two ports on the local and remote 
machine with the same settings. Looked at this way, the things in the equals 
method all qualify to differentiate between two connections, while the id of 
the user really does not.
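
   To illustrate the point about identity, here is a small sketch, assuming a 
hypothetical ConnectionKey class (not the real ConnectionId). The fields 
protocolName and rpcTimeoutMs, the host name, and the values are made up for 
illustration; the real equals method may compare a different set of settings. 
The key compares the remote endpoint and the settings only, so two users with 
the same settings share one cache entry.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; this is NOT the real ConnectionId from the ipc layer.
public class ConnectionKeySketch {

  /** Identity of a connection: remote endpoint plus settings, no client id. */
  static final class ConnectionKey {
    private final InetSocketAddress address;
    private final String protocolName; // assumed setting, for illustration only
    private final int rpcTimeoutMs;    // assumed setting, for illustration only

    ConnectionKey(InetSocketAddress address, String protocolName, int rpcTimeoutMs) {
      this.address = Objects.requireNonNull(address);
      this.protocolName = Objects.requireNonNull(protocolName);
      this.rpcTimeoutMs = rpcTimeoutMs;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof ConnectionKey)) {
        return false;
      }
      ConnectionKey other = (ConnectionKey) o;
      return rpcTimeoutMs == other.rpcTimeoutMs
          && address.equals(other.address)
          && protocolName.equals(other.protocolName);
    }

    @Override
    public int hashCode() {
      return Objects.hash(address, protocolName, rpcTimeoutMs);
    }
  }

  /** Convenience factory so callers do not need to build the address themselves. */
  static ConnectionKey key(String host, int port, String protocolName, int rpcTimeoutMs) {
    return new ConnectionKey(
        InetSocketAddress.createUnresolved(host, port), protocolName, rpcTimeoutMs);
  }

  public static void main(String[] args) {
    Map<ConnectionKey, String> cache = new HashMap<>();
    // Two different users asking for the same endpoint with the same settings.
    cache.put(key("remote.example", 8020, "ClientProtocol", 60000), "connection-1");
    cache.put(key("remote.example", 8020, "ClientProtocol", 60000), "connection-1");
    // Same endpoint, different settings: this one is a different connection.
    cache.put(key("remote.example", 8020, "ClientProtocol", 5000), "connection-2");
    System.out.println(cache.size()); // prints 2
  }
}
```

   Adding a client id field to such a key would split the first two entries 
apart even though nothing about the socket or its settings differs.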


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 678004)
    Time Spent: 5h 10m  (was: 5h)

> Fallback to simple auth does not work for a secondary DistributedFileSystem 
> instance
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17975
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> The following code snippet demonstrates what is necessary to cause a failure 
> when connecting from a secure cluster to a non-secure cluster with fallback 
> to SIMPLE auth allowed.
> {code:java}
>     Configuration conf = new Configuration();
>     conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
>     URI fsUri = new URI("hdfs://<nn_uri>");
>     conf.setBoolean("fs.hdfs.impl.disable.cache", true);
>     FileSystem fs = FileSystem.get(fsUri, conf);
>     FSDataInputStream src = fs.open(new Path("/path/to/a/file"));
>     FileOutputStream dst = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src, dst, 1024);
>     // The issue happens even if we re-enable cache at this point
>     //conf.setBoolean("fs.hdfs.impl.disable.cache", false);
>     // The issue does not happen when we close the first FileSystem object
>     // before creating the second.
>     //fs.close();
>     FileSystem fs2 = FileSystem.get(fsUri, conf);
>     FSDataInputStream src2 = fs2.open(new Path("/path/to/a/file"));
>     FileOutputStream dst2 = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src2, dst2, 1024);
> {code}
> The problem is that when the DfsClient is created, it creates an instance of 
> AtomicBoolean, which is propagated down into the IPC layer, where the 
> Client.Connection instance sets its value in setupIOStreams. This connection 
> object is cached and re-used to multiplex requests against the same DataNode.
> When a second DfsClient is created, the AtomicBoolean reference in that 
> client is a new AtomicBoolean, but the Client.Connection instance is the 
> same, and as it already has a socket open to the DataNode, it returns 
> immediately from setupIOStreams, leaving the fallbackToSimpleAuth 
> AtomicBoolean at false, the value it was created with in the DfsClient.
> This AtomicBoolean, on the other hand, controls how the SaslDataTransferClient 
> handles the connection in the level above, and with this value left at the 
> default false, the SaslDataTransferClient of the second DfsClient will not 
> fall back to SIMPLE authentication but will try to send a SASL handshake when 
> connecting to the DataNode.
>  
> Access to the FileSystem via the second DfsClient fails with exceptions 
> like the following one, and then the read fails with a BlockMissingException 
> like the one below:
> {code}
> WARN hdfs.DFSClient: Failed to connect to /<dn_ip>:<dn_port> for file <file> for block BP-531773307-<nn_ip>-1634685133591:blk_1073741826_1002, add to deadNodes and continue. 
> java.io.EOFException: Unexpected EOF while trying to read response from server
>       at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:215)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:455)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
>       at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>       at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:648)
>       at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2980)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
>       at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
>       at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:658)
>       at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:589)
>       at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:771)
>       at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
>       at java.io.DataInputStream.read(DataInputStream.java:100)
>       at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
>       at DfsClientTest3.main(DfsClientTest3.java:30)
> {code}
> {code}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: 
> BP-813026743-<nn_ip>-1495248833293:blk_1139767762_66027405 file=/path/to/file
> {code}
>  
> The DataNode in the meantime logs the following:
> {code}
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: <dn_host>:<dn_port>:DataXceiver error processing unknown operation  src: /<client_ip>:<client_port> dst: /<dn_ip>:<dn_port>
> java.io.IOException: Version Mismatch (Expected: 28, Received: -8531 )
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:70)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:222)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This happens only if the second client connects to the same DataNode as the 
> first one did, so it might seem intermittent if the clients are reading 
> different files, but it always happens if the two clients read the same file 
> with replication factor 1.
> We ran into this issue while running the HBase ExportSnapshot tool to move a 
> snapshot from a non-secure to a secure cluster. The issue is loosely related 
> to HBASE-12819 and HBASE-20433 and similar problems; I am linking these so 
> that the HBase team will see how this is relevant for them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

