[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825214#comment-17825214 ]
ASF GitHub Bot commented on HDFS-15413: --------------------------------------- hadoop-yetus commented on PR #5829: URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1987902095 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |:----:|----------:|--------:|:--------:|:-------:| | +0 :ok: | reexec | 0m 19s | | Docker mode activated. | |||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 1s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | |||| _ trunk Compile Tests _ | | +0 :ok: | mvndep | 14m 40s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 21m 13s | | trunk passed | | +1 :green_heart: | compile | 3m 2s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 3m 5s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 0m 48s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 19s | | trunk passed | | +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 48s | | trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | -1 :x: | spotbugs | 1m 39s | [/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/artifact/out/branch-spotbugs-hadoop-hdfs-project_hadoop-hdfs-client-warnings.html) | hadoop-hdfs-project/hadoop-hdfs-client in trunk has 1 extant spotbugs warnings. | | +1 :green_heart: | shadedclient | 22m 50s | | branch has no errors when building and testing our client artifacts. | | -0 :warning: | patch | 23m 3s | | Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary. | |||| _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 24s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 1m 1s | | the patch passed | | +1 :green_heart: | compile | 2m 55s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 2m 55s | | the patch passed | | +1 :green_heart: | compile | 2m 56s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | javac | 2m 56s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply | | -0 :warning: | checkstyle | 0m 44s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 1 new + 45 unchanged - 0 fixed = 46 total (was 45) | | +1 :green_heart: | mvnsite | 1m 13s | | the patch passed | | +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 23s | | the patch passed | | +1 :green_heart: | shadedclient | 23m 7s | | patch has no errors when building and testing our client artifacts. | |||| _ Other Tests _ | | +1 :green_heart: | unit | 1m 54s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 235m 57s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 32s | | The patch does not generate ASF License warnings. | | | | 351m 39s | | | | Reason | Tests | |-------:|:------| | Failed junit tests | hadoop.hdfs.TestEncryptionZonesWithKMS | | | hadoop.hdfs.TestDFSStripedInputStreamWithTimeout | | | hadoop.hdfs.server.namenode.snapshot.TestRenameWithSnapshots | | | hadoop.hdfs.server.namenode.TestFSEditLogLoader | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.tools.TestDFSAdmin | | | hadoop.hdfs.TestDFSStripedOutputStreamWithRandomECPolicy | | Subsystem | Report/Notes | |----------:|:-------------| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5829 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux f03f54ac86d1 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 7e4992b1770732bb32a2f8a2e8224af490c11ee3 | | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/testReport/ | | Max. process+thread count | 4049 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/7/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > DFSStripedInputStream throws exception when datanodes close idle connections > ---------------------------------------------------------------------------- > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client > Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 10000 > - dfs.datanode.socket.write.timeout = 10000 > Reporter: Andrey Elenskiy > Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > Thread.sleep(sleepDuration); > System.out.println("resuming"); > } > if (count == countBeforeSleep + countAfterSleep) { > System.out.println("done"); > break; > } > } > } catch (Exception e) { > System.out.println("exception on read " + count + " read total " > + readTotal); > throw e; > } > } > } > {code} > The issue appears to be due to the fact that datanodes close the connection > of EC client if it doesn't fetch next packet for longer than > dfs.client.socket-timeout. The EC client doesn't retry and instead assumes > that those datanodes went away resulting in "missing blocks" exception. > I was able to consistently reproduce with the following arguments: > {noformat} > bufLen = 1000000 (just below 1MB which is the size of the stripe) > sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000) > countBeforeSleep = 1 > countAfterSleep = 7 > {noformat} > I've attached the entire log output of running the snippet above against > erasure coded file with RS-3-2-1024k policy. And here are the logs from > datanodes of disconnecting the client: > datanode 1: > {noformat} > 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: > /10.128.14.46:9866); java.net.SocketTimeoutException: 10000 millis timeout > while waiting for channel to be ready for write. ch : > java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 > remote=/10.128.23.40:53748] > {noformat} > datanode 2: > {noformat} > 2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:48772 dst: > /10.128.9.42:9866); java.net.SocketTimeoutException: 10000 millis timeout > while waiting for channel to be ready for write. ch : > java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 > remote=/10.128.23.40:48772] > {noformat} > datanode 3: > {noformat} > 2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:57184 dst: > /10.128.16.13:9866); java.net.SocketTimeoutException: 10000 millis timeout > while waiting for channel to be ready for write. ch : > java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 > remote=/10.128.23.40:57184] > {noformat} > I've tried running the same code again non-ec files with replication of 3 and > was not able to reproduce the issue with any parameters. Looking through the > code, it's pretty clear that non-ec DFSInputStream retries reads after > exception: > https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L844 > Let me know if you need any more information that can help you out with > addressing this issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org