[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742165#comment-17742165 ]
ASF GitHub Bot commented on HDFS-15413:
---------------------------------------
hadoop-yetus commented on PR #5829:
URL: https://github.com/apache/hadoop/pull/5829#issuecomment-1631451493
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 49s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +0 :ok: | mvndep | 15m 55s | | Maven dependency ordering for branch |
| +1 :green_heart: | mvninstall | 37m 31s | | trunk passed |
| +1 :green_heart: | compile | 6m 12s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 5m 55s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 1m 30s | | trunk passed |
| +1 :green_heart: | mvnsite | 2m 24s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 53s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 2m 17s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 6m 2s | | trunk passed |
| +1 :green_heart: | shadedclient | 44m 7s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +0 :ok: | mvndep | 0m 31s | | Maven dependency ordering for patch |
| +1 :green_heart: | mvninstall | 2m 7s | | the patch passed |
| +1 :green_heart: | compile | 6m 25s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 6m 25s | | the patch passed |
| +1 :green_heart: | compile | 6m 6s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 6m 6s | | the patch passed |
| -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply |
| -0 :warning: | checkstyle | 1m 27s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 8 new + 45 unchanged - 0 fixed = 53 total (was 45) |
| +1 :green_heart: | mvnsite | 2m 6s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 2m 5s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 6m 8s | | the patch passed |
| +1 :green_heart: | shadedclient | 40m 45s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 2m 22s | | hadoop-hdfs-client in the patch passed. |
| -1 :x: | unit | 253m 30s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 0m 52s | | The patch does not generate ASF License warnings. |
| | | 451m 11s | | |
| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestObserverNode |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5829 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 2fa23e2f4762 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5366ce9e970d6c2b849a2a4b2a2d831923ded3d7 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/testReport/ |
| Max. process+thread count | 2328 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-client hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5829/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> DFSStripedInputStream throws exception when datanodes close idle connections
> ----------------------------------------------------------------------------
>
> Key: HDFS-15413
> URL: https://issues.apache.org/jira/browse/HDFS-15413
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ec, erasure-coding, hdfs-client
> Affects Versions: 3.1.3
> Environment: - Hadoop 3.1.3
> - erasure coding with ISA-L and RS-3-2-1024k scheme
> - running in kubernetes
> - dfs.client.socket-timeout = 10000
> - dfs.datanode.socket.write.timeout = 10000
> Reporter: Andrey Elenskiy
> Priority: Critical
> Labels: pull-request-available
> Attachments: out.log
>
>
> We've run into an issue with compactions failing in HBase when erasure coding
> is enabled on a table directory. After digging further I was able to narrow
> it down to the seek + read logic and reproduce the issue with the HDFS client
> alone:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ReaderRaw {
>     public static void main(final String[] args) throws Exception {
>         Path p = new Path(args[0]);
>         int bufLen = Integer.parseInt(args[1]);
>         int sleepDuration = Integer.parseInt(args[2]);
>         int countBeforeSleep = Integer.parseInt(args[3]);
>         int countAfterSleep = Integer.parseInt(args[4]);
>         Configuration conf = new Configuration();
>         FSDataInputStream istream = FileSystem.get(conf).open(p);
>         byte[] buf = new byte[bufLen];
>         int readTotal = 0;
>         int count = 0;
>         try {
>             while (true) {
>                 istream.seek(readTotal);
>                 int bytesRemaining = bufLen;
>                 int bufOffset = 0;
>                 // Fill the buffer completely before counting the read.
>                 while (bytesRemaining > 0) {
>                     int nread = istream.read(buf, bufOffset, bytesRemaining);
>                     if (nread < 0) {
>                         throw new Exception("nread is less than zero");
>                     }
>                     readTotal += nread;
>                     bufOffset += nread;
>                     bytesRemaining -= nread;
>                 }
>                 count++;
>                 if (count == countBeforeSleep) {
>                     // Idle long enough for the datanodes to drop the connections.
>                     System.out.println("sleeping for " + sleepDuration + " milliseconds");
>                     Thread.sleep(sleepDuration);
>                     System.out.println("resuming");
>                 }
>                 if (count == countBeforeSleep + countAfterSleep) {
>                     System.out.println("done");
>                     break;
>                 }
>             }
>         } catch (Exception e) {
>             System.out.println("exception on read " + count + " read total " + readTotal);
>             throw e;
>         }
>     }
> }
> {code}
> The issue appears to be that datanodes close the EC client's connection if it
> doesn't fetch the next packet within dfs.client.socket-timeout. The EC client
> doesn't retry; instead it assumes those datanodes went away, resulting in a
> "missing blocks" exception.
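> A minimal, hypothetical sketch of a client-side workaround, assuming one is
> willing to re-open the stream after an idle pause (the class name and the
> retry-once policy below are illustrative assumptions, not part of Hadoop):
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Hypothetical helper: perform a positioned read on a fresh stream, and
> // retry once on failure, assuming the first failure is the stale-connection
> // case described above.
> public class RetryingReader {
>     public static int readAt(FileSystem fs, Path p, long pos,
>             byte[] buf, int off, int len) throws IOException {
>         try (FSDataInputStream in = fs.open(p)) {
>             in.seek(pos);
>             return in.read(buf, off, len);
>         } catch (IOException e) {
>             // Re-open to get a fresh set of datanode connections, then retry.
>             try (FSDataInputStream in = fs.open(p)) {
>                 in.seek(pos);
>                 return in.read(buf, off, len);
>             }
>         }
>     }
> }
> {code}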
> I was able to consistently reproduce with the following arguments:
> {noformat}
> bufLen = 1000000 (just below 1MB which is the size of the stripe)
> sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000)
> countBeforeSleep = 1
> countAfterSleep = 7
> {noformat}
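> For reference, one way to compile and run the snippet with those arguments
> (the file path is a placeholder, and this assumes the Hadoop client jars are
> available via the hadoop command):
> {noformat}
> javac -cp "$(hadoop classpath)" ReaderRaw.java
> java -cp "$(hadoop classpath):." ReaderRaw /path/to/ec-file 1000000 11000 1 7
> {noformat}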
> I've attached the entire log output of running the snippet above against an
> erasure-coded file with the RS-3-2-1024k policy. Here are the datanode logs
> showing the client being disconnected:
> datanode 1:
> {noformat}
> 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: /10.128.14.46:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.14.46:9866 remote=/10.128.23.40:53748]
> {noformat}
> datanode 2:
> {noformat}
> 2020-06-15 19:06:20,341 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-1-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:48772 dst: /10.128.9.42:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.9.42:9866 remote=/10.128.23.40:48772]
> {noformat}
> datanode 3:
> {noformat}
> 2020-06-15 19:06:20,467 INFO datanode.DataNode: Likely the client has stopped reading, disconnecting it (datanode-v11-3-hadoop.hadoop:9866:DataXceiver error processing READ_BLOCK operation src: /10.128.23.40:57184 dst: /10.128.16.13:9866); java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.128.16.13:9866 remote=/10.128.23.40:57184]
> {noformat}
> I've tried running the same code against non-EC files with a replication
> factor of 3 and was not able to reproduce the issue with any parameters.
> Looking through the code, it's clear that the non-EC DFSInputStream retries
> reads after an exception (a simplified sketch of that retry shape is shown
> below):
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L844
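> A simplified, self-contained sketch of that retry shape (the linked
> DFSInputStream code is the authoritative version; the Reader interface and
> the retry budget here are illustrative assumptions):
> {code:java}
> import java.io.IOException;
>
> public class RetrySketch {
>     interface Reader {
>         int read(byte[] buf, int off, int len) throws IOException;
>     }
>
>     static int readWithRetries(Reader reader, byte[] buf, int off, int len)
>             throws IOException {
>         int retries = 2; // assumed budget; DFSInputStream derives its own
>         while (true) {
>             try {
>                 return reader.read(buf, off, len);
>             } catch (IOException e) {
>                 if (--retries == 0) {
>                     throw e; // out of retries: surface the failure
>                 }
>                 // DFSInputStream marks the failing node dead and reads from
>                 // another replica here; the EC stripe reader currently has
>                 // no equivalent retry, which is this bug.
>             }
>         }
>     }
> }
> {code}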
> Let me know if you need any more information to help address this issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)