[
https://issues.apache.org/jira/browse/HDFS-10275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lin Yiqun updated HDFS-10275:
-----------------------------
Description:
The unit test {{TestDataNodeMetrics}} fails intermittently. The failure output
shows:
{code}
Results :
Failed tests:
TestDataNodeVolumeFailureToleration.testVolumeAndTolerableConfiguration:195->testVolumeConfig:232
expected:<false> but was:<true>
Tests in error:
TestOpenFilesWithSnapshot.testWithCheckpoint:94 ? IO Timed out waiting for
Min...
TestDataNodeMetrics.testDataNodeTimeSpend:279 ? Timeout Timed out waiting for
...
TestHFlush.testHFlushInterrupted ? IO The stream is closed
{code}
Line 279 of {{TestDataNodeMetrics}} is where the timeout occurs. Looking into
the code, I found the real reason: the {{TotalWriteTime}} metric frequently
counts 0 in each file-creation iteration, and this leads to the test retrying
until it times out.
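The retry-until-timeout behavior described above can be pictured as a polling loop like the one below. This is a hypothetical simplification for illustration, not the actual Hadoop test code; {{waitForPositive}} and its parameters are made-up names:
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.LongSupplier;

public class WaitForMetric {
    // Polls a metric supplier until it becomes positive or the deadline passes.
    // (Hypothetical sketch; the real test uses Hadoop's own wait utilities.)
    static long waitForPositive(LongSupplier metric, long timeoutMs)
            throws InterruptedException, TimeoutException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            long v = metric.getAsLong();
            if (v > 0) {
                return v;          // metric finally moved
            }
            Thread.sleep(10);      // retry; if the metric stays 0, this loops
        }                          // until the deadline and then fails
        throw new TimeoutException("Timed out waiting for metric > 0");
    }

    public static void main(String[] args) throws Exception {
        // A metric that is already positive is returned immediately.
        System.out.println(waitForPositive(() -> 5L, 1000));
    }
}
{code}
If {{TotalWriteTime}} never rises above 0, a loop of this shape can only exit through the timeout path, which matches the failure seen on line 279.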
I debugged the test locally. The most likely reason the {{TotalWriteTime}}
metric always counts 0 is that the test uses {{SimulatedFSDataset}} for the
time-spent test. In {{SimulatedFSDataset}}, the write time is accumulated
through the inner class method {{SimulatedOutputStream#write}}, which merely
updates {{length}} and throws the data away:
{code}
@Override
public void write(byte[] b, int off, int len) throws IOException {
  length += len;
}
{code}
So the write operation costs almost no time, and we should create the file in
a real way instead of the simulated one. I have verified locally that the test
passes in a single attempt once the simulated dataset is removed, whereas with
the old way the test retries many times to accumulate write time.
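The effect is easy to demonstrate with a minimal sketch. {{NoOpStream}} below is a made-up stand-in for {{SimulatedOutputStream}}, not the real Hadoop class; it shows that the wall-clock time around a write that only bumps a counter is essentially zero, so a metric incremented by that elapsed time stays at 0:
{code:java}
import java.io.IOException;
import java.io.OutputStream;

public class WriteTimeDemo {
    // Hypothetical stand-in for SimulatedOutputStream:
    // discards the data and tracks only the length.
    static class NoOpStream extends OutputStream {
        long length = 0;
        @Override public void write(int b) { length += 1; }
        @Override public void write(byte[] b, int off, int len) { length += len; }
    }

    public static void main(String[] args) throws IOException {
        NoOpStream out = new NoOpStream();
        byte[] block = new byte[64 * 1024];

        long start = System.currentTimeMillis();
        for (int i = 0; i < 100; i++) {
            out.write(block, 0, block.length);   // no real I/O happens
        }
        long elapsedMs = System.currentTimeMillis() - start;

        // The elapsed time is almost always 0 ms here, so a metric that
        // accumulates it never becomes positive.
        System.out.println("length=" + out.length + " elapsedMs=" + elapsedMs);
    }
}
{code}
With a real dataset backing the stream, the same loop performs actual disk I/O, so the measured time is reliably nonzero and the metric assertion passes without retries.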
> TestDataNodeMetrics failing intermittently due to TotalWriteTime counted
> incorrectly
> ------------------------------------------------------------------------------------
>
> Key: HDFS-10275
> URL: https://issues.apache.org/jira/browse/HDFS-10275
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Lin Yiqun
> Assignee: Lin Yiqun
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)