[
https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276167#comment-17276167
]
Hongbing Wang commented on HDFS-15779:
--------------------------------------
[~ferhui] Thanks for the review!
Structurally, guarding with *if (targetsStatus[i])* is the cleanest option,
but I was worried that it would cause problems.
targetsStatus[i] can be changed in _StripedWriter#transferData2Targets_,
after which targetsStatus[i] and writers[i] no longer correspond one to one.
Note that they do correspond before that call.
{code:java}
// StripedWriter#transferData2Targets
int transferData2Targets() {
  int nSuccess = 0;
  for (int i = 0; i < targets.length; i++) {
    if (targetsStatus[i]) {
      boolean success = false;
      try {
        writers[i].transferData2Target(packetBuf);
        nSuccess++;
        success = true;
      } catch (IOException e) {
        LOG.warn(e.getMessage());
      }
      targetsStatus[i] = success; // may be false here
    }
  }
  return nSuccess;
}
{code}
If _transferData2Target()_ throws an IOException, I think _writers[i]_ may
still need its buffer cleared by _clearBuffers()_. Is that right?
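
To illustrate the concern, here is a minimal, self-contained sketch. It is not the Hadoop code: FakeWriter, transferAll, clearByStatus and clearByNullCheck are hypothetical names, and the simulated failure stands in for a real IOException from transferData2Target(). It only shows that a status-based guard skips a writer that exists but failed a transfer, while a null-based guard still clears it.

```java
import java.io.IOException;
import java.nio.ByteBuffer;

public class StatusVsNullGuard {

    /** Hypothetical stand-in for StripedBlockWriter; only the target buffer matters. */
    static final class FakeWriter {
        private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);
        private final boolean failTransfer;

        FakeWriter(boolean failTransfer) {
            this.failTransfer = failTransfer;
        }

        ByteBuffer getTargetBuffer() {
            return targetBuffer;
        }

        void transferData2Target() throws IOException {
            if (failTransfer) {
                throw new IOException("simulated transfer failure");
            }
        }
    }

    /** Same shape as transferData2Targets: a failed transfer flips the status to false. */
    static int transferAll(FakeWriter[] writers, boolean[] targetsStatus) {
        int nSuccess = 0;
        for (int i = 0; i < writers.length; i++) {
            if (targetsStatus[i]) {
                boolean success = false;
                try {
                    writers[i].transferData2Target();
                    nSuccess++;
                    success = true;
                } catch (IOException e) {
                    // The real code logs a warning here.
                }
                targetsStatus[i] = success;
            }
        }
        return nSuccess;
    }

    /** Guard on status: skips writers whose transfer failed, even though they exist. */
    static void clearByStatus(FakeWriter[] writers, boolean[] targetsStatus) {
        for (int i = 0; i < writers.length; i++) {
            if (targetsStatus[i]) {
                writers[i].getTargetBuffer().clear();
            }
        }
    }

    /** Guard on null: clears every writer that was actually created. */
    static void clearByNullCheck(FakeWriter[] writers) {
        for (FakeWriter writer : writers) {
            if (writer != null) {
                writer.getTargetBuffer().clear();
            }
        }
    }

    public static void main(String[] args) {
        FakeWriter[] writers = {new FakeWriter(false), new FakeWriter(true)};
        boolean[] targetsStatus = {true, true};
        for (FakeWriter w : writers) {
            w.getTargetBuffer().put(new byte[4]); // position becomes 4
        }
        transferAll(writers, targetsStatus); // targetsStatus[1] is now false

        clearByStatus(writers, targetsStatus);
        System.out.println(writers[1].getTargetBuffer().position()); // prints 4
        clearByNullCheck(writers);
        System.out.println(writers[1].getTargetBuffer().position()); // prints 0
    }
}
```

Whether the stale buffer of a failed writer actually matters depends on how the reconstructor uses it afterwards, which this sketch does not model.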
Thanks again.
> EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
> -------------------------------------------------------------------------
>
> Key: HDFS-15779
> URL: https://issues.apache.org/jira/browse/HDFS-15779
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.2.0
> Reporter: Hongbing Wang
> Assignee: Hongbing Wang
> Priority: Major
> Attachments: HDFS-15779.001.patch
>
>
> The NullPointerException appears in the DN log as follows:
> {code:java}
> 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
> //...
> 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Connection timed out
> 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50010
> {code}
> The NPE occurs at `writer.getTargetBuffer()` in the following code:
> {code:java}
> // StripedWriter#clearBuffers
> void clearBuffers() {
>   for (StripedBlockWriter writer : writers) {
>     ByteBuffer targetBuffer = writer.getTargetBuffer();
>     if (targetBuffer != null) {
>       targetBuffer.clear();
>     }
>   }
> }
> {code}
> So, why is the writer null? Let's track when the writer is initialized and
> when reconstruct() is called, as follows:
> {code:java}
> // StripedBlockReconstructor#run
> public void run() {
>   try {
>     initDecoderIfNecessary();
>     getStripedReader().init();
>     stripedWriter.init(); //①
>     reconstruct(); //②
>     stripedWriter.endTargetBlocks();
>   } catch (Throwable e) {
>     LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
>     // ...
> {code}
> The writers are initialized at ① and reconstruct() is called at ②.
> `stripedWriter.init()` calls `initTargetStreams()`, which looks like this:
> {code:java}
> // StripedWriter#initTargetStreams
> int initTargetStreams() {
>   int nSuccess = 0;
>   for (short i = 0; i < targets.length; i++) {
>     try {
>       writers[i] = createWriter(i);
>       nSuccess++;
>       targetsStatus[i] = true;
>     } catch (Throwable e) {
>       LOG.warn(e.getMessage());
>     }
>   }
>   return nSuccess;
> }
> {code}
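> The NPE occurs when createWriter() throws for at least one target and
> 0 < nSuccess < targets.length: writers[i] stays null for each failed target,
> and clearBuffers() later dereferences it.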
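
The failure mode in the quoted description can be reproduced in isolation. The sketch below is not the Hadoop code and not necessarily what the attached patch does: FakeWriter, initWriters, clearBuffersUnguarded and clearBuffersGuarded are hypothetical names, and the failAt parameter is an assumption used to simulate createWriter() throwing for one target. It shows a partial initialization leaving a null slot, the unguarded loop throwing an NPE, and one possible null guard avoiding it.

```java
import java.nio.ByteBuffer;

public class PartialInitNpe {

    /** Hypothetical stand-in for StripedBlockWriter. */
    static final class FakeWriter {
        private final ByteBuffer targetBuffer = ByteBuffer.allocate(8);

        ByteBuffer getTargetBuffer() {
            return targetBuffer;
        }
    }

    /**
     * Same shape as initTargetStreams: creation at index failAt throws,
     * so writers[failAt] stays null and targetsStatus[failAt] stays false.
     */
    static int initWriters(FakeWriter[] writers, boolean[] targetsStatus, int failAt) {
        int nSuccess = 0;
        for (int i = 0; i < writers.length; i++) {
            try {
                if (i == failAt) {
                    throw new IllegalStateException("simulated createWriter failure");
                }
                writers[i] = new FakeWriter();
                nSuccess++;
                targetsStatus[i] = true;
            } catch (RuntimeException e) {
                // The original catches Throwable and logs a warning.
            }
        }
        return nSuccess;
    }

    /** Mirrors the loop from the description: dereferences every slot. */
    static void clearBuffersUnguarded(FakeWriter[] writers) {
        for (FakeWriter writer : writers) {
            ByteBuffer targetBuffer = writer.getTargetBuffer(); // NPE if writer is null
            if (targetBuffer != null) {
                targetBuffer.clear();
            }
        }
    }

    /** One possible guard (an assumption): skip slots that were never initialized. */
    static void clearBuffersGuarded(FakeWriter[] writers) {
        for (FakeWriter writer : writers) {
            if (writer == null) {
                continue;
            }
            ByteBuffer targetBuffer = writer.getTargetBuffer();
            if (targetBuffer != null) {
                targetBuffer.clear();
            }
        }
    }

    public static void main(String[] args) {
        FakeWriter[] writers = new FakeWriter[3];
        boolean[] targetsStatus = new boolean[3];
        int nSuccess = initWriters(writers, targetsStatus, 1);
        System.out.println("nSuccess = " + nSuccess); // prints nSuccess = 2

        try {
            clearBuffersUnguarded(writers);
        } catch (NullPointerException e) {
            System.out.println("unguarded loop hit NPE");
        }
        clearBuffersGuarded(writers); // completes without throwing
    }
}
```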