[
https://issues.apache.org/jira/browse/HDFS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964620#comment-14964620
]
Walter Su commented on HDFS-9079:
---------------------------------
1.
bq. DataStreamer#shouldStop should check internal error instead of all errors.
Otherwise there could be an infinite loop if a streamer is assigned an external
error before initializing its own blockStream.
shouldStop() only stops sending the current packet, so you shouldn't change it.
Instead, make sure blockStream is not null when an external error is set.
That is how HDFS-9040 handled it.
2. The {{coordinator}} thread leaks. See {{allocateNewBlock()}}. Instead of
using {{getEndedBlock()}}, you can stop the coordinator thread properly and
{{join()}} it.
3. The two health checks below are redundant, which hurts readability even
though the code works correctly.
{code}
if (current.isHealthy()) { //writeChunk(..)
...
} else if (coordinator.getStreamerStatus(current.getIndex()) ==
BlockMetadataCoordinator.StreamerStatus.RUNNING){
{code}
4. You insert an empty, closed streamer. Why not just use the new
{{StreamerStatus.NULL}} concept? This is the same kind of issue as #3 above.
{code}
StripedDataStreamer streamer = new StripedDataStreamer(stat,
dfsClient, src, progress, checksum, cachingStrategy, byteArrayManager,
favoredNodes, i, coordinator, blocks[i]);
streamers.set(i, streamer);
if (blocks[i] != null) {
streamer.start();
} else {
streamer.close(false);
}
{code}
5.
bq. It's a very good point that the current patch doesn't handle failures of
the streamer threads. Since the change is already quite large, maybe we can
leave that as a separate JIRA, if we at least agree on the basic direction of
this JIRA?
Agreed. Do you also agree to use a timed wait in coordinator.run()? If so, we
can use the coordinator to postpone setExternalError() to address #1.
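
To make points 2 and 5 concrete, here is a minimal sketch of a coordinator
thread that uses a timed wait in run() and can be stopped and joined so that it
does not leak. The class CoordinatorSketch, its event queue, and the methods
report(), handledCount(), and shutdown() are hypothetical stand-ins for
illustration only; they are not taken from the patch.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the coordinator thread; not the patch's class. */
class CoordinatorSketch extends Thread {
  // Assumption: streamers report events (e.g. an ended block) by index.
  private final BlockingQueue<Integer> events = new LinkedBlockingQueue<>();
  private final AtomicInteger handled = new AtomicInteger();
  private final long waitMillis;
  private volatile boolean running = true;

  CoordinatorSketch(long waitMillis) {
    super("coordinator-sketch");
    this.waitMillis = waitMillis;
  }

  /** Called by a streamer to hand an event to the coordinator. */
  void report(int streamerIndex) {
    events.add(streamerIndex);
  }

  int handledCount() {
    return handled.get();
  }

  @Override
  public void run() {
    try {
      while (running) {
        // Timed wait: return after waitMillis even if no streamer reports,
        // so one stalled streamer cannot block the coordinator forever.
        Integer index = events.poll(waitMillis, TimeUnit.MILLISECONDS);
        if (index != null) {
          handled.incrementAndGet();
        }
        // On timeout (index == null) a real coordinator could check
        // streamer health or apply a postponed external error here.
      }
    } catch (InterruptedException e) {
      // Interrupted by shutdown(): restore the flag and exit the loop.
      Thread.currentThread().interrupt();
    }
  }

  /** Stops the loop and waits for the thread to exit, so it does not leak. */
  void shutdown() throws InterruptedException {
    running = false;
    interrupt();
    join();
  }
}
```

A caller would start() the thread when a block group is allocated and call
shutdown() before allocating the next one. Events still queued at shutdown are
dropped in this sketch.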
> Erasure coding: preallocate multiple generation stamps and serialize updates
> from data streamers
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-9079
> URL: https://issues.apache.org/jira/browse/HDFS-9079
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: erasure-coding
> Affects Versions: HDFS-7285
> Reporter: Zhe Zhang
> Assignee: Zhe Zhang
> Attachments: HDFS-9079-HDFS-7285.00.patch, HDFS-9079.01.patch,
> HDFS-9079.02.patch, HDFS-9079.03.patch, HDFS-9079.04.patch
>
>
> A non-striped DataStreamer goes through the following steps in error handling:
> {code}
> 1) Finds error => 2) Asks NN for new GS => 3) Gets new GS from NN => 4)
> Applies new GS to DN (createBlockOutputStream) => 5) Ack from DN => 6)
> Updates block on NN
> {code}
> To simplify the above we can preallocate GS when NN creates a new striped
> block group ({{FSN#createNewBlock}}). For each new striped block group we can
> reserve {{NUM_PARITY_BLOCKS}} GS's. Then steps 1~3 in the above sequence can
> be saved. If more than {{NUM_PARITY_BLOCKS}} errors have happened we
> shouldn't try to further recover anyway.
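
The preallocation idea in the description above can be sketched as follows. This
is an illustration under stated assumptions, not the patch's implementation:
the class PreallocatedGenStamps and the method nextOnError() are hypothetical,
and the sketch assumes the reserved generation stamps are the consecutive
values following the initial one.

```java
import java.util.OptionalLong;

/** Hypothetical holder for generation stamps reserved at block group creation. */
class PreallocatedGenStamps {
  private final long firstReserved;
  private final int reservedCount;
  private int used = 0;

  PreallocatedGenStamps(long initialGS, int numParityBlocks) {
    // Assumption: the NN reserves the numParityBlocks stamps after initialGS.
    this.firstReserved = initialGS + 1;
    this.reservedCount = numParityBlocks;
  }

  /**
   * Returns the next reserved GS without a round trip to the NN, or empty
   * once more errors have occurred than stamps were reserved. Synchronized
   * so that updates from concurrent streamers are serialized.
   */
  synchronized OptionalLong nextOnError() {
    if (used >= reservedCount) {
      return OptionalLong.empty(); // too many failures: stop recovering
    }
    return OptionalLong.of(firstReserved + used++);
  }
}
```

With an initial GS of 1000 and three reserved stamps, successive errors would
receive 1001, 1002, and 1003, and a fourth error would get an empty result.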
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)