[ 
https://issues.apache.org/jira/browse/HDFS-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7541:
--------------------------
    Attachment: UpgradeDomains_Design_v3.pdf

Thanks [~eddyxu]! These are very good points. Here is the updated design doc 
that answers some of your questions in detail. Please find specific replies 
below.

bq. How about call it Availability Domain
Availability might be too general in this context. The service can become 
unavailable due to an unplanned event such as a TOR outage, or due to planned 
maintenance such as a software upgrade. Both can impact availability. If we 
define "Availability Domain" as "if all machines in that domain aren't 
available, the service can still function", then machines belonging to one 
rack can also be considered one availability domain.

bq. Is this upgrade domain on each DN a soft state or a hard state?
It is a hard state, just like the network location of the node. While admins 
will likely keep upgrade domains unchanged during common operations, the 
design allows admins to move machines around as long as the machines are 
decommissioned properly in the first place. Then, when machines rejoin under 
different upgrade domains, the proper replica will be removed. The updated 
design doc provides more details on this.
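
As a rough illustration of what "the proper replica will be removed" could 
mean (a sketch only, not the policy from the design doc or the patch): one 
plausible rule is to drop the excess replica from whichever upgrade domain 
currently holds the most replicas of the block. The function name and the 
`(datanode, upgrade_domain)` input shape are hypothetical.

```python
from collections import Counter


def pick_replica_to_remove(replicas):
    """Pick the datanode whose replica of an over-replicated block is dropped.

    replicas: list of (datanode, upgrade_domain) pairs for a single block.
    Assumed rule (hypothetical): remove a replica from the upgrade domain
    holding the most replicas, so the survivors stay as spread out as possible.
    """
    if not replicas:
        raise ValueError("block has no replicas")
    counts = Counter(domain for _, domain in replicas)
    crowded_domain, _ = counts.most_common(1)[0]
    # Return the first replica listed in the most crowded upgrade domain.
    for datanode, domain in replicas:
        if domain == crowded_domain:
            return datanode


# A node rejoined under "ud2", which already held a replica of this block.
excess = pick_replica_to_remove(
    [("dn1", "ud1"), ("dn2", "ud2"), ("dn3", "ud2"), ("dn4", "ud3")]
)
```

A real policy would also have to respect rack placement and storage 
constraints, which this sketch ignores.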

bq. What do you anticipate as a good strategy to choose upgrade domains UDs?
The updated design doc has more on this. The number of upgrade domains has an 
impact on data loss, replica recovery time and rolling upgrade parallelism.
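
To make the parallelism part concrete, here is a back-of-the-envelope sketch 
(not taken from the design doc). It assumes datanodes are spread evenly 
across upgrade domains and that one whole upgrade domain is restarted per 
round. The function name and the 4000-node cluster size are hypothetical.

```python
import math


def upgrade_rounds(num_datanodes, num_upgrade_domains):
    """Return (rounds, nodes_per_round) when one upgrade domain restarts per round.

    Assumes datanodes are spread as evenly as possible across upgrade domains.
    """
    if num_datanodes <= 0 or num_upgrade_domains <= 0:
        raise ValueError("both counts must be positive")
    # There cannot be more non-empty domains than there are datanodes.
    domains_in_use = min(num_upgrade_domains, num_datanodes)
    nodes_per_round = math.ceil(num_datanodes / domains_in_use)
    return domains_in_use, nodes_per_round


# Hypothetical 4000-node cluster.
sequential = upgrade_rounds(4000, 4000)  # one node per round, as in a sequential restart
with_domains = upgrade_rounds(4000, 40)  # one of 40 upgrade domains per round
```

Fewer, larger upgrade domains shorten the upgrade but take more nodes offline 
in each round, which is where the trade-off with data loss and replica 
recovery time comes in.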

bq. Regarding the performance impact
The number of racks is on the order of 100, the number of upgrade domains is 
in the ballpark of 40, and the number of addBlock operations is around 
1000 ops/sec at peak.

bq. In design v2.pdf, could you mind to rephrase the process of "Replica delete 
operation"?
The updated design doc adds more description.

bq. The last one maybe not relevant: would this design work well with erasure 
coding (HDFS-7285)?
A similar question was asked in HDFS-7613: how can we reuse different block 
placement policies? Like you said, we can address this issue separately.

> Upgrade Domains in HDFS
> -----------------------
>
>                 Key: HDFS-7541
>                 URL: https://issues.apache.org/jira/browse/HDFS-7541
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>         Attachments: HDFS-7541-2.patch, HDFS-7541.patch, 
> SupportforfastHDFSdatanoderollingupgrade.pdf, UpgradeDomains_Design_v3.pdf, 
> UpgradeDomains_design_v2.pdf
>
>
> Current HDFS DN rolling upgrade step requires sequential DN restart to 
> minimize the impact on data availability and read/write operations. The side 
> effect is longer upgrade duration for large clusters. This might be 
> acceptable for DN JVM quick restart to update hadoop code/configuration. 
> However, for OS upgrade that requires machine reboot, the overall upgrade 
> duration will be too long if we continue to do sequential DN rolling restart.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)