[
https://issues.apache.org/jira/browse/HDFS-7541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ming Ma updated HDFS-7541:
--------------------------
Attachment: UpgradeDomains_Design_v3.pdf
Thanks [~eddyxu]! These are very good points. Here is the updated design doc
that answers some of your questions in detail. Please find specific replies
below.
bq. How about call it Availability Domain
Availability might be too general in this context. The service can become
unavailable due to an unplanned event such as a TOR outage or planned
maintenance such as a software upgrade. Both can impact availability. If we
define "Availability Domain" as "if all machines in that domain aren't
available, the service can still function", then machines belonging to one rack
can also be considered one availability domain.
bq. Is this upgrade domain on each DN a soft state or a hard state?
It is a hard state, just like the network location of the node. Admins will
likely keep the upgrade domain unchanged during common operations, but the
design allows admins to move machines around as long as the machines are
decommissioned properly first. When the machines then rejoin under different
upgrade domains, the proper replica will be removed. The updated design doc
provides more details on this.
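As a rough illustration only (the actual policy is the one described in the
design doc, not this sketch), choosing which excess replica to drop so that the
remaining replicas stay spread across upgrade domains could look like the
following. The function name, the (datanode, upgrade domain) tuples and the
tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def pick_replica_to_remove(replicas):
    """Pick one excess replica of a block to delete.

    replicas: non-empty list of (datanode, upgrade_domain) pairs.
    Hypothetical rule: delete from the upgrade domain holding the most
    replicas, so the number of distinct upgrade domains is preserved
    whenever some domain holds more than one replica.
    """
    counts = Counter(ud for _, ud in replicas)
    # Most crowded upgrade domain; ties go to the domain seen first.
    crowded_ud, _ = counts.most_common(1)[0]
    for dn, ud in replicas:
        if ud == crowded_ud:
            return (dn, ud)

# Example: a node rejoined under "ud2", leaving two replicas in that domain.
replicas = [("dn1", "ud1"), ("dn2", "ud2"), ("dn3", "ud3"), ("dn4", "ud2")]
print(pick_replica_to_remove(replicas))
```

In this example the replica chosen for removal sits in "ud2", so the three
remaining replicas still cover three different upgrade domains.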
bq. What do you anticipate as a good strategy to choose upgrade domains UDs?
The updated design doc has more on this. The number of upgrade domains has an
impact on data loss, replica recovery time and rolling upgrade parallelism.
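To make the parallelism point concrete, here is a back-of-the-envelope sketch
that is not taken from the design doc. It assumes datanodes are spread evenly
across upgrade domains and that one whole upgrade domain is restarted per
round. The function names and the 4000-node cluster size are made up for
illustration.

```python
import math

def rolling_upgrade_rounds(num_datanodes, num_upgrade_domains=None):
    """Number of restart rounds needed to cover every datanode.

    Assumption: without upgrade domains, datanodes restart one at a time;
    with upgrade domains, all datanodes in one domain restart together.
    """
    if num_upgrade_domains is None:
        return num_datanodes  # sequential restart: one DN per round
    return min(num_datanodes, num_upgrade_domains)

def datanodes_per_round(num_datanodes, num_upgrade_domains):
    # Largest batch offline at once, assuming an even spread.
    return math.ceil(num_datanodes / num_upgrade_domains)

# Hypothetical 4000-node cluster with 40 upgrade domains.
print(rolling_upgrade_rounds(4000))       # sequential restart
print(rolling_upgrade_rounds(4000, 40))   # one upgrade domain per round
print(datanodes_per_round(4000, 40))      # nodes down together per round
```

Under these assumptions, fewer upgrade domains means fewer rounds but a larger
batch of datanodes offline at the same time, which is the trade-off against
data loss and replica recovery time.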
bq. Regarding the performance impact
The number of racks is on the order of 100, the number of upgrade domains is in
the ballpark of 40, and the number of addBlocks operations is around 1000
ops/sec at peak.
bq. In design v2.pdf, could you mind to rephrase the process of "Replica delete
operation"?
The updated design doc adds more description.
bq. The last one maybe not relevant: would this design work well with erasure
coding (HDFS-7285)?
A similar question was asked in HDFS-7613: how can we reuse different block
placement policies? Like you said, we can address this issue separately.
> Upgrade Domains in HDFS
> -----------------------
>
> Key: HDFS-7541
> URL: https://issues.apache.org/jira/browse/HDFS-7541
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Attachments: HDFS-7541-2.patch, HDFS-7541.patch,
> SupportforfastHDFSdatanoderollingupgrade.pdf, UpgradeDomains_Design_v3.pdf,
> UpgradeDomains_design_v2.pdf
>
>
> The current HDFS DN rolling upgrade step requires sequential DN restarts to
> minimize the impact on data availability and read/write operations. The side
> effect is a longer upgrade duration for large clusters. This might be
> acceptable for a quick DN JVM restart to update Hadoop code/configuration.
> However, for an OS upgrade that requires a machine reboot, the overall upgrade
> duration will be too long if we continue to do sequential DN rolling restarts.
>