[
https://issues.apache.org/jira/browse/HDFS-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Todd Lipcon updated HDFS-3950:
------------------------------
Attachment: hdfs-3950.txt
- Removes the hardcoded timeout for attaining a quorum to write transactions.
It is now configurable (the default is still 20 seconds).
- Change stringification of QuorumJournalManager so that the web UI readout
doesn't end up so wide. We used to print the URI, which was very wide. Now
there is a ", "-separated list of addresses, so it can wrap onto multiple
lines and displays more cleanly. Had to update a unit test or two for this.
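As a rough illustration of the ", "-separated readout, here is a minimal
sketch. {{AddressListFormatter}} and its {{format}} method are hypothetical
names made up for this example, not code from the patch.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper; not part of the patch. */
final class AddressListFormatter {
    /** Renders each address as host:port and joins them with ", ". */
    static String format(List<InetSocketAddress> addrs) {
        return addrs.stream()
            .map(a -> a.getHostString() + ":" + a.getPort())
            .collect(Collectors.joining(", "));
    }
}
```

Because the separator contains a space, a browser is free to break the line
after any address.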
- Change the buffer capacity for the QuorumOutputStream to match the behavior
of EditLogFileOutputStream (i.e. it respects FSEditLog.setOutputBufferCapacity())
- Removed TODO:
{code}
- // TODO: check that md5s match up between any "tied" logs
{code}
We removed the md5sum field in HDFS-3943. When we add it back, we can add a
sanity check like this.
- Removed a couple of TODOs, which I replaced with comments explaining why the
current behavior does in fact work.
- Reduced verbose logging during newEpoch(). The verbose logging of newEpoch()
responses is now at DEBUG level, with a less verbose one at INFO level.
- Removed a bunch of unused imports in various files.
- Replace use of deprecated RPC.getServer with the new Builder interface from
Common.
- Address some TODOs in {{Journal.checkRequest}}. These are the most
interesting non-trivial changes from this patch:
-- Tracks the current IPC serial number and sanity-checks that serial numbers
only increase within a given epoch. This is defensive against bugs in the IPC
layer, and would also defend against a potential bug where multiple writers got
assigned the same epoch.
-- Whenever we get an RPC from a new epoch (higher than lastPromisedEpoch), we
treat that as an explicit "promise" not to accept lower ones. This helps
tighten our sanity checks. We used to assign lastPromisedEpoch only as part of
{{newEpoch()}}, and strictly speaking that is all that is necessary, but
re-assigning it on any higher-epoched RPC is extra-defensive.
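The two checks above can be sketched roughly as follows. This is a simplified
illustration, not the code in {{Journal.checkRequest}}: the class name
{{RequestChecker}}, the field {{currentEpochIpcSerial}}, and the use of
{{IllegalStateException}} are assumptions made for this example.

```java
/** Hypothetical, simplified stand-in for the per-request checks. */
final class RequestChecker {
    private long lastPromisedEpoch = 0;
    // Highest IPC serial number seen so far in the current epoch.
    private long currentEpochIpcSerial = -1;

    /** Validates one incoming request and updates the tracked state. */
    synchronized void checkRequest(long epoch, long ipcSerial) {
        if (epoch < lastPromisedEpoch) {
            throw new IllegalStateException("Epoch " + epoch
                + " is less than the last promised epoch " + lastPromisedEpoch);
        }
        if (epoch > lastPromisedEpoch) {
            // A higher epoch is treated as a promise not to accept lower ones.
            lastPromisedEpoch = epoch;
            currentEpochIpcSerial = -1; // serial numbers restart per epoch
        }
        if (ipcSerial <= currentEpochIpcSerial) {
            throw new IllegalStateException("IPC serial " + ipcSerial
                + " is not greater than the previous serial "
                + currentEpochIpcSerial + " in epoch " + epoch);
        }
        currentEpochIpcSerial = ipcSerial;
    }

    synchronized long getLastPromisedEpoch() {
        return lastPromisedEpoch;
    }
}
```

With this shape, a replayed or reordered request within an epoch is rejected,
and so is any request from an older epoch once a newer one has been seen.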
- Include the client IP address in some of the more important INFO messages.
- Remove stale TODO:
{code}
- // TODO: right now, a recovery of a segment when the log is
- // completely emtpy (ie startLogSegment() but no txns)
- // will fail this assertion here, since endTxId < startTxId
{code}
There are lots of tests for this circumstance now; it has long since been fixed.
- Adds a few new sanity checks that I thought of while reviewing the code.
- Adds a fault injection point between where a logger downloads a log segment
and where it persists the metadata about that log segment. I had a hunch there
might be a bug here, but the tests pass with the fault injected, so I think it
turned out not to be a problem. The new fault injection point uses the same
strategy as CheckpointFaultInjector.
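For readers unfamiliar with that strategy, here is a minimal sketch of the
general pattern as I understand it: a replaceable static instance whose hook
methods do nothing in production, which tests swap for a version that throws.
The names {{SegmentSyncFaultInjector}}, {{beforePersistSegmentMetadata}}, and
{{SegmentSyncer}} are hypothetical and do not come from the patch.

```java
/** Hypothetical injector: hooks are no-ops unless a test replaces the instance. */
class SegmentSyncFaultInjector {
    static SegmentSyncFaultInjector instance = new SegmentSyncFaultInjector();

    /** Called after the segment is downloaded, before its metadata is persisted. */
    void beforePersistSegmentMetadata() {
        // Intentionally empty in production.
    }
}

/** Hypothetical stand-in for the logger-side code path being exercised. */
class SegmentSyncer {
    boolean downloaded = false;
    boolean metadataPersisted = false;

    void syncSegment() {
        downloaded = true; // stand-in for downloading the log segment
        SegmentSyncFaultInjector.instance.beforePersistSegmentMetadata();
        metadataPersisted = true; // stand-in for persisting the metadata
    }
}
```

A test can then install a subclass (or a mock) whose hook throws, and verify
that recovery copes with a segment that was downloaded but whose metadata was
never written.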
- Improves {{PersistentLongFile}} to not re-write the file when the value has
not changed.
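The idea behind the {{PersistentLongFile}} change can be sketched as below.
This is an illustration only: {{CachedLongFile}} is a hypothetical class, it
writes with plain {{java.nio.file.Files}} calls rather than whatever atomic
write the real class uses, and {{getWriteCount()}} exists purely so the
behavior can be observed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a long persisted in a file, skipping redundant writes. */
final class CachedLongFile {
    private final Path file;
    private final long defaultValue;
    private long value;
    private boolean loaded = false;
    private int writeCount = 0;

    CachedLongFile(Path file, long defaultValue) {
        this.file = file;
        this.defaultValue = defaultValue;
    }

    /** Returns the cached value, reading the file on first use. */
    synchronized long get() {
        if (!loaded) {
            value = readFromDisk();
            loaded = true;
        }
        return value;
    }

    /** Persists newValue only if nothing is cached yet or the value differs. */
    synchronized void set(long newValue) {
        if (!loaded || value != newValue) {
            writeToDisk(newValue);
            writeCount++;
        }
        value = newValue;
        loaded = true;
    }

    synchronized int getWriteCount() {
        return writeCount;
    }

    private long readFromDisk() {
        try {
            if (!Files.exists(file)) {
                return defaultValue;
            }
            byte[] bytes = Files.readAllBytes(file);
            return Long.parseLong(new String(bytes, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void writeToDisk(long v) {
        try {
            Files.write(file, Long.toString(v).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Setting the same value repeatedly then costs one disk write rather than one
per call.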
I ran this through my cluster fault injection test and it passed. I also ran
findbugs and there are no issues found. Ran the full unit test suite for
qjournal and it passed.
> QJM: misc TODO cleanup, improved log messages, etc
> --------------------------------------------------
>
> Key: HDFS-3950
> URL: https://issues.apache.org/jira/browse/HDFS-3950
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ha
> Affects Versions: QuorumJournalManager (HDFS-3077)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Minor
> Attachments: hdfs-3950.txt
>
>
> General JIRA for a bunch of miscellaneous clean-up in the QJM branch:
> - fix most remaining TODOs
> - improve some log/error messages
> - add some more sanity checks where appropriate
> - address any findbugs that might have crept into branch