Hi Todd,

Thank you, this is tremendously valuable input! I'll have to look in detail
at each of these ten jiras, and will get back to the list with more info
shortly.

--Matt

On Fri, Sep 2, 2011 at 1:03 PM, Todd Lipcon <[email protected]> wrote:
> The following other JIRAs have been committed in CDH for 18 months or
> so, for the purpose of HBase. You may want to consider backporting
> them as well - many were never committed to 0.20-append due to lack of
> reviews by HDFS committers at the time.
>
> HDFS-1056. Fix possible multinode deadlocks during block recovery
> when using ephemeral dataxceiv
>
> Description: Fixes the logic by which datanodes identify local RPC
>              targets during block recovery for the case when the
>              datanode is configured with an ephemeral data transceiver
>              port.
> Reason: Potential internode deadlock for clusters using ephemeral ports
>
>
> HADOOP-6722. Workaround a TCP spec quirk by not allowing
> NetUtils.connect to connect to itself
>
> Description: TCP's ephemeral port assignment results in the possibility
>              that a client can connect back to its own outgoing socket,
>              resulting in failed RPCs or datanode transfers.
> Reason: Fixes intermittent errors in cluster testing with ephemeral
>         IPC/transceiver ports on datanodes.
>
> HDFS-1122. Don't allow client verification to prematurely add
> inprogress blocks to DataBlockScanner
>
> Description: When a client reads a block that is also open for writing,
>              it should not add it to the datanode block scanner.
>              If it does, the block scanner can incorrectly mark the
>              block as corrupt, causing data loss.
> Reason: Potential dataloss with concurrent writer-reader case.
>
> HDFS-1248. Miscellaneous cleanup and improvements on 0.20 append branch
>
> Description: Miscellaneous code cleanup and logging changes, including:
> - Slight cleanup to recoverFile() function in TestFileAppend4
> - Improve error messages on OP_READ_BLOCK
> - Some comment cleanup in FSNamesystem
> - Remove toInodeUnderConstruction (was not used)
> - Add some checks for null blocks in FSNamesystem to avoid a possible
>   NPE
> - Only log "inconsistent size" warnings at WARN level for
>   non-under-construction blocks.
> - Redundant addStoredBlock calls are also not worthy of WARN level
> - Add some extra information to a warning in ReplicationTargetChooser
> Reason: Improves diagnosis of error cases and clarity of code
>
>
> HDFS-1242. Add unit test for the appendFile race condition /
> synchronization bug fixed in HDFS-142
>
> Reason: Test coverage for previously applied patch.
>
> HDFS-1218. Replicas that are recovered during DN startup should
> not be allowed to truncate better replicas.
>
> Description: If a datanode loses power and then recovers, its replicas
>              may be truncated due to the recovery of the local FS
>              journal. This patch ensures that a replica truncated by
>              a power loss does not truncate the block on HDFS.
> Reason: Potential dataloss bug uncovered by power failure simulation
>
> HDFS-915. Write pipeline hangs for too long when ResponseProcessor
> hits timeout
>
> Description: Previously, the write pipeline would hang for the entire
>              write timeout when it encountered a read timeout (eg due
>              to a network connectivity issue). This patch interrupts
>              the writing thread when a read error occurs.
> Reason: Faster recovery from pipeline failure for HBase and other
>         interactive applications.
>
>
> HDFS-1186. Writers should be interrupted when recovery is started,
> not when it's completed.
>
> Description: When the write pipeline recovery process is initiated, this
>              interrupts any concurrent writers to the block under
>              recovery. This prevents a case where some edits may be
>              lost if the writer has lost its lease but continues to
>              write (eg due to a garbage collection pause)
> Reason: Fixes a potential dataloss bug
>
>
> commit a960eea40dbd6a4e87072bdf73ac3b62e772f70a
> Author: Todd Lipcon <[email protected]>
> Date: Sun Jun 13 23:02:38 2010 -0700
>
> HDFS-1197. Received blocks should not be added to block map
> prematurely for under construction files
>
> Description: Fixes a possible dataloss scenario when using append() on
>              real-life clusters. Also augments unit tests to uncover
>              similar bugs in the future by simulating latency when
>              reporting blocks received by datanodes.
> Reason: Append support dataloss bug
> Author: Todd Lipcon
>
>
> HDFS-1260. tryUpdateBlock should do validation before renaming meta file
>
> Description: Solves bug where block became inaccessible in certain
>              failure conditions (particularly network partitions).
>              Observed under HBase workload at user site.
> Reason: Potential loss of synced data when write pipeline fails
>
>
> On Fri, Sep 2, 2011 at 11:20 AM, Suresh Srinivas <[email protected]> wrote:
> > I also propose following jiras, which are non append related bug fixes
> > from 0.20-append branch:
> >
> > - HDFS-1164. TestHdfsProxy is failing.
> > - HDFS-1211. Block receiver should not log "rewind" packets at INFO
> >   level.
> > - HDFS-1118. Fix socketleak on DFSClient.
> > - HDFS-1210. DFSClient should log exception when block recovery fails.
> > - HDFS-606. Fix ConcurrentModificationException in
> >   invalidateCorruptReplicas.
> > - HDFS-561. Fix write pipeline READ_TIMEOUT.
> > - HDFS-1202. DataBlockScanner throws NPE when updated before
> >   initialized.
> >
> > Risk Level:
> > These are useful bugfixes from append branch and are not big changes to
> > the code base.
> >
> > These jiras have already been merged into 0.20-security branch.
> >
> >
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
