Thanks for the detailed information Michael, Colin, Suresh, Kihwal. It looks like we're on CentOS 5.7 (kernel 2.6.18-274.el5), so if the fix was included in 5.4 then it sounds like we should have it. I don't believe we're using LVM.
It sounds like HDFS could improve handling of this scenario by fsyncing directories after finalizing blocks, or perhaps by accepting blocksBeingWritten replicas that are reported later. At least as an option for clients, like HBase, that need to rely on durability.

Dave

On Wed, Jul 3, 2013 at 11:57 AM, Kihwal Lee <kih...@yahoo-inc.com> wrote:

> For the ext3 bug Colin mentioned, see https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in 2.6.32 and backported in RHEL 5.4 (or CENTOS). This has more to do with file data and affects the NN more. Since the NN preallocates blocks for edits, almost all data writes are done without modifying the file length. If fsync() loses the race to pdflush, the edit log content won't be flushed. I am glad this bug was fixed.
>
> Regardless of this bug, renames will be journaled but not necessarily committed before returning. If power fails before the transaction gets committed, the transaction can be lost. Journaling will make sure the file system comes back in a consistent state, but not all changes will be there. To make a rename durable, the directory can be opened and fsync()'ed. I think both source and destination are recorded in one transaction, so fsyncing the destination directory should be enough.
>
> When RWR blocks are reported, the NN can probably accept and finalize them if the block size is as expected.
>
> Kihwal
>
> On 7/3/13 10:12 AM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
>
>> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>
>>> Dave,
>>>
>>> Thanks for the detailed email. Sorry I did not read all the details you had sent earlier completely (on my phone). As you said, this is not related to the data loss issue around the HBase log and hsync. I think you are right; the rename operation itself might not have hit the disk. I think we should either ensure the metadata operation is synced on the datanode or handle it being reported as blockBeingWritten. Let me spend some time debugging this issue.
>>
>> In theory, ext3 is journaled, so all metadata operations should be durable in the case of a power outage. It is only data operations that should be possible to lose. It is the same for ext4. (Assuming you are not using nonstandard mount options.)
>>
>> In practice, it is possible that your hard disks didn't actually persist the data that they said they did. Rumor has it that some drives ignore the SATA FLUSH CACHE command in some cases, since it makes them look bad in benchmarks. In that case, there is nothing the filesystem or any other software can do.
>>
>> There was also a bug in older Linux kernels where the kernel would not actually send FLUSH CACHE. Since the poster is using ext3 and hadoop-1, it's possible he's also using an antique kernel as well. I know for sure this affected LVM -- it used to ignore barriers until fairly recently.
>>
>> In Ceph, we used to recommend disabling the hard drive write cache if your kernel was older than 2.6.33. You can read the recommendation for yourself here: http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
>> This will have an impact on performance, however.
>>
>> An uninterruptible power supply is not a bad idea.
>>
>> I am curious: what kernel version are you using? Are you using LVM?
>>
>> Colin
>>
>>> One surprising thing is, all the replicas were reported as blockBeingWritten.
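Kihwal's point above about opening and fsync()'ing the directory is the key to the rename-durability question. A minimal sketch of that technique follows; it is illustrative only, not HDFS code: the DurableRename/renameDurably/fsyncDirectory names are invented here, and it uses Java 7 NIO for brevity even though the branch-1 datanode is Java 6 era.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    public class DurableRename {

      // Rename src to dst and make the rename itself survive a power failure.
      public static void renameDurably(Path src, Path dst) throws IOException {
        // The rename is journaled by ext3/ext4, but the journal transaction is
        // not necessarily committed to disk before this call returns.
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);

        // Open the destination directory and force it: this pushes the new
        // directory entry (and the journal transaction carrying it) to disk.
        fsyncDirectory(dst.getParent());
      }

      private static void fsyncDirectory(Path dir) throws IOException {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
          ch.force(true);  // fsync(2) on the directory file descriptor
        }
        // Some platforms/filesystems do not support fsync on a directory and
        // will throw IOException here; callers may choose to tolerate that.
      }
    }

The extra force(true) is an additional disk flush per finalized block or rename, which is presumably why this would be offered as an option for durability-sensitive clients like HBase rather than as the default.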
>>> Regards,
>>> Suresh
>>>
>>> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <lat...@davelink.net> wrote:
>>>
>>>> (Removing hbase list and adding hdfs-dev list as this is pretty internal stuff).
>>>>
>>>> Reading through the code a bit:
>>>>
>>>> FSDataOutputStream.close calls
>>>> DFSOutputStream.close calls
>>>> DFSOutputStream.closeInternal
>>>>   - sets currentPacket.lastPacketInBlock = true
>>>>   - then calls
>>>> DFSOutputStream.flushInternal
>>>>   - enqueues current packet
>>>>   - waits for ack
>>>>
>>>> BlockReceiver.run
>>>>   - if (lastPacketInBlock && !receiver.finalized) calls
>>>> FSDataset.finalizeBlock calls
>>>> FSDataset.finalizeBlockInternal calls
>>>> FSVolume.addBlock calls
>>>> FSDir.addBlock calls
>>>> FSDir.addBlock
>>>>   - renames the block from the "blocksBeingWritten" tmp dir to the "current" dest dir
>>>>
>>>> This looks to me like the synchronous chain I would expect, from a DFS client closing a file to the block files being moved from blocksBeingWritten to the current dir, so that once the file is closed the block files are in the proper directory - even if the contents of the file are still in the OS buffer rather than synced to disk. It's only after this moving of blocks that NameNode.completeFile is called. There are several conditions and loops in there, so without a greater understanding of the code I'm not certain this chain is fully reliable in all cases.
>>>>
>>>> Could it be the case that the rename operation itself is not synced and that ext3 lost the fact that the block files were moved? Or is there a bug in the close file logic such that for some reason the block files are not always moved into place when a file is closed?
>>>>
>>>> Thanks for your patience,
>>>> Dave
>>>>
>>>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <lat...@davelink.net> wrote:
>>>>
>>>>> Thanks for the response, Suresh.
>>>>>
>>>>> I'm not sure that I understand the details properly. From my reading of HDFS-744, the hsync API would allow a client to make sure that at any point in time its writes so far have hit the disk. For example, for HBase it could apply an fsync after adding some edits to its WAL to ensure those edits are fully durable for a file which is still open.
>>>>>
>>>>> However, in this case the dfs file was closed and even renamed. Is it the case that even after a dfs file is closed and renamed, the data blocks would still not be synced and would still be stored by the datanode in "blocksBeingWritten" rather than in "current"? If that is the case, would it be better for the NameNode not to reject replicas that are in blocksBeingWritten, especially if it doesn't have any other replicas available?
>>>>>
>>>>> Dave
>>>>>
>>>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <sur...@hortonworks.com> wrote:
>>>>>
>>>>>> Yes, this is a known issue.
>>>>>>
>>>>>> The HDFS part of this was addressed in https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not available in the 1.x releases. I think HBase does not use this API yet.
>>>>>>
>>>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <lat...@davelink.net> wrote:
>>>>>>
>>>>>>> We're running HBase over HDFS 1.0.2 on about 1000 nodes.
>>>>>>> On Saturday the data center we were in had a total power failure and the cluster went down hard. When we brought it back up, HDFS reported 4 files as CORRUPT. We recovered the data in question from our secondary datacenter, but I'm trying to understand what happened and whether this is a bug in HDFS that should be fixed.
>>>>>>>
>>>>>>> From what I can tell the file was created and closed by the dfs client (hbase). Then HBase renamed it into a new directory and deleted some other files containing the same data. Then the cluster lost power. After the cluster was restarted, the datanodes reported into the namenode but the blocks for this file appeared as "blocks being written" - the namenode rejected them and the datanodes deleted the blocks. At this point there were no replicas for the blocks and the files were marked CORRUPT. The underlying file systems are ext3. Some questions that I would love to get answers for, if anyone with deeper understanding of HDFS can chime in:
>>>>>>>
>>>>>>> - Is this a known scenario where data loss is expected? (I found HDFS-1539 but that seems different)
>>>>>>> - When are blocks moved from blocksBeingWritten to current? Does that happen before a file close operation is acknowledged to an hdfs client?
>>>>>>> - Could it be that the DataNodes actually moved the blocks to current but after the restart ext3 rewound state somehow (forgive my ignorance of underlying file system behavior)?
>>>>>>> - Is there any other explanation for how this can happen?
>>>>>>>
>>>>>>> Here is a sequence of selected relevant log lines from the RS (HBase Region Server), NN (NameNode) and DN (DataNode - 1 example of the 3 in question). It includes everything that mentions the block in question in the NameNode and one DataNode log. Please let me know if more information would be helpful.
>>>>>>>
>>>>>>> RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils: Creating file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c with permission=rwxrwxrwx
>>>>>>> NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>>> > blk_1395839728632046111_357084589 > >>>>> > DN 2013-06-29 11:16:06,832 INFO > >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > >>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: / > >>>>> > 10.0.5.237:50010 > >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is > >>>>>added > >>>>> > to > >>>>> > blk_1395839728632046111_357084589 size 25418340 > >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is > >>>>>added > >>>>> > to > >>>>> > blk_1395839728632046111_357084589 size 25418340 > >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is > >>>>>added > >>>>> > to > >>>>> > blk_1395839728632046111_357084589 size 25418340 > >>>>> > DN 2013-06-29 11:16:11,385 INFO > >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > >>>>> > blk_1395839728632046111_357084589 of size 25418340 from > >>>>> > /10.0.5.237:14327 > >>>>> > >>>>> > DN 2013-06-29 11:16:11,385 INFO > >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 > >>>>>for > >>>>> > block blk_1395839728632046111_357084589 terminating > >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > Removing lease on file > >>>>> > > >>>>> > > >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b > >>>>>a5a539fdf159c4c > >>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932 > >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > DIR* > >>>>> > NameSystem.completeFile: file > >>>>> > > >>>>> > > >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b > >>>>>a5a539fdf159c4c > >>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932 > >>>>> > RS 2013-06-29 11:16:11,393 INFO > >>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted > >>>>>file at > >>>>> > > >>>>> > > >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e > >>>>>0cc30af6e64e56ba5a539fdf159c4c > >>>>> > to > >>>>> > > >>>>> > > >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc > >>>>>30af6e64e56ba5a539fdf159c4c > >>>>> > RS 2013-06-29 11:16:11,505 INFO > >>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major > >>>>>compaction > >>>>> > of 7 > >>>>> > file(s) in n of > >>>>> > > >>>>>users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. > >>>>> > into > >>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store > >>>>>is > >>>>> > 24.2m > >>>>> > > >>>>> > ------- CRASH, RESTART --------- > >>>>> > > >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 > >>>>>but > >>>>> > was > >>>>> > rejected: Reported as block being written but is a block of closed > >>>>> > file. 
> >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>>> > invalidSet > >>>>> > of 10.0.6.1:50010 > >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size > 16971264 > >>>>> > but > >>>>> > was rejected: Reported as block being written but is a block of > >>>>>closed > >>>>> > file. > >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>>> > invalidSet > >>>>> > of 10.0.5.237:50010 > >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for > >>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 > >>>>>but > >>>>> > was rejected: Reported as block being written but is a block of > >>>>>closed > >>>>> > file. > >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: > >>>>> > BLOCK* > >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to > >>>>> > invalidSet > >>>>> > of 10.0.6.24:50010 > >>>>> > (note the clock on the server running DN is wrong after restart. I > >>>>> > believe timestamps are off by 6 hours:) > >>>>> > DN 2013-06-29 06:07:22,877 INFO > >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block > >>>>> > blk_1395839728632046111_357084589 file > >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 > >>>>>for > >>>>> > deletion > >>>>> > DN 2013-06-29 06:07:24,952 INFO > >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block > >>>>> > blk_1395839728632046111_357084589 at file > >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 > >>>>> > > >>>>> > > >>>>> > Thanks, > >>>>> > Dave > >>>>> > > >>>>> > >>>>> > >>>>> > >>>>> -- > >>>>> http://hortonworks.com/download/ > >>>> > >>>> > >>> > >> > >> > >> > >> -- > >> http://hortonworks.com/download/ > >
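The "Reported as block being written but is a block of closed file" rejections in the log above are the other place the thread suggests loosening. Below is a purely hypothetical sketch of the policy Kihwal and Dave describe earlier (the class, enum and parameter names are invented, and this is not the actual NameNode code path): when a blocksBeingWritten replica is reported for an already-closed file, accept and finalize it if the length matches what the NameNode recorded, and at minimum avoid invalidating it when it is the only copy left.

    // Hypothetical sketch only; not the actual NameNode logic.
    class ReportedReplicaPolicy {

      enum Action {
        ACCEPT_AND_FINALIZE,  // count the replica as a valid finalized replica
        KEEP_FOR_RECOVERY,    // keep the bytes instead of scheduling deletion
        INVALIDATE            // current behaviour: add to invalidSet and delete
      }

      /**
       * Decide what to do when a datanode reports a replica from its
       * blocksBeingWritten directory for a block of an already-closed file.
       *
       * @param reportedLength    length of the replica the datanode reported
       * @param expectedLength    length the NameNode recorded at file close
       * @param otherLiveReplicas finalized replicas already known for the block
       */
      Action onBlockBeingWrittenReported(long reportedLength,
                                         long expectedLength,
                                         int otherLiveReplicas) {
        if (reportedLength == expectedLength) {
          // Kihwal's suggestion: all the data reached the disk, only the rename
          // into "current" was lost, so accept and finalize the replica.
          return Action.ACCEPT_AND_FINALIZE;
        }
        if (otherLiveReplicas == 0) {
          // Dave's suggestion: if these are the only copies, deleting them turns
          // a truncated file into a wholly lost one; keep whatever hit the disk.
          return Action.KEEP_FOR_RECOVERY;
        }
        return Action.INVALIDATE;
      }
    }

Note that in the logs above the reported sizes (21978112, 16971264 and 21913088) are all short of the finalized size 25418340, so the strict size-match rule alone would not have saved these replicas; whether to keep a truncated last copy for recovery is exactly the open question in the thread.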