To reiterate, this is strictly a bug in HDFS whereby our write-ahead logs
have the data, but HDFS isn't giving it up.

HADOOP-4379 promises a fix for this issue, though it makes recovery slow.

Hadoop 0.21 promises the whizbang solution to this, but it's still 2
months out?

-ryan
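As an aside: the graceful per-node shutdown stack suggests further down the thread ("./bin/hbase-daemon.sh stop regionserver on each of your regionservers") can be scripted rather than run by hand. A hedged sketch, not from the thread itself; `HBASE_HOME`, the hostname-list file, and the `SSH_CMD` override are assumptions you'd adjust for your install:

```shell
# Hedged sketch: stop each regionserver nicely instead of kill -9.
# Assumes passwordless ssh to each node and the same HBASE_HOME everywhere.
stop_regionservers() {
  # $1 = file listing regionserver hostnames, one per line
  ssh_cmd=${SSH_CMD:-ssh}              # SSH_CMD lets you stub ssh for a dry run
  hbase_home=${HBASE_HOME:-/opt/hbase} # illustrative default path
  while read -r host; do
    [ -n "$host" ] || continue         # skip blank lines
    echo "stopping regionserver on $host"
    $ssh_cmd "$host" "$hbase_home/bin/hbase-daemon.sh stop regionserver"
  done < "$1"
}
```

Note this only helps when the regionservers still respond; a truly hung process may leave you with kill -9 anyway, which is exactly the data-loss window discussed below.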

On Thu, Jun 18, 2009 at 11:54 AM, stack <[email protected]> wrote:

> We now note in the filesystem the vitals that a script could use to
> reconstruct .META. (see the .regioninfo file under each region dir in
> the filesystem).  We've not yet written such a script (waiting on
> someone who needs it badly enough, I suppose).
>
> In 0.20.0, still no flush in hadoop.  There may be a workaround
> (HADOOP-4379); will let the list know if it proves viable.  The HDFS
> team have committed to a working flush in hadoop 0.21.
>
> St.Ack
>
> On Thu, Jun 18, 2009 at 11:35 AM, mike anderson <[email protected]> wrote:
>
> > 0.19.3, hdfs, 10 nodes fully distributed.
> >
> > Is there a way to rebuild what was lost (even partially)?  Will this
> > problem be fixed in 0.20?
> >
> >
> > On Thu, Jun 18, 2009 at 1:51 PM, stack <[email protected]> wrote:
> >
> > > You are on what version of hbase?
> > >
> > > My guess is its 0.19.x?
> > >
> > > How many nodes?  Are you using hdfs or local fs?
> > >
> > > The log below doesn't show any issues.
> > >
> > > So, as to what happened: I speculate that you loaded up your table,
> > > and then some issue -- did you up your file descriptors, xceivers,
> > > etc.? -- caused the hang before the edits recording the creation of
> > > your table and the addition of its regions had been persisted.  The
> > > hung hbase plus your kill -9 meant those catalog table edits were
> > > lost, so it appears your table is lost (HDFS does not have a working
> > > flush/sync/append in hadoop 0.19.x, so hbase can lose data).  There
> > > is nothing else you can do when hbase won't respond, though you could
> > > try ./bin/hbase-daemon.sh stop regionserver on each of your
> > > regionservers to try and bring them down nicely.
> > >
> > > In the head of the 0.19 branch we've narrowed the window in which we
> > > can lose edits (.META. now flushes every few k or so).  I need to put
> > > up a 0.19.4 release candidate (I'm held up tracing a new issue here
> > > on our home cluster).
> > >
> > > St.Ack
> > >
> > >
> > > On Thu, Jun 18, 2009 at 9:10 AM, mike anderson <[email protected]> wrote:
> > >
> > > > I had about 30,000 rows in my table 'cached_parsedtext'.  This
> > > > morning when I checked, hbase appeared to be down (the master web
> > > > UI was not responding and the shell crashed when I tried to count
> > > > rows).  I tried a nice shutdown via bin/stop-hbase, but it hung for
> > > > about 20 minutes, so I gave up and did a kill -9 on the hbase
> > > > processes (what else was I supposed to do!?).  Upon restarting I
> > > > discovered that all of the rows were gone.  I browsed the
> > > > filesystem and saw that some of the metadata still existed in
> > > > hadoop dfs.  Is there a way to rebuild the table?  (After the force
> > > > kill I also did a nice restart of hbase and hadoop -- same
> > > > results.)
> > > >
> > > > A few of the relevant-looking log files are included below for
> > > > those that speak the language.  However, these don't really mean
> > > > much to me.
> > > >
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: cached_parsedtext,,1244838542607 open on 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,039 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row cached_parsedtext,,1244838542607 in region .META.,,1 with startcode 1245337882941 and server 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:31,595 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region cached_parsedtext,,1244838542607 to the only server 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:34,823 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020
> > > >
> > > > Ideally I'd love to get my table back, but if not, learning how to
> > > > avoid this in the future would be great.
> > > >
> > > >
> > > > Thanks in advance,
> > > > Mike
> > > >
> > >
> >
>
