To reiterate, this is strictly a bug in HDFS whereby our write-ahead logs have the data, but HDFS isn't giving it up.
HADOOP-4379 promises a fix to this issue, though it makes recovery slowish. Hadoop 0.21 promises to have the whizbang solution to this, but still 2 months out?

-ryan

On Thu, Jun 18, 2009 at 11:54 AM, stack <[email protected]> wrote:
> We added noting in the filesystem the vitals that could be used by a script
> reconstructing the .META. (see the .regioninfo file under each region dir
> in the filesystem). We've not yet written up a script to reconstruct
> (waiting on someone who needs it badly enough, I suppose).
>
> In 0.20.0, still no flush in hadoop. There may be a workaround. Will let
> the list know if it proves viable (HADOOP-4379). The HDFS team have
> committed to a working flush in hadoop 0.21.
>
> St.Ack
>
> On Thu, Jun 18, 2009 at 11:35 AM, mike anderson <[email protected]> wrote:
> > 0.19.3, hdfs, 10 nodes fully distributed.
> >
> > Is there a way to rebuild what was lost (even partially)? Will this
> > problem be fixed in 0.20?
> >
> > On Thu, Jun 18, 2009 at 1:51 PM, stack <[email protected]> wrote:
> > > You are on what version of hbase?
> > >
> > > My guess is it's 0.19.x?
> > >
> > > How many nodes? You using hdfs or local fs?
> > >
> > > The log below doesn't show issues.
> > >
> > > So, as to what happened: I speculate that you loaded up your table and
> > > then there was some issue (did you up your file descriptors, xceivers,
> > > etc.?) that caused the hang, but the uploads, in particular the edits
> > > recording the creation of your table and the addition of table regions,
> > > had not been persisted. The hung-up hbase and your kill -9 (there is
> > > nothing else you can do when it won't respond, though you could try
> > > ./bin/hbase-daemon.sh stop regionserver on each of your regionservers
> > > to try and bring them down nicely) meant the catalog table edits were
> > > lost, so it appears your table is lost (HDFS does not have a working
> > > flush/sync/append in hadoop 0.19.x, so hbase can lose data).
> > >
> > > In the head of the 0.19 branch we've done stuff to make the window
> > > whereby we lose edits narrower (.META. flushes every few k or so). I
> > > need to put up a 0.19.4 release candidate (I'm held up by my tracing a
> > > new issue here on our home cluster).
> > >
> > > St.Ack
> > >
> > > On Thu, Jun 18, 2009 at 9:10 AM, mike anderson <[email protected]> wrote:
> > > > I had about 30,000 rows in my table 'cached_parsedtext'. This morning
> > > > when I checked, hbase appeared to be down (the master server web UI
> > > > was not responding and the shell crashed when I tried to count rows).
> > > > I tried doing a nice shutdown via bin/stop-hbase, but this hung for
> > > > about 20 minutes, so I gave up and did a kill -9 on the hbase
> > > > processes (what else was I supposed to do!?). Upon restarting I
> > > > discovered that all of the rows were gone. I browsed the filesystem
> > > > and saw that some of the metadata still existed in hadoop dfs. Is
> > > > there a way to rebuild the table? (After the force kill I also did a
> > > > nice restart of hbase and hadoop -- same results.)
> > > >
> > > > A few of the relevant-looking log files are included below for those
> > > > that speak the language. However, these don't really mean much to me.
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,038 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: cached_parsedtext,,1244838542607 open on 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:12:42,039 INFO org.apache.hadoop.hbase.master.ProcessRegionOpen$1: updating row cached_parsedtext,,1244838542607 in region .META.,,1 with startcode 1245337882941 and server 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:31,595 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region cached_parsedtext,,1244838542607 to the only server 10.0.16.91:60020
> > > > logs/hbase-pubget-master-carr.domain.com.log:2009-06-18 11:31:34,823 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: cached_parsedtext,,1244838542607: safeMode=false from 10.0.16.91:60020
> > > >
> > > > Ideally I'd love to get my table back, but if not, learning how to
> > > > avoid this in the future would be great.
> > > >
> > > > Thanks in advance,
> > > > Mike
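[Editor's note: stack mentions above that the vitals noted in the filesystem (the .regioninfo file under each region dir) could feed a script reconstructing .META., but that no such script had been written. Below is a minimal, hedged sketch of just the discovery step, assuming the HBase root dir has been copied out of HDFS to a local path (e.g. with hadoop fs -copyToLocal); the helper name find_region_dirs and the exact directory layout are assumptions based on the 0.19/0.20-era scheme described in the thread, not a verified cluster.]

```python
import os

def find_region_dirs(hbase_rootdir):
    """Return the directories under hbase_rootdir that contain a
    .regioninfo file, i.e. the per-region metadata a .META. rebuild
    script would start from. Directory layout is an assumption based
    on the 0.19/0.20-era scheme discussed in the thread."""
    region_dirs = []
    for dirpath, _dirnames, filenames in os.walk(hbase_rootdir):
        if ".regioninfo" in filenames:
            region_dirs.append(dirpath)
    return sorted(region_dirs)
```

An actual rebuild would then have to deserialize each .regioninfo (a serialized HRegionInfo, if memory serves) and insert a matching row into .META.; that part is version-specific and is left out of this sketch.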
