On Fri, Jun 5, 2009 at 4:49 AM, Kirill Shabunov <[email protected]> wrote:
> > 1. I see some unprintable characters in the logs. I don't think it was > there before. I looked and found in > org.apache.hadoop.hbase.master.ProcessRegionOpen in line 82, LOG.info() > outputs "Bytes.toString(this.startCode)", which gives those non-printables. Thanks.... More logging was added in 0.19.4RC1. I made your suggested fixes in the branch. 2. Now, a more serious thing. Everything worked fine for a day. For > instance, it nicely survived a "kill -9" on a region with both ROOT and > META. After the timeout the regions were reassigned and the cluster resumed > normal operation. However, shortly after another regionserver was killed and > the cluster rebounded I found my tables could not be read because, for some > regions, HBase thinks the data is located in wrong files on DFS! Here is > what the client says now when trying to get data from one of the tables: First, hbase can lose data on crash because HDFS does not have a working flush. Sorry if we've not been clear that such is the case. A working flush will not be available until 0.21.0 HADOOP, it seems (though HADOOP-4379 is looking promising as a stopgap till 0.21.0 hadoop comes around). We'll keep the list up to date as we learn more. Whats happening below is covered in HBASE-1143. In short, regions are being doubly assigned. On crash, the .META. has lost some edits so its picture as to state of the cluster is off. I'm going to put up 0.19.4RC2 in a minute. It has some fixes that should help; there is the more frequent flushing mentioned in hbase-1143 but, as it turns out, we could also bungle our application of edits after crash so even if they weren't lost, they weren't being applied properly. Thanks for testing the RC Kirill. St.Ack > 09/06/05 12:56:58 ERROR hbase.CHBaseTest: I/O error occured. > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact > region server 192.168.37.26:60020 for region log_trace,,1243520531393, row > '', but failed after 10 attempts. > Exceptions: > java.io.IOException: java.io.IOException: HStoreScanner failed construction > at > org.apache.hadoop.hbase.regionserver.StoreFileScanner.<init>(StoreFileScanner.java:68) > at > org.apache.hadoop.hbase.regionserver.HStoreScanner.<init>(HStoreScanner.java:92) > at > org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2134) > at > org.apache.hadoop.hbase.regionserver.HRegion$HScanner.<init>(HRegion.java:2000) > at > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1187) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1714) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:632) > at > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:912) > Caused by: java.io.FileNotFoundException: File does not exist: > hdfs://namenode:9000/hbase/log_trace/1009406857/f/mapfiles/7948177727169613520/data > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415) > at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426) > at > org.apache.hadoop.hbase.io.MapFile$Reader.createDataFileReader(MapFile.java:327) > at > org.apache.hadoop.hbase.io.HBaseMapFile$HBaseReader.createDataFileReader(HBaseMapFile.java:95) > at org.apache.hadoop.hbase.io.MapFile$Reader.open(MapFile.java:309) > at > org.apache.hadoop.hbase.io.HBaseMapFile$HBaseReader.<init>(HBaseMapFile.java:78) > at > org.apache.hadoop.hbase.io.BloomFilterMapFile$Reader.<init>(BloomFilterMapFile.java:68) > at > org.apache.hadoop.hbase.regionserver.HStoreFile.getReader(HStoreFile.java:443) > at > org.apache.hadoop.hbase.regionserver.StoreFileScanner.openReaders(StoreFileScanner.java:127) > at > org.apache.hadoop.hbase.regionserver.StoreFileScanner.<init>(StoreFileScanner.java:65) > ... 10 more > > At this time /hbase/log_trace/1009406857/f/mapfiles contains > "7416235210013649080" subdir with seemingly normal data, but no > "7948177727169613520" subdir. > > Also, I see endless errors on the regionservers from > org.apache.hadoop.hbase.regionserver.CompactSplitThread complaining that > this or that file does not exist. > > It all seems to start about this moment: > > 2009-06-04 11:10:30,816 INFO org.apache.hadoop.hbase.regionserver.HRegion: > starting compaction on region > log_activity,20090527221746113test746698248,1243481739735 > 2009-06-04 11:10:46,517 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > log_activity,20090527221746113test746698248,1243481739735 in 15sec > 2009-06-04 11:10:46,517 INFO org.apache.hadoop.hbase.regionserver.HRegion: > starting compaction on region > log_data_exchange,20090528220725927test2142083448,1243570162509 > 2009-06-04 11:11:12,821 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > log_data_exchange,20090528220725927test2142083448,1243570162509 in 26sec > 2009-06-04 11:11:12,822 INFO org.apache.hadoop.hbase.regionserver.HRegion: > starting compaction on region > log_data_exchange-recipient_name,196732233120090528190946653test807696160,1243578229425 > 2009-06-04 11:11:33,899 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN: > ..META.,,1: safeMode=false > 2009-06-04 11:11:33,899 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: > .META.,,1: safeMode=false > 2009-06-04 11:11:34,839 INFO org.apache.hadoop.hbase.regionserver.HRegion: > region ..META.,,1/1028785192 available > 2009-06-04 11:11:44,874 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > log_data_exchange-recipient_name,196732233120090528190946653test807696160,1243578229425 > in 32sec > 2009-06-04 11:11:44,875 INFO org.apache.hadoop.hbase.regionserver.HRegion: > starting compaction on region > log_data_exchange,20090529115336764test675370772,1243644061166 > 2009-06-04 11:11:44,881 WARN org.apache.hadoop.hbase.regionserver.HStore: > Failed with java.io.FileNotFoundException: File does not exist: > hdfs://namenode:9000/hbase/log_data_exchange/2095948567/f/mapfiles/554805660147274743/data: > 2095948567/f/554805660147274743 > 2009-06-04 11:11:44,885 ERROR > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split > failed for region > log_data_exchange,20090529115336764test675370772,1243644061166 > java.io.FileNotFoundException: File does not exist: > hdfs://namenode:9000/hbase/log_data_exchange/2095948567/f/mapfiles/554805660147274743/data > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415) > at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431) > at > org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426) > at > org.apache.hadoop.hbase.io.MapFile$Reader.createDataFileReader(MapFile.java:327) > at > org.apache.hadoop.hbase.io.HBaseMapFile$HBaseReader.createDataFileReader(HBaseMapFile.java:95) > at org.apache.hadoop.hbase.io.MapFile$Reader.open(MapFile.java:309) > at > org.apache.hadoop.hbase.io.HBaseMapFile$HBaseReader.<init>(HBaseMapFile.java:78) > at > org.apache.hadoop.hbase.io.BloomFilterMapFile$Reader.<init>(BloomFilterMapFile.java:68) > at > org.apache.hadoop.hbase.regionserver.HStoreFile.getReader(HStoreFile.java:443) > at > org.apache.hadoop.hbase.regionserver.HStore.createReaders(HStore.java:1049) > at > org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:922) > at > org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:732) > at > org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:689) > at > org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:105) > 2009-06-04 11:11:44,887 INFO org.apache.hadoop.hbase.regionserver.HRegion: > starting compaction on region > log_activity,20090527160652393test2101793466,1243459595085 > 2009-06-04 11:12:07,218 INFO org.apache.hadoop.hbase.regionserver.HRegion: > compaction completed on region > log_activity,20090527160652393test2101793466,1243459595085 in 22sec > > > --Kirill >
