Dave Latham created HBASE-8778:
----------------------------------

             Summary: Region assignments scan table directory making them slow 
for huge tables
                 Key: HBASE-8778
                 URL: https://issues.apache.org/jira/browse/HBASE-8778
             Project: HBase
          Issue Type: Improvement
            Reporter: Dave Latham
         Attachments: HBASE-8778-0.94.5.patch

On a table with 130k regions it takes about 3 seconds for a region server to 
open a region once it has been assigned.

Watching the threads of a region server running 0.94.5 while it opens many 
such regions shows the thread opening the region in code like this:
{noformat}
"PRI IPC Server handler 4 on 60020" daemon prio=10 tid=0x00002aaac07e9000 
nid=0x6566 runnable [0x000000004c46d000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.String.indexOf(String.java:1521)
        at java.net.URI$Parser.scan(URI.java:2912)
        at java.net.URI$Parser.parse(URI.java:3004)
        at java.net.URI.<init>(URI.java:736)
        at org.apache.hadoop.fs.Path.initialize(Path.java:145)
        at org.apache.hadoop.fs.Path.<init>(Path.java:126)
        at org.apache.hadoop.fs.Path.<init>(Path.java:50)
        at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
        at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:252)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:311)
        at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:159)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
        at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1168)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:269)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:255)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoModtime(FSTableDescriptors.java:368)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:155)
        at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:126)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2834)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2807)
        at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
{noformat}

To open the region, the region server first loads the latest HTableDescriptor.  
Since HBASE-4553, HTableDescriptors are stored in the file system at 
"/hbase/<tableDir>/.tableinfo.<sequenceNum>".  The file with the largest 
sequenceNum is the current descriptor.  This is done so that the current 
descriptor is updated atomically.  However, since the filename is not known in 
advance, FSTableDescriptors has to do a FileSystem.listStatus operation, which 
has to list all files in the directory to find it.  The directory also contains 
all the region directories, so in our case it has to load 130k FileStatus 
objects.  Even using a globStatus matching function still transfers all the 
objects to the client before performing the pattern matching.  Furthermore, HDFS 
by default transfers 1000 directory entries in each RPC call, so it requires 
130 round trips to the namenode to fetch all the directory entries.
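
The lookup described above can be sketched roughly as follows.  This is a 
simplified stand-in, not the actual FSTableDescriptors code: the class 
TableInfoScan and the methods latestTableInfo and listingRoundTrips are 
hypothetical names, and a plain list of entry names stands in for the 
FileStatus array returned by listStatus.  It only shows why every entry in the 
table directory has to be examined to find the newest ".tableinfo" file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the real HBase implementation.
public class TableInfoScan {
    static final String PREFIX = ".tableinfo.";

    // Scans every directory entry and returns the ".tableinfo.<sequenceNum>"
    // name with the largest sequence number, or null if none is present.
    static String latestTableInfo(List<String> entries) {
        String best = null;
        long bestSeq = -1;
        for (String name : entries) {
            if (!name.startsWith(PREFIX)) {
                continue; // region directories and other files are skipped
            }
            long seq;
            try {
                seq = Long.parseLong(name.substring(PREFIX.length()));
            } catch (NumberFormatException e) {
                continue; // suffix is not a sequence number
            }
            if (seq > bestSeq) {
                bestSeq = seq;
                best = name;
            }
        }
        return best;
    }

    // Number of listing RPCs needed when the server returns at most
    // batchSize entries per call (1000 is the default cited above).
    static int listingRoundTrips(int entryCount, int batchSize) {
        return (entryCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < 130_000; i++) {
            entries.add("region-" + i); // made-up region directory names
        }
        entries.add(".tableinfo.0000000002");
        entries.add(".tableinfo.0000000003");
        System.out.println(latestTableInfo(entries));
        System.out.println(listingRoundTrips(130_000, 1000));
    }
}
```

The scan is linear in the number of directory entries, so its cost is dominated 
by the region directories rather than by the handful of descriptor files.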

Consequently, because every region open lists the entire table directory, 
reassigning all the regions of a table (or a constant fraction thereof) takes 
time proportional to the square of the number of regions.

In our case, if a region server fails with 200 such regions, it takes 10+ 
minutes for them all to be reassigned, after the zk expiration and log 
splitting.
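
As a rough back-of-the-envelope check of the figures above, the sketch below 
multiplies them out.  The class and method names (ReassignCost, entriesListed, 
roundTripsPerOpen, serialSeconds) are hypothetical, and it assumes the regions 
are opened one after another at about 3 seconds each, which is a 
simplification rather than a measured behavior.

```java
// Hypothetical arithmetic helper; the inputs are the figures from this report.
public class ReassignCost {
    // Listing RPCs for one region open, at batchSize entries per RPC.
    static long roundTripsPerOpen(long entriesInTableDir, long batchSize) {
        return (entriesInTableDir + batchSize - 1) / batchSize;
    }

    // Total directory entries transferred when each open lists the whole dir.
    static long entriesListed(long regionsToOpen, long entriesInTableDir) {
        return regionsToOpen * entriesInTableDir;
    }

    // Wall-clock estimate, assuming opens run strictly one at a time.
    static long serialSeconds(long regionsToOpen, long secondsPerOpen) {
        return regionsToOpen * secondsPerOpen;
    }

    public static void main(String[] args) {
        long tableRegions = 130_000; // entries in the table directory
        long failedRegions = 200;    // regions hosted by the failed server
        System.out.println(entriesListed(failedRegions, tableRegions));           // 26000000
        System.out.println(failedRegions * roundTripsPerOpen(tableRegions, 1000)); // 26000
        System.out.println(serialSeconds(failedRegions, 3));                       // 600
    }
}
```

Under those assumptions, 200 opens come to about 600 seconds, which is 
consistent with the 10+ minutes observed.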


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
