[ https://issues.apache.org/jira/browse/HBASE-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714446#comment-13714446 ]
Dave Latham commented on HBASE-8778:
------------------------------------

Thanks, [~lhofhansl], for taking a look.

{quote}+1 on doing the known-subdir approach in 0.94 and a tables-table in 0.96+.{quote}

Actually, what I was proposing in my comment on 17/Jul/13 was to not commit for 0.94, commit a known-subdir approach for 0.96, and open a new tables-table JIRA for 0.96+. A couple of reasons behind that.

First, the patch is not purely rolling compatible. Break the code down into writers (Master, tools like hbck, merge, compact) and readers (everything). If one writer is updated and writes in the new way, and an old writer then does an update in the old location only, new readers will miss that update. So the requirement is to make no writes from old writers once you upgrade to a new writer. I'm not sure we can/should make that a requirement for a rolling upgrade to 0.94, and I'm not sure there's a way around it.

Second, as you noted, the patch involves additional cleanup and refactoring. If you want to see it go into 0.94 I'm game to explore it further, but I'm also content to push for 0.96 and leave the 0.94 patch here for interested parties.

{quote}Could you run the full test suite with the patch attached? Have you been using this patch in any production environment, yet?{quote}

Yes, the tests pass for 0.94.5 + HBASE-8778-0.94.5-v2.patch. And as Ian noted, we've seen great results in production.

{quote}Is there a minimal version of the patch? It seems to include some extra clean up.{quote}

I don't currently have a minimal version of the patch. Because there is now a wait for a lock, I introduced a Configuration setting to be able to adjust the lock wait time.
Also, as I was putting this together, I noticed that some clients access static methods and others instance methods, without much reason. There is a fsreadonly field intended to prevent file system changes when set, but it is not (and cannot be) checked by the static methods, and even some instance methods call those static methods, which then ignore the field, losing the guarantee. The patch changes everything to require an instance, correctly enforces the fsreadonly flag everywhere, and can use the instance's Configuration. It also adds a great deal more commenting to make things clearer for the next maintainer.

{quote}There are some extra exceptions thrown now, that might throw off existing code. Could do this in 0.96+.{quote}

Examining them: FSTableDescriptors.add now throws NotImplementedException in fsreadonly mode instead of silently failing to add the descriptor. It's not called by any current code that sets fsreadonly (RegionServer), but I believe that if new code is added that does call it, failing loudly is better than failing silently. Likewise for updateHTableDescriptor, deleteTableDescriptorIfExists, and createTableDescriptor. Those are the only cases of new exceptions that I can find. I'll grant that if there is third-party code (CPs?) calling these that is currently failing silently, it is possible this change would break it. Though that code would have to update for the new method signatures anyway. Do we provide guarantees about compatibility on internal classes?

{quote}Why do we need the lock file approach now? Could we use the previous logic of creating files with unique names instead? Might also need to delete old tableinfo first, etc.{quote}

It was the only way I could find to make this work with a rolling upgrade (old readers and new readers simultaneously) and guarantee atomic updates in the presence of failures. If you can flesh out your approach a bit more and it works better, I'd love to lose the locks.
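To illustrate the fail-loudly behavior described above, here is a minimal sketch (not the actual patch: the class shape is illustrative, and UnsupportedOperationException stands in for commons-lang's NotImplementedException so the sketch has no external dependencies):

```java
// Minimal sketch of enforcing fsreadonly on an instance method. The real
// FSTableDescriptors.add throws org.apache.commons.lang.NotImplementedException;
// everything else here is illustrative, not HBase's actual code.
public class FSTableDescriptorsSketch {
    private final boolean fsreadonly;   // when true, no file system writes allowed

    public FSTableDescriptorsSketch(boolean fsreadonly) {
        this.fsreadonly = fsreadonly;
    }

    // An instance method can check the fsreadonly field; a static method cannot,
    // which is why the patch requires an instance everywhere.
    public void add(String tableName) {
        if (fsreadonly) {
            // Fail loudly instead of silently dropping the write.
            throw new UnsupportedOperationException(
                "Cannot add descriptor for " + tableName + " in read-only mode");
        }
        // ... write the table descriptor to the file system ...
    }
}
```

The point of the sketch is the first branch: a caller in read-only mode gets an immediate exception rather than a silently missing descriptor.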
If we proceed with 0.96 only, then we can do the migration once and don't need the locks.

{quote}Are the pom changes needed?{quote}

Nope, sorry. That was just to get Eclipse to compile the thing.

{quote}Also, with 130k regions @ 10GB each, do you have a 1.3PB table?!{quote}

Yeah, I definitely wish this table had fewer, larger regions. I've been watching the online merge discussions. This is an old table from when the default region sizes were much smaller. We are looking at migrating to a newer table with fatter regions as we grow, or merging this one down at some point. However, this issue will still be important even at that size (and the data keeps growing!). Thanks again for your thoughts.

> Region assigments scan table directory making them slow for huge tables
> -----------------------------------------------------------------------
>
>                 Key: HBASE-8778
>                 URL: https://issues.apache.org/jira/browse/HBASE-8778
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Dave Latham
>            Assignee: Dave Latham
>             Fix For: 0.98.0, 0.95.2, 0.94.11
>
>         Attachments: HBASE-8778-0.94.5.patch, HBASE-8778-0.94.5-v2.patch
>
>
> On a table with 130k regions it takes about 3 seconds for a region server to
> open a region once it has been assigned.
> Watching the threads for a region server running 0.94.5 that is opening many
> such regions shows the thread opening the region in code like this:
> {noformat}
> "PRI IPC Server handler 4 on 60020" daemon prio=10 tid=0x00002aaac07e9000 nid=0x6566 runnable [0x000000004c46d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.String.indexOf(String.java:1521)
>         at java.net.URI$Parser.scan(URI.java:2912)
>         at java.net.URI$Parser.parse(URI.java:3004)
>         at java.net.URI.<init>(URI.java:736)
>         at org.apache.hadoop.fs.Path.initialize(Path.java:145)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:126)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:50)
>         at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.getFullPath(HdfsFileStatus.java:215)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.makeQualified(DistributedFileSystem.java:252)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:311)
>         at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:159)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
>         at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:867)
>         at org.apache.hadoop.hbase.util.FSUtils.listStatus(FSUtils.java:1168)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:269)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoPath(FSTableDescriptors.java:255)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.getTableInfoModtime(FSTableDescriptors.java:368)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:155)
>         at org.apache.hadoop.hbase.util.FSTableDescriptors.get(FSTableDescriptors.java:126)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2834)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:2807)
>         at sun.reflect.GeneratedMethodAccessor64.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> {noformat}
> To open the region, the region server first loads the latest
> HTableDescriptor. Since HBASE-4553, HTableDescriptors are stored in the file
> system at "/hbase/<tableDir>/.tableinfo.<sequenceNum>". The file with the
> largest sequenceNum is the current descriptor. This is done so that the
> current descriptor is updated atomically. However, since the filename is not
> known in advance, FSTableDescriptors has to do a FileSystem.listStatus
> operation, which has to list all files in the directory to find it. The
> directory also contains all the region directories, so in our case it has to
> load 130k FileStatus objects. Even using a globStatus matching function
> still transfers all the objects to the client before performing the pattern
> matching. Furthermore, HDFS uses a default of transferring 1000 directory
> entries in each RPC call, so it requires 130 roundtrips to the namenode to
> fetch all the directory entries.
> Consequently, reassigning all the regions of a table (or a constant fraction
> thereof) requires time proportional to the square of the number of regions.
> In our case, if a region server fails with 200 such regions, it takes 10+
> minutes for them all to be reassigned, after the zk expiration and log
> splitting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
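To make the cost described in the issue concrete, here is a hedged sketch (illustrative names and shapes, not HBase's actual code) of why finding the current ".tableinfo.<sequenceNum>" file forces a scan of every entry in the table directory:

```java
import java.util.List;

public class TableInfoScanSketch {
    // The current descriptor is the ".tableinfo.<sequenceNum>" entry with the
    // largest sequence number. Because that name is not known in advance, every
    // directory entry returned by listStatus (including all ~130k region
    // directories in the case above) must be examined to find it.
    static String currentTableInfo(List<String> tableDirEntries) {
        String current = null;
        long maxSeq = -1;
        for (String name : tableDirEntries) {       // O(#entries) full scan
            if (!name.startsWith(".tableinfo.")) {
                continue;                           // skip region directories etc.
            }
            long seq = Long.parseLong(name.substring(".tableinfo.".length()));
            if (seq > maxSeq) {
                maxSeq = seq;
                current = name;
            }
        }
        return current;
    }
}
```

A known-subdir or tables-table approach removes this scan entirely, because the reader then knows exactly where to look without listing the table directory.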