Thanks Ted. I've looked at the history of that code and believe that this race and test case only cause a perf problem (we may roll extra times), and that it has been around since 0.90.5.
I'm +1'ing this for release also.

On Thu, Dec 19, 2013 at 9:45 AM, Ted Yu <[email protected]> wrote:

> There used to be some comment around low replication checking:
>
>     // TODO: preserving the old behavior for now, but this check is strange. It's not
>     // protected by any locks here, so for all we know rolling locks might start
>     // as soon as we enter the "if". Is this best-effort optimization check?
>     if (!this.logRollRunning) {
>       checkLowReplication();
>
> This means that checkLowReplication() may be running when
> FSHLog#rollWriter() is also running - hence the race.
> That is why checkLowReplication() is now put under a reentrant lock.
>
> w.r.t. rolling a new RC, I would leave the decision making up to you.
>
> The low replication check has several parameters. One of them is the
> following:
>
>     this.lowReplicationRollLimit = conf.getInt(
>         "hbase.regionserver.hlog.lowreplication.rolllimit", 5);
>
> When consecutive log rolls exceed the above conf value, lowReplicationRoll
> is disabled.
>
> If the replica count goes down because of an issue in the Datanodes, the above
> limit would ultimately be reached.
>
> Cheers
>
> On Thu, Dec 19, 2013 at 9:21 AM, Jonathan Hsieh <[email protected]> wrote:
>
> > Ted,
> >
> > My question really boils down to this -- can we have data loss if we
> > don't take in HBASE-10142 or not? It will take me a little more time to
> > reverse engineer and convince myself one way or another. If we can lose
> > data, then I'll take the port and do another rc. If we don't (we are
> > just less performant) then since we have 4 binding +1's other than me I'll
> > release.
> >
> > In either case, I want to understand how it fixes the problem. I can see the
> > changes, e.g. the re-entrant lock and the shifting out of the
> > checkLowReplication call, but I need to figure out why the changes were
> > necessary.
> > The main substantive change is that the lock is taken around the
> > checkLowReplication call, but I haven't put together why this fixes the
> > problem. Can you shed more light (and ideally lay out why the changes fix
> > the problem in the jira? -- maybe a bad multithread trace that is fixed by
> > the new lock?)
> >
> > Thanks,
> > Jon.
> >
> >
> > On Thu, Dec 19, 2013 at 9:01 AM, Ted Yu <[email protected]> wrote:
> >
> > > Jon:
> > > If Stack gives the green light, I can certainly port it to the 0.96 branch.
> > >
> > > Cheers
> > >
> > >
> > > On Thu, Dec 19, 2013 at 8:39 AM, Jonathan Hsieh <[email protected]> wrote:
> > >
> > > > When I ran the test standalone, I didn't have a failure in 0.96.1.1rc0
> > > > or 0.96.1. When I ran the whole suite, I ran into exactly the same failure
> > > > on 0.96.1.1 (currently testing the full suite from the 0.96.1 src tarball).
> > > >
> > > > I've spent some time reviewing HBASE-10142. There are some non-test code
> > > > modifications; I'm still trying to determine whether it is a serious problem
> > > > or not on that side. Ted, is there a reason why this wasn't ported to the
> > > > 0.96 branch?
> > > >
> > > > Jon.
> > > >
> > > >
> > > > On Wed, Dec 18, 2013 at 9:00 PM, Jean-Marc Spaggiari <
> > > > [email protected]> wrote:
> > > >
> > > > > The typo is because I had not done a cut&paste ;)
> > > > >
> > > > > With cut&paste: mvn test -PrunAllTests -Dsurefire.secondPartThreadCount=8
> > > > >
> > > > > On the last run, the error is:
> > > > > Failed tests:
> > > > >
> > > > > testLogRollOnDatanodeDeath(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling):
> > > > > LowReplication Roller should've been disabled, current replication=1
> > > > >
> > > > > I did not keep the errors from the previous runs...
> > > > >
> > > > >
> > > > > 2013/12/18 Ted Yu <[email protected]>
> > > > >
> > > > > > What error(s) did you see ?
> > > > > >
> > > > > > There was a typo in this def ('s' between D and u):
> > > > > > -Dsurefire.secondPartThreadCount
> > > > > >
> > > > > > You can lower the thread count: in the trunk build, a value of 2 is used.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > >
> > > > > > On Wed, Dec 18, 2013 at 8:45 PM, Jean-Marc Spaggiari <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > I tried multiple times over many hours to run:
> > > > > > > mvn test -PrunallTests -Dusefire.secondPartThreadCount=8
> > > > > > >
> > > > > > > on a local machine using the src jar, with no success. I might be
> > > > > > > missing something... I will investigate so I will be able to provide
> > > > > > > better feedback for 0.96.2...
> > > > > > >
> > > > > > > Sorry about that.
> > > > > > >
> > > > > > >
> > > > > > > 2013/12/18 Enis Söztutar <[email protected]>
> > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > - downloaded the artifacts
> > > > > > > > - checked checksums
> > > > > > > > - checked sigs
> > > > > > > > - checked hadoop libs in h1 / h2
> > > > > > > > - checked directory layouts
> > > > > > > > - run local cluster
> > > > > > > > - run smoke tests with shell on the artifacts
> > > > > > > > - run tests locally:
> > > > > > > > -- bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -write 10:10:100
> > > > > > > > -num_keys 1000000 -read 100:30
> > > > > > > > -- bin/hbase "org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList
> > > > > > > > Loop 1 1 3000000 /tmp/biglinkedlist 1"
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Dec 18, 2013 at 12:45 PM, Stack <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Downloaded, unbundled, checked layout, and ran it. Browsed the mvn
> > > > > > > > > artifacts.
> > > > > > > > >
> > > > > > > > > St.Ack
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Dec 17, 2013 at 2:23 PM, Jonathan Hsieh <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > This is a quick-fix release directly off of the 0.96.1 release. It
> > > > > > > > > > can be downloaded here:
> > > > > > > > > >
> > > > > > > > > > http://people.apache.org/~jmhsieh/hbase-0.96.1.1rc0/
> > > > > > > > > >
> > > > > > > > > > The maven staging repo is here:
> > > > > > > > > >
> > > > > > > > > > https://repository.apache.org/content/repositories/orgapachehbase-056/
> > > > > > > > > >
> > > > > > > > > > There is only one jira'ed patch in this release, HBASE-10188 [1]. This
> > > > > > > > > > fixes an API incompatibility introduced between 0.96.0 and 0.96.1. Other
> > > > > > > > > > changes include updates to CHANGES.txt (to include that jira) and pom.xml
> > > > > > > > > > (naming the release 0.96.1.1).
> > > > > > > > > >
> > > > > > > > > > As such, we'll have an abridged testing and voting period. Please have a
> > > > > > > > > > quick look and vote +1/-1 by 12/18/13 23:59 Pacific time. If this passes
> > > > > > > > > > we'll take down 0.96.1.
> > > > > > > > > >
> > > > > > > > > > Please check the release mechanics -- this is my first attempt at an
> > > > > > > > > > hbase release.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-10188
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > // Jonathan Hsieh (shay)
> > > > > > > > > > // Software Engineer, Cloudera
> > > > > > > > > > // [email protected]
> > > >
> > > > --
> > > > // Jonathan Hsieh (shay)
> > > > // Software Engineer, Cloudera
> > > > // [email protected]
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // [email protected]

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
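[Editor's note: the roll-limit behavior Ted describes above can be sketched as below. This is a hypothetical simplification, not the actual FSHLog code; the class, method, and field names are illustrative, and only the config key "hbase.regionserver.hlog.lowreplication.rolllimit", its default of 5, and the "roller should've been disabled" behavior come from the thread.]

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the low-replication roll logic discussed in HBASE-10142:
// checkLowReplication() and rollWriter() share roll state, so both
// take the same lock to avoid the race described by Ted.
public class LowReplicationRollSketch {
    private final ReentrantLock rollWriterLock = new ReentrantLock();
    // Mirrors hbase.regionserver.hlog.lowreplication.rolllimit (default 5).
    private final int lowReplicationRollLimit = 5;
    private int consecutiveLogRolls = 0;
    private boolean lowReplicationRollEnabled = true;

    // Called on the sync path; serialized with rollWriter() by the lock.
    // Returns true when a log roll should be requested.
    boolean checkLowReplication(int currentReplication, int expectedReplication) {
        rollWriterLock.lock();
        try {
            if (currentReplication < expectedReplication) {
                if (consecutiveLogRolls < lowReplicationRollLimit) {
                    consecutiveLogRolls++;
                    return true; // request a log roll
                }
                // Too many consecutive low-replication rolls: disable the
                // roller, matching the TestLogRolling assertion
                // "LowReplication Roller should've been disabled".
                lowReplicationRollEnabled = false;
            } else {
                consecutiveLogRolls = 0; // replication recovered
            }
            return false;
        } finally {
            rollWriterLock.unlock();
        }
    }

    boolean isLowReplicationRollEnabled() {
        return lowReplicationRollEnabled;
    }

    public static void main(String[] args) {
        LowReplicationRollSketch log = new LowReplicationRollSketch();
        // Simulate a Datanode problem: replication stuck at 1 instead of 3.
        for (int i = 0; i < 6; i++) {
            log.checkLowReplication(1, 3);
        }
        // After exceeding the roll limit, the roller disables itself.
        System.out.println("roller enabled: " + log.isLowReplicationRollEnabled());
        // prints "roller enabled: false"
    }
}
```

The point of the lock is that this state machine is only correct if the check and the roll are serialized; in the unpatched code the `if (!this.logRollRunning)` guard ran without any lock, so a concurrent rollWriter() could interleave with the check.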
