Right, but the 2.78-hour check [threadWakeFrequency (10000 ms) x multiplier (1000)] comes before a full compaction, and from the code it looks like the NPE would have blocked any Store that had no timeRangeTracker from ever getting a major compaction... unless one had been triggered in some other way.
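
To make the arithmetic and the guard concrete, here's a rough sketch of how I read the checker interval and Ryan's fallback (the class and method names below are mine, for illustration, not the actual 0.90 identifiers):

public class CompactionCheckerSketch {
    // hbase.server.thread.wakefrequency default: 10000 ms
    static final long THREAD_WAKE_FREQUENCY_MS = 10000L;
    // MajorCompactionChecker multiplier (dimensionless)
    static final long MULTIPLIER = 1000L;

    public static void main(String[] args) {
        long intervalMs = THREAD_WAKE_FREQUENCY_MS * MULTIPLIER;   // 10,000,000 ms
        double hours = intervalMs / (1000.0 * 60 * 60);
        System.out.printf("check interval: %d ms (~%.2f hours)%n", intervalMs, hours);
        // prints: check interval: 10000000 ms (~2.78 hours)
    }

    // Ryan's patch, roughly: old 0.89x StoreFiles carry no tracker, so fall
    // back to Long.MIN_VALUE instead of throwing an NPE and silently
    // starving the Store of major compactions.
    static long minimumTimestamp(TimeRangeTracker tracker) {
        return (tracker == null) ? Long.MIN_VALUE : tracker.getMinimumTimestamp();
    }

    // Stand-in for the real org.apache.hadoop.hbase.regionserver.TimeRangeTracker.
    interface TimeRangeTracker {
        long getMinimumTimestamp();
    }
}

With the guard in place, an old tracker-less file just looks "infinitely old" to the checker rather than killing the whole loop with an NPE.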
James Kennedy
Project Manager
Troove Inc.

On 2011-02-10, at 11:23 PM, Ryan Rawson wrote:

> we only major compact a region every 24 hours, therefore if it was JUST
> compacted within the last 24 hours we skip it.
>
> this is how it used to work, and how it should still work, not really
> looking at code right now, busy elsewhere :-)
>
> -ryan
>
> On Thu, Feb 10, 2011 at 11:17 PM, James Kennedy
> <[email protected]> wrote:
>> Can you define 'come due'?
>>
>> The NPE occurs at the first isMajorCompaction() test in the main loop of
>> MajorCompactionChecker.
>> That cycle is executed every 2.78 hours.
>> Yet I know that I've kept healthy QA test data up and running for much
>> longer than that.
>>
>> James Kennedy
>> Project Manager
>> Troove Inc.
>>
>> On 2011-02-10, at 10:46 PM, Ryan Rawson wrote:
>>
>>> I am speaking off the hip here, but the major compaction algorithm
>>> attempts to keep the number of major compactions to a minimum by
>>> checking the timestamp of the file. So it's possible that the other
>>> regions just 'didn't come due' yet.
>>>
>>> -ryan
>>>
>>> On Thu, Feb 10, 2011 at 10:42 PM, James Kennedy
>>> <[email protected]> wrote:
>>>> I've tested HBase 0.90 + HBase-trx 0.90.0, and I've run it over old data
>>>> from 0.89x using a variety of seeded unit test/QA data and cluster
>>>> configurations.
>>>>
>>>> But when it came time to upgrade some production data I got snagged on
>>>> HBASE-3524. The gist of it is in Ryan's last points:
>>>>
>>>> * compaction is "optional", meaning if it fails no data is lost, so you
>>>> should probably be fine.
>>>>
>>>> * Older versions of the code did not write out time tracker data, and
>>>> that is why your older files were giving you NPEs.
>>>>
>>>> Makes sense. But why did I not encounter this with my initial data
>>>> upgrades on very similar data packages?
>>>>
>>>> So I applied Ryan's patch, which simply assigns a default value
>>>> (Long.MIN_VALUE) when a StoreFile lacks a timeRangeTracker, and I "fixed"
>>>> the data by forcing major compactions on the affected regions.
>>>> Preliminary poking has not shown any instability in the data since.
>>>>
>>>> But I confess that I just don't have the time right now to really dig into
>>>> the code and validate that there are no more gotchas or data corruption
>>>> that could have resulted.
>>>>
>>>> I guess the questions that I have for the team are:
>>>>
>>>> * What state would 9 out of 50 tables be in to miss the new 0.90.0
>>>> timeRangeTracker injection before the first major compaction check?
>>>> * Where else is the new TimeRangeTracker used? Could a StoreFile with a
>>>> null timeRangeTracker have corrupted the data in other subtler ways?
>>>> * What other upgrade-related data changes might not have completed
>>>> elsewhere?
>>>>
>>>> Thanks,
>>>>
>>>> James Kennedy
>>>> Project Manager
>>>> Troove Inc.
