Thanks Stack for the notes, and thanks everyone for coming.

Cheers,
Enis

On Wed, Sep 12, 2012 at 11:21 AM, Stack <[email protected]> wrote:
> Here are notes from yesterday's contributor's meetup:
> http://www.meetup.com/hbaseusergroup/events/80621872/
>
> The notes below are spotty. I kept forgetting to take them. I also
> was unable to keep up w/ the rate of exchange so in many parts the
> reporting drops or a speaker's nuanced argument is crassly rendered.
>
> We'll hire a stenographer for the next one.
>
> We started at 2:00PM. Meeting lasted till almost 6:00PM
>
> Devaraj Das of HortonWorks, our host, welcomed everyone.
>
> Jimmy Xiang presented on recent changes to Assignment Manager in
> Master: http://files.meetup.com/1350427/assignment-manager.pdf
>
> Good Q&A during the presentation and after included:
>
> + What kind of tests do we have in place for the new AM patches?
> + We should have tests to ensure we don't lose performance
> + Make sure we don't lose operator facility; e.g. ability to override
> assignment state from shell
>
> A general question was posed on what do we all think of the current
> state of the Assignment Manager? Could we make it pluggable (Francis
> Liu is looking at making the AM pluggable so can add "Groups")? Will
> we ever be able to make the AM rock solid?
>
> Discussion followed
>
> It's complex. It's hard making it pluggable? Or how about adding
> support for different kinds of policies?
>
> Jon Hsieh: Rules that were there originally in design, are they being
> followed currently?
>
> Elliott: AM stuff bled into the Master; stuff is bleeding all over.
> Need to clean up Master.
> Enis: Splitting state is split between zk, RS, and HM.... too complex.
> Could AM/Master just do it? Some one entity should be the source of
> truth [for region assignment].
> Lars Hofhansl: We need to write out the state machine and just rewrite
> the AM or hack code to get same result
> Andrew Purtell: We should do both. Testable changes.
> We have gone through a bunch of master rewrites and suspect that
> another rewrite would just land us with a new set of issues.
> Lars: If we had AM state machine, then could [do both rewrite and/or
> patch it to a state of robustness].
> Jon Hsieh: Is Master2 close to its original design? Maybe we should
> do assignment in another way? Maybe Ram[krishna] understands the AM
> but it seems like no one else here really does. Yeah, Ram is the AM
> (he can assign to us how to fix it all).
> Ted Yu: AM should be able to do colocation for secondary indices and
> region grouping.
> Francis Liu: On what region grouping is, if could do different AM,
> could then assign tables to a group, could make it so they don't
> affect another application running in a different group all on a big
> cluster. Same for workloads. Were thinking of doing grouping first,
> then attack multi-tenancy later.
> LarsH: Do you need the whole thing pluggable or do you want to rewrite it?
> Francis: Pluggable would be nice. In the past have subclassed AM to
> add functionality (would like to avoid that).
> JDCryans: Could you build grouping on top of HBase by just disabling
> the balancer and use the move command?
> Elliott: Could you do your own balancer? Would that work?
> Francis: Balancer is given a plan only
> Andrew: Why not do as Karthik suggested in the past, and just run
> multiple hbases?
> Francis: It's too complicated
> Ted: (Said something about explaining Francis's situation)
> Jacques: Seems like you want a placement strategy only? Or do you
> need to change timeouts, etc? Or is it just placement?
> Elliott: Would it work if we added more facility to the balancer?
> Maybe make it more pluggable with more levers?
> Andrew: Yeah, balancer might be way to go for 2ndary indices because
> want to colocate regions.
>
> Enis presented on the new Integration Test patch:
> http://files.meetup.com/1350427/HBase-meetup-HW.pptx
>
> Andrew: Intends to use IT internally.
> Is looking at porting some of his internal use cases; web crawling,
> etc., to use the IT framework.
>
> Enis then presented on HBase on Windows work: See above slides.
>
> There are scripts comparable to the Linux bash scripts to run
> instead for launching HBase on Windows, etc.
>
> Patches are on their way in.
>
> Jesse Yates presented on 2ndary indices:
> http://files.meetup.com/1350427/SecondaryIndexing.pptx
>
> A discussion ensued (Lots of questions and comments during this talk)
>
> ....
> Jon: We need to add a checkAndMutate... to complete our current CAS
> family of methods.
> Lars: Index per region means need to farm out to all regions
> querying... need to do both [this and the index that is non-aligned on
> region boundaries]
> Jon: Cassandra 2ndary indexing does per node... turns into
> optimization on scanning whole table
> Matt Corgan: We have all sorts of indices [at our shop] and all on the
> same table; we write the main table and secondary indices all from the
> client... it seems to keep up. It's hard to predefine types.
> Client-side does... doesn't have to be strongly-consistent in his
> case.. likes it w/o schema... that it's all just kvs. This keeps it
> simple.
> Jesse: What if you lose write part-way through. Analytics use case.
> Matt: Building kvs all in a put and then at read time doing reconciles...
> Lars: Giving tools to make it so you can store floating point as
> sortable bytes...etc....give you building blocks to help you build
> your secondary indices as you need them [rather than prescribe a
> single secondary index soln.]
> Enis: If we have these building blocks, then we could have hive/pig go
> against these codecs
> Lars: Add tools, api and facility to hbase.. you do a bunch of puts, a
> tool that makes sure all applied....Or some way of getting all
> timestamps back for the index puts and then use the returned
> timestamps to write main table. Figuring what little building blocks
> to add to HBase....
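
As a rough illustration of the "floating point as sortable bytes" building block Lars mentions above, here is a minimal sketch in Java. It assumes keys are compared as unsigned bytes in lexicographic order, which is how HBase orders row keys. The class and method names (SortableDouble, encode, decode, compare) are hypothetical and are not an existing HBase API; the bit trick itself is the standard one for IEEE 754 doubles, not something proposed at the meetup.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only; not part of HBase.
final class SortableDouble {

    // Encode a double as 8 big-endian bytes whose unsigned lexicographic
    // order matches the numeric order of the original values.
    static byte[] encode(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative values: flip every bit (reverses their order and puts
        // them below the positives). Non-negative values: flip the sign
        // bit only (puts them above the negatives).
        bits ^= (bits >> 63) | Long.MIN_VALUE;
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Inverse of encode.
    static double decode(byte[] bytes) {
        long bits = ByteBuffer.wrap(bytes).getLong();
        // A set top bit means the original was non-negative: flip the
        // sign bit back. Otherwise flip every bit back.
        bits ^= (~bits >> 63) | Long.MIN_VALUE;
        return Double.longBitsToDouble(bits);
    }

    // Unsigned lexicographic comparison, assumed here to mirror the way
    // the store compares keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With an encoding like this, an index row key built from encode(rating) would let a range scan over the index return entries in numeric order. Two caveats of this particular sketch: -0.0 sorts just below 0.0, and NaN sorts above positive infinity.
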
> Matt: All client side, is that right?
> Jon: What about building indices on qualifiers?
> Elliott: Secondary indices could be built on replication
> where service makes sure all indices are updated
> Lars: Could do fixup at read time.
> Jacques: .... wanted to step back (missed most of his question)....
> wanted to learn more about what are the use cases people have built
> 2ndary indices for.
> Andrew: Scan (?). And then looking up by ip address; point indices.
> Currently does 2ndary indices via bulk uploads loading indices all up
> in one go. Ratings to data points. A rating below 50 is bad. Scan
> to avoid most records. Could be done in batch but would be nice
> online.
> Jesse: Hive case, full table scan.
> Matt: point lookup by ipaddress and get ten records, then on same
> table, scan 1M values in index.. and then do lookup on primary keys...
> says not as bad... does multiget against primary table.
> Jacques: ... range is harder....
> Lars: Denormalize... so enough data in index.
> Andrew: Megastore... denormalized or not..... fields of interest go
> into the index... yes there are consistency issues.
> Lars: highlight that indexes can be idempotent
> Andrew: In our case we could live w/ case where misses record in primary
> table.
> ....
> Jacques: Two options for breaking up data... one namespace has all
> values... or small namespaces.... then consistent vs eventual
> Jon: could use both styles on same table... or even on same column.
> Matt: event table.. has categories associated...
> Jesse: ....
> Matt: ..... does it in the client... likes schemaless-ness of doing it
> all in client ....doesn't want to go to db and make changes there.
> Jacques: MySQL schema changes are supposedly 'hard' but it's actually
> easy... it's the 'business' implications of schema change that are
> hard.
> Andrew: Comparators... is there an indexing and filters overlap? ....
> A union of the filter and indexing Interfaces at some time....
> Lars: So where should we start... build into hbase idempotent
> transactions in a consistent way? Should we do that first? Ship a
> mapreduce job w/ it to populate an index table from the main table....
> seems simple and you can still do all your stuff from client. Region
> pinning?
> Jacques: What does region pinning get you? If 20 data regions and one
> index region, how you lay it out? Will index be fraction or larger
> than the original data? ...region-level approach.... is better.
> Elliott: table-level approach....
> Andrew: constraint-solving in the master...
> Enis: pinning how?.. colocate as much as possible, lots of info?
> Andrew: Do something simple first...
> Enis: How you store the info that stuff belongs together? Pinning
> sounds like a long-shot
> ...
> Enis: Shall we start with some simple use cases? Once we have these,
> then we could add code to satisfy them.
> Jesse: Two or three strawmen... even in the book would be a good first cut.
> Elliott: This would require async, that would require these
> codecs..... do it in book first?
> Jon: Can we do filter/scanner-like interface? What will it look like
> from a programmer's POV? From the strawman kinda case?
> It's hbase style.. you need to figure the right one... but there will
> be an option as opposed to now.
> Jesse: Under hood client will do lookup...
> Jon: Maybe a different scanner, an index scanner rather than a
> scanner.. so different.
> Jesse: Index scanner seems reasonable
> Jon: One to one and one to many lookups would make sense in the Index
> scanner.
> Jacques: Until reasonable design docs on how to implement...... can't do
> API.
> Dave Wang: What use cases do we want to solve?
> Andrew: Sounds like point lookups is the most important one as opposed
> to range considerations... that we should do first.
> Jacques: Ranges vs distinct values..... < 20.
> Konstantin: There is an alternative to indexes... partitioning. Don't
> index, partition.
> (Draws grid, then cubes it -- explaining idea from the "Processing a
> trillion cells per mouse click" paper)
> ....
> Andrew: Start on client then push stuff server-side as needed into CP.
>
> Announcements:
>
> + Jon announced meetup in NYC at appnexus:
> http://www.meetup.com/HBase-NYC/events/81728932/
> + Devaraj said HW are hiring in all areas
> + Jesse says SF hiring.
>
> More general discussion around:
>
> + Getting Jenkins green? What can we do?
>
> + Should there be more friction around committing?
>
> Suggestion that we try making lieutenants responsible for particular
> areas in the code base. These folks would have to +1 anything that
> touches their area. Formalize it in JIRA so these folks are
> auto-assigned issues when their component is chosen.
>
> Lieutenants do not have to be committers, just someone interested in
> an area: e.g. Devaraj volunteered to look after rpc even though not
> committer.
>
> + Should each component have goals or at least a simple design posted;
> could help reviewing patches.
>
> + Who is going to do all the work?
>
> Contributors?
>
> + What are top three problem areas...
> + Contributors tend to work on what is 'cool' or what their customer
> needs fixed. Was suggested that the "mythical volunteer contributor"
> is not seen in the wild, countered by folks reporting actual
> "sightings" of such phantoms.
>
> We all broke up into groups and noshed on pizza and beer.
>
> A good day out was had by all.
>
> Thanks HW for hosting.
>
