Here are notes from yesterday's contributors' meetup: http://www.meetup.com/hbaseusergroup/events/80621872/
The notes below are spotty. I kept forgetting to take them. I also was unable to keep up w/ the rate of exchange, so in many parts the reporting drops or a speaker's nuanced argument is crassly rendered. We'll hire a stenographer for the next one.

We started at 2:00PM. The meeting lasted till almost 6:00PM.

Devaraj Das of HortonWorks, our host, welcomed everyone.

Jimmy Xiang presented on recent changes to the Assignment Manager in the Master: http://files.meetup.com/1350427/assignment-manager.pdf

Good Q&A during the presentation and after included:

+ What kind of tests do we have in place for the new AM patches?
+ We should have tests to ensure we don't lose performance.
+ Make sure we don't lose operator facility; e.g. ability to override assignment state from the shell.

A general question was posed on what we all think of the current state of the Assignment Manager. Could we make it pluggable (Francis Liu is looking at making the AM pluggable so he can add "Groups")? Will we ever be able to make the AM rock solid?

Discussion followed.

It's complex. It's hard making it pluggable? Or how about adding support for different kinds of policies?
Jon Hsieh: Rules that were there originally in the design, are they being followed currently?
Elliott: AM stuff bled into the Master; stuff is bleeding all over. Need to clean up Master.
Enis: Splitting state is split between zk, RS, and HM.... too complex. Could AM/Master just do it? Some one entity should be the source of truth [for region assignment].
Lars Hofhansl: We need to write out the state machine and just rewrite the AM or hack code to get the same result.
Andrew Purtell: We should do both. Testable changes. We have gone through a bunch of master rewrites and suspect that another rewrite would just land us with a new set of issues.
Lars: If we had the AM state machine, then could [do both rewrite and/or patch it to a state of robustness].
Jon Hsieh: Is Master2 close to its original design? Maybe we should do assignment in another way?
Maybe Ram[krishna] understands the AM but it seems like no one else here really does. Yeah, Ram is the AM (he can assign to us how to fix it all).
Ted Yu: AM should be able to do colocation for secondary indices and region grouping.
Francis Liu: On what region grouping is, if could do a different AM, could then assign tables to a group, could make it so they don't affect another application running in a different group, all on a big cluster. Same for workloads. Were thinking of doing grouping first, then attack multi-tenancy later.
LarsH: Do you need the whole thing pluggable or do you want to rewrite it?
Francis: Pluggable would be nice. In the past have subclassed AM to add functionality (would like to avoid that).
JDCryans: Could you build grouping on top of HBase by just disabling the balancer and using the move command?
Elliott: Could you do your own balancer? Would that work?
Francis: Balancer is given a plan only.
Andrew: Why not do as Karthik suggested in the past, and just run multiple hbases?
Francis: It's too complicated.
Ted: (Said something about explaining Francis's situation)
Jacques: Seems like you want a placement strategy only? Or do you need to change timeouts, etc? Or is it just placement?
Elliott: Would it work if we added more facility to the balancer? Maybe make it more pluggable with more levers?
Andrew: Yeah, balancer might be the way to go for 2ndary indices because want to colocate regions.

Enis presented on the new Integration Test patch: http://files.meetup.com/1350427/HBase-meetup-HW.pptx
Andrew: Intends to use IT internally. Is looking at porting some of his internal use cases; web crawling, etc., to use the IT framework.

Enis then presented on the HBase on Windows work: See above slides. There are scripts, comparable to the linux bash scripts, to run instead when launching HBase on Windows, etc. Patches are on their way in.

Jesse Yates presented on 2ndary indices: http://files.meetup.com/1350427/SecondaryIndexing.pptx

A discussion ensued (lots of questions and comments during this talk) ....
Jon: We need to add a checkAndMutate... to complete our current CAS family of methods.
Lars: Index per region means need to farm out to all regions when querying... need to do both [this and the index that is non-aligned on region boundaries]
Jon: Cassandra 2ndary indexing does per node... turns into optimization on scanning whole table
Matt Corgan: We have all sorts of indices [at our shop] and all on the same table; we write the main table and secondary indices all from the client... it seems to keep up. It's hard to predefine types. Client-side does... doesn't have to be strongly-consistent in his case.. likes it w/o schema... that it's all just kvs. This keeps it simple.
Jesse: What if you lose a write part-way through? Analytics use case.
Matt: Building kvs all in a put and then at read time doing reconciles...
Lars: Giving tools to make it so you can store floating point as sortable bytes...etc....give you building blocks to help you build your secondary indices as you need them [rather than prescribe a single secondary index soln.]
Enis: If we have these building blocks, then we could have hive/pig go against these codecs
Lars: Add tools, api and facility to hbase.. you do a bunch of puts, a tool that makes sure all applied....Or some way of getting all timestamps back for the index puts and then use the returned timestamps to write main table. Figuring what little building blocks to add to HBase....
Matt: All client side, is that right?
Jon: What about building indices on qualifiers?
Elliott: Secondary indices could be built on replication, where a service makes sure all indices are updated
Lars: Could do fixup at read time.
Jacques: .... wanted to step back (missed most of his question).... wanted to learn more about what use cases people have built 2ndary indices for.
Andrew: Scan (?). And then looking up by ip address; point indices. Currently does 2ndary indices via bulk uploads, loading indices all up in one go. Ratings to data points. A rating below 50 is bad. Scan to avoid most records. Could be done in batch but would be nice online.
Jesse: Hive case, full table scan.
Matt: point lookup by ipaddress and get ten records, then on same table, scan 1M values in index.. and then do lookup on primary keys... says not as bad... does multiget against primary table.
Jacques: ... range is harder....
Lars: Denormalize... so enough data in index.
Andrew: Megastore... denormalized or not..... fields of interest go into the index... yes there are consistency issues.
Lars: highlight that indexes can be idempotent
Andrew: In our case we could live w/ case where misses record in primary table. ....
Jacques: Two options for breaking up data... one namespace has all values... or small namespaces.... then consistent vs eventual
Jon: could use both styles on same table... or even on same column.
Matt: event table.. has categories associated...
Jesse: ....
Matt: ..... does it in the client... likes schemaless-ness of doing it all in client ....doesn't want to go to db and make changes there.
Jacques: MySQL schema changes are supposedly 'hard' but it's actually easy... it's the 'business' implications of schema change that are hard.
Andrew: Comparators... is there an indexing and filters overlap? .... A union of the filter and indexing Interfaces at some time....
Lars: So where should we start... build into hbase idempotent transactions in a consistent way? Should we do that first? Ship a mapreduce job w/ it to populate an index table from the main table.... seems simple and you can still do all your stuff from client. Region pinning?
Jacques: What does region pinning get you? If 20 data regions and one index region, how you lay it out? Will index be fraction or larger than the original data? ...region-level approach.... is better.
Elliott: table-level approach....
Andrew: constraint-solving in the master...
Enis: pinning how?.. colocate as much as possible, lots of info?
Andrew: Do something simple first...
Enis: How you store the info that stuff belongs together? Pinning sounds like a long-shot ...
Enis: Shall we start with some simple use cases? Once we have these, then we could add code to satisfy them.
Jesse: Two or three strawmen... even in the book would be a good first cut.
Elliott: This would require async, that would require these codecs..... do it in book first?
Jon: Can we do filter/scanner-like interface? What will it look like from a programmer's POV? From the strawman kinda case? It's hbase style.. you need to figure the right one... but there will be an option as opposed to now.
Jesse: Under hood client will do lookup...
Jon: Maybe a different scanner, an index scanner rather than a scanner.. so different.
Jesse: Index scanner seems reasonable
Jon: One to one and one to many lookups would make sense in the Index scanner.
Jacques: Until reasonable design docs on how to implement...... can't do API.
Dave Wang: What use cases do we want to solve?
Andrew: Sounds like point lookups is the most important one as opposed to range considerations... that we should do first.
Jacques: Ranges vs distinct values..... < 20.
Konstantin: There is an alternative to indexes... partitioning. Don't index, partition. (Draws grid, then cubes it -- explaining idea from the "Processing a trillion cells per mouse click" paper) ....
Andrew: Start on client then push stuff server-side as needed into CP.

Announcements:

+ Jon announced meetup in NYC at appnexus: http://www.meetup.com/HBase-NYC/events/81728932/
+ Devaraj said HW are hiring in all areas
+ Jesse says SF hiring.

More general discussion around:

+ Getting Jenkins green? What can we do?
+ Should there be more friction around committing? Suggestion that we try making lieutenants responsible for particular areas in the code base.
These folks would have to +1 anything that touches their area. Formalize it in JIRA so these folks are auto-assigned issues when their component is chosen. Lieutenants do not have to be committers, just someone interested in an area: e.g. Devaraj volunteered to look after rpc even though not a committer.
+ Should each component have goals or at least a simple design posted? Could help reviewing patches.
+ Who is going to do all the work? Contributors?
+ What are top three problem areas...
+ Contributors tend to work on what is 'cool' or what their customer needs fixed. Was suggested that the "mythical volunteer contributor" is not seen in the wild, countered by folks reporting actual "sightings" of such phantoms.

We all broke up into groups and noshed on pizza and beer. A good day out was had by all.

Thanks HW for hosting.
