more food for thought on secondary indexing... *Additional questions*:
- How important is indexing column qualifiers themselves (similar to Cassandra where people frequently utilize column qualifiers as "values" with no actual values stored)? - How important is indexing cell timestamps? *More thoughts/my answers on some of the questions I posed:* - From my experience, indexes should be at the region level (e.g. row-level sharding as opposed to term). Other sharding approaches will likely have scale and consistency problems. - In general it seems like there is tension between the main low level approaches of (1) leverage as much HBase infrastructure as possible (e.g. secondary tables) and (2) leverage an efficient indexing library e.g. Lucene. * * *Approach Thoughts* Trying to leverage HBase as much as possible is hard if we want to utilize the approach above and have consistent indexing. However, I think we can do it if we add support for what I will call a "local shadow family". These are additional, internally managed families for a table. However, they have the special characteristic that they belong to the region despite their primary keys being outside the range of the region's. Otherwise they look like a typical family. On splits, they are regenerated (somehow). If we take advantage of Lars' HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>, we then have the opportunity to consistently insert one or more rows into these local shadow families for the purpose of secondary indexing. The structure of these secondary families could use row keys as the indexed values, qualifiers for specific store files and the value of each being a list of originating keys (using read-append or HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>). By leveraging the existing family infrastructure, we get things like optional in-memory indexes and basic scanners for free and don't have to swallow a big chunk of external indexing code. The simplest approach for integration of these for queries would be internally be a ScannerBasedFilter (a filter that is based on a scanner) and a GroupingScanner (a Scanner that does intersection and/or union of scanners for multi criteria queries). Implementation of these scanners could happen at one of two levels: - StoreScanner level: A more efficient approach using the store file qualifier approach above (this allows easier maintenance of index deletions) - RegionScanner level: A simpler implementation with less violation of existing encapsulation. We'd store row keys in qualifiers instead of values to ensure ordering that works iteratively with RegionScanner. The weaknesses of this approach are less efficient scanning and figuring out how to manage primary value deletes. In general, the best way to deal with deletes is probably to age them out per storefile and just filter "near misses" as a secondary filter that works with ScannerBasedFilter. The client side would be TBD but would probably offer some kind of criteria filters that on server side had all the lower level ramifications. *Future Optimizations* In a perfect world, we'd actually use StoreFile block start locations as the index pointer values in the secondary families. This would make things much more compact and efficient. Especially if we used a smarter block codec that took advantage of this nature. However, this requires quite a bit more work since we'd need to actually use the primary keys in the secondary memstore and then "patch" the values to block locations as we flushed the primary family that we were indexing (ugh). Assuming that the primary limiter of peak write throughput for HBase is typically WAL writing and since indexes have no "real" data, we could consider disabling WAL for local shadow families and simply regenerate this data upon primary WAL playback. I haven't spent enough time in that code to know what kind of consistency pain this would cause (my intuition is it would be fine as long as we didn't fix HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>). If consistency isn't a problem, this would be a nice option since it means that indexing would have minimal impact on peak write throughput. *I haven't thought at all about...* - How/whether this makes sense to be implemented as a coprocessor. - Weird timestamp impacts/considerations here. - Version handling/impacts. On Sun, Sep 9, 2012 at 8:03 PM, Jacques <whs...@gmail.com> wrote: > Some random thoughts/questions bubbling around in my mind regarding > secondary indexes/indices. > > What are the top 5 use cases people are trying to solve? > What solves more of these needs: synchronous 'transactional' or > asynchronous best-effort (or delayed durable) index commit? > Does family level indexing make sense or is the real need for qualifier > level indexing? > What are ideas for a client interface and how transparent is index usage? > (E.g. if you set a filter on a qualifier... ) > How important is supporting multiple simultaneous criteria or would 90% of > uses cases be captured with single criteria support? > How important is value multi-parsing (e.g. a single value can be indexed > to multiple index values: e.g. free text indexing)? > What were the challenges and issues with the proof of concept TrendMicro > approach that ultimately made it untenable? (was an eventually consistent > approach) > What are people's thoughts regarding region-level alternative structure, > secondary table structure, etc? > Is it important to colocate/duplicate indexed values and/or additional > portions of data in secondary indices to minimize disk seeks (almost making > HBase optionally more columnar in nature)? > How important are multi-qualifier indexes? (e.g. when you want to do a > query for all users who are male engineers that have kids) > How important is partial index matching/ range matching (e.g. startswith > and/or between)? > How important is ordering of returned values? (e.g. if you support > startswith or range matching and you do indexing at the region-level, > you'll be able to get back two rows with the same value the are > interspersed with rows of different values) > > These were partially in response to: > http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing > > http://apache-hbase.679495.n3.nabble.com/what-s-the-roadmap-of-secondary-index-of-hbase-td2573618.html > https://issues.apache.org/jira/browse/HBASE-3529 > https://issues.apache.org/jira/browse/HBASE-2038 > https://issues.apache.org/jira/browse/HBASE-3340 > https://github.com/jyates/culvert > > > > > On Sun, Sep 9, 2012 at 3:44 PM, Stack <st...@duboce.net> wrote: > >> On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <jesse.k.ya...@gmail.com> >> wrote: >> > On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote: >> > >> >> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <jesse.k.ya...@gmail.com> >> >> wrote: >> >> > I think we talked about wanting to do secondary indexing as well, as >> >> least >> >> > what that means for HBase (and maybe some of the _how_ it would work >> >> too). >> >> > >> >> >> >> Mind leading it Jesse? You have the necessary qualifications (smile). >> >> Would suggest you make include rehearsal of points made by Andrew >> >> Purtell and LarsH in the most recent thread on 2ndary indexes. >> >> >> >> >> > ....ok, I can do that :) >> >> Adding you to the list... Thanks J, >> St.Ack >> > >