[ https://issues.apache.org/jira/browse/HBASE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-2001:
----------------------------------

    Attachment: HBASE-2001.patch.gz

Latest patch contains simple working unit tests for basic Coprocessor hooks and also RegionObserver interface hooks.

It also contains the initial implementation of an in-process MapReduce framework. Coprocessors can optionally implement a 'MapReduce' interface which clients will at some point be able to invoke concurrently on all regions of the table within the HRS processes. (Server side needs unit tests and testing; no client side yet.) Note this is not MapReduce on the table; this is MapReduce on each region, concurrently.

In-process MapReduce is multithreaded. Concurrency of mappers and reducers is specified separately. Map jobs are submitted with a Scan object which defines the scope and any filters for a scanner which feeds mappers. Mappers can emit intermediate KeyValues to a collector for reduction or can get references to objects in the coprocessor's environment and perform operations on them, e.g. increment an AtomicLong. Reducers will get KeyValues from map phase output ordered and grouped by key. Reducers also have access to objects in the coprocessor environment. Therefore one can implement MapReduce in a manner very similar to Hadoop's MR framework, or e.g. aggregating functions can use shared variables to avoid the overhead of generating (and processing) a lot of intermediates.

An in-process MapReduce job can be configured to auto commit. If so, KeyValues written to the reduce collector by reducers will be automatically committed back to the region after all reducers have completed execution. No values are committed to the region until all mappers and reducers have successfully completed execution. Then, we try really hard to commit them all. KeyValues emitted by reducers must have a row key that falls within the bounds of the region if the job is auto committing. Otherwise, the output can be arbitrary.

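To make the map/collect/reduce flow described above concrete, here is a minimal sketch in plain Java. It is not the code in the attached patch: `RegionMapReduceSketch`, `Collector`, `Mapper`, `Reducer`, and `run` are hypothetical names, rows are plain strings standing in for scanner output, and the coprocessor environment is modeled as a simple map of `AtomicLong` counters. It only illustrates separately sized mapper and reducer pools, intermediates grouped and ordered by key, shared state in the environment, and output that is held back until every reducer has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch only; none of these types come from the HBASE-2001 patch. */
public class RegionMapReduceSketch {

    /** Receives key/value pairs emitted by a mapper or a reducer. */
    interface Collector {
        void collect(String key, long value);
    }

    /** Called once per scanned row; may emit intermediates or touch shared state in env. */
    interface Mapper {
        void map(String row, Collector out, Map<String, AtomicLong> env);
    }

    /** Called once per distinct intermediate key with all values grouped under it. */
    interface Reducer {
        void reduce(String key, List<Long> values, Collector out, Map<String, AtomicLong> env);
    }

    static Map<String, Long> run(List<String> rows, Mapper mapper, Reducer reducer,
                                 int mapThreads, int reduceThreads,
                                 Map<String, AtomicLong> env) {
        // Intermediates are grouped by key; the skip list keeps the keys sorted.
        ConcurrentSkipListMap<String, List<Long>> intermediate = new ConcurrentSkipListMap<>();
        Collector mapOut = (k, v) ->
                intermediate.computeIfAbsent(k, x -> new CopyOnWriteArrayList<>()).add(v);

        List<Callable<Void>> mapTasks = new ArrayList<>();
        for (String row : rows) {
            mapTasks.add(() -> { mapper.map(row, mapOut, env); return null; });
        }
        runAll(mapTasks, mapThreads);

        // Reducer output is buffered and only returned once every reducer has
        // finished, mirroring the "nothing is committed until all complete" rule.
        ConcurrentSkipListMap<String, Long> output = new ConcurrentSkipListMap<>();
        Collector reduceOut = (k, v) -> output.put(k, v);
        List<Callable<Void>> reduceTasks = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : intermediate.entrySet()) {
            reduceTasks.add(() -> {
                reducer.reduce(e.getKey(), e.getValue(), reduceOut, env);
                return null;
            });
        }
        runAll(reduceTasks, reduceThreads);
        return output;
    }

    /** Runs all tasks on a fixed-size pool and fails if any task failed. */
    private static void runAll(List<Callable<Void>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get(); // surfaces the failure of any mapper or reducer
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("job interrupted", ie);
        } catch (ExecutionException ee) {
            throw new IllegalStateException("task failed", ee.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, AtomicLong> env = new HashMap<>();
        env.put("rowsSeen", new AtomicLong());
        Map<String, Long> out = run(
                Arrays.asList("a", "b", "a"),
                // Mapper: emit (row, 1) and also bump a shared counter in the environment.
                (row, c, e) -> { c.collect(row, 1L); e.get("rowsSeen").incrementAndGet(); },
                // Reducer: sum the values grouped under each key.
                (key, values, c, e) -> { long s = 0; for (long v : values) s += v; c.collect(key, s); },
                2, 2, env);
        System.out.println(out + " rowsSeen=" + env.get("rowsSeen")); // {a=2, b=1} rowsSeen=3
    }
}
```

The `rowsSeen` counter shows the shared-variable style of aggregation mentioned above: it produces a result without emitting any intermediates at all, whereas the per-key sums go through the collector in the Hadoop-like style.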
If a job is not auto committing, when it completes clients have access to the KeyValues output by the reducer via a scanner-like interface.

The in-process MapReduce framework uses leases. A job is only alive as long as it has a lease. Its output KeyValues are only available as long as it has a lease. So for long running jobs the client must periodically poll status to keep the job alive, and then retrieval by "scanner" will also renew the lease. A lease cannot expire during auto commit.

> Coprocessors: Colocate user code with regions
> ---------------------------------------------
>
>                 Key: HBASE-2001
>                 URL: https://issues.apache.org/jira/browse/HBASE-2001
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>         Attachments: asm-3.2-bin.zip, asm-transformations.pdf, HBASE-2001.patch.gz
>
>
> Support user code that runs next to each region in a table. As regions split and move, coprocessor code should automatically move also.
> Use classloader which looks on HDFS.
> Associate a list of classes to load with each table. Put this in HRI so it inherits from table but can be changed on a per region basis (so then those region specific changes can be inherited by daughters).
> Not completely arbitrary code, should require implementation of an interface with callbacks for:
> * Open
> * Close
> * Split
> * Compact
> * (Multi)get and scanner next()
> * (Multi)put
> * (Multi)delete
> Add method to HRegionInterface for invoking coprocessor methods and retrieving results.
> Add methods in o.a.h.h.regionserver or subpackage which implement convenience functions for coprocessor methods and consistent/controlled access to internals: store access, threading, persistent and ephemeral state, scratch storage, etc.
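
The callback list in the quoted issue description could be sketched roughly as follows. This is an assumption-laden illustration, not the interface in the patch: `CoprocessorCallbackSketch`, `RegionCallbacks`, `RegionHost`, and `Recorder` are hypothetical names, rows are plain strings, and the host is a stand-in for a region that invokes every loaded coprocessor at each lifecycle or data event. The `split()` method shows one possible reading of "inherited by daughters": the daughter starts with the parent's list of loaded coprocessors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; names are illustrative, not the HBASE-2001 API. */
public class CoprocessorCallbackSketch {

    /** One callback per event named in the issue; defaults let user code override a subset. */
    interface RegionCallbacks {
        default void onOpen() {}
        default void onClose() {}
        default void onSplit() {}
        default void onCompact() {}
        default void onGet(String row) {}
        default void onPut(String row) {}
        default void onDelete(String row) {}
    }

    /** Stand-in for a region: dispatches each event to every loaded coprocessor. */
    static class RegionHost {
        private final List<RegionCallbacks> loaded = new ArrayList<>();

        void load(RegionCallbacks cp) { loaded.add(cp); }
        void open() { for (RegionCallbacks cp : loaded) cp.onOpen(); }
        void close() { for (RegionCallbacks cp : loaded) cp.onClose(); }
        void compact() { for (RegionCallbacks cp : loaded) cp.onCompact(); }
        void get(String row) { for (RegionCallbacks cp : loaded) cp.onGet(row); }
        void put(String row) { for (RegionCallbacks cp : loaded) cp.onPut(row); }
        void delete(String row) { for (RegionCallbacks cp : loaded) cp.onDelete(row); }

        /** Notifies coprocessors of the split and returns a daughter with the same list. */
        RegionHost split() {
            for (RegionCallbacks cp : loaded) cp.onSplit();
            RegionHost daughter = new RegionHost();
            daughter.loaded.addAll(loaded);
            return daughter;
        }
    }

    /** Example coprocessor that records which callbacks fired. */
    static class Recorder implements RegionCallbacks {
        final List<String> events = new ArrayList<>();
        @Override public void onOpen() { events.add("open"); }
        @Override public void onClose() { events.add("close"); }
        @Override public void onSplit() { events.add("split"); }
        @Override public void onPut(String row) { events.add("put:" + row); }
    }

    public static void main(String[] args) {
        Recorder rec = new Recorder();
        RegionHost host = new RegionHost();
        host.load(rec);
        host.open();
        host.put("r1");
        RegionHost daughter = host.split();
        daughter.open(); // the daughter inherited the recorder
        host.close();
        System.out.println(rec.events); // [open, put:r1, split, open, close]
    }
}
```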