Here's a setup I've used:
- configuration data distributed to the mappers / reducers using the
JobConf object (first sketch after this list)
- BDBs (stored in ZIP packages on the HDFS) used for read/write data
across stages. The data flow is organized so that a single mapper
modifies a single database per stage, which avoids concurrency issues
(second sketch after this list).
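Roughly what the JobConf approach looks like, as a minimal sketch
against the classic org.apache.hadoop.mapred API (the property name
"myapp.batch.id" is made up for illustration). The driver sets the
value once at submission time, and every task reads it back in
configure():

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ConfAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String batchId;

    // Called once per task; the value set on the JobConf at submission
    // time (e.g. conf.set("myapp.batch.id", "batch-42")) shows up here
    // in every mapper and reducer.
    public void configure(JobConf job) {
        batchId = job.get("myapp.batch.id", "default");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Tag each record with the distributed configuration value.
        output.collect(new Text(batchId), value);
    }
}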
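For the BDB piece, here is a hedged sketch using the Berkeley DB Java
Edition API (com.sleepycat.je); the database name is hypothetical, and
the unzip / re-zip steps to and from the HDFS are elided:

import java.io.File;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class LocalBdbStore {
    private final Environment env;
    private final Database db;

    // localDir is the directory the mapper unzipped from the HDFS.
    public LocalBdbStore(File localDir) throws DatabaseException {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        env = new Environment(localDir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "stage-data", dbConfig);
    }

    public void put(String key, String value) throws DatabaseException {
        db.put(null, new DatabaseEntry(key.getBytes()),
               new DatabaseEntry(value.getBytes()));
    }

    public String get(String key) throws DatabaseException {
        DatabaseEntry found = new DatabaseEntry();
        OperationStatus status = db.get(null,
                new DatabaseEntry(key.getBytes()), found, LockMode.DEFAULT);
        return status == OperationStatus.SUCCESS
                ? new String(found.getData()) : null;
    }

    // Close cleanly before zipping the directory back onto the HDFS.
    public void close() throws DatabaseException {
        db.close();
        env.close();
    }
}

Since only one mapper ever opens a given environment per stage, no
transactions or cross-process locking are needed here.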
The concurrency of the shared read/write data will affect the choice of
storage. In general you get better performance by keeping as much data
local as possible and then distributing it (e.g. storing it on the
HDFS) at the end of the mapper job. If you need all mappers to share
the same data at once, then a technology like memcached seems like a
good approach; there's a sketch of that below.
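If you go the memcached route, the spymemcached client is one option; a
small sketch (the host, port, and key names are just placeholders):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SharedValueDemo {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("memcache-host", 11211));

        // One mapper writes (set is async, so block until it lands)...
        client.set("stage1/maxSeen", 3600, "42").get();

        // ...and every other mapper sees the same value right away.
        String maxSeen = (String) client.get("stage1/maxSeen");
        System.out.println("maxSeen = " + maxSeen);

        client.shutdown();
    }
}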
Chandraprakash Bhagtani wrote:
If you really want to share read/write data you can use memcached server or
file based database like Tokyocabinet or BDB