Here's a setup I've used:
- configuration data distributed to the mappers/reducers using the JobConf object (a minimal sketch follows below)
- BDBs (stored in ZIP packages on HDFS) used for read/write data across stages; the data flow is organized so that a single mapper modifies a single database per stage, to avoid concurrency issues
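
For the JobConf part, here's a minimal sketch using the old org.apache.hadoop.mapred API. The key name "myapp.bdb.archive" and the paths are only illustrative:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyJob {

  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String bdbArchive;

    @Override
    public void configure(JobConf conf) {
      // Pull back the value the driver stashed in the JobConf
      bdbArchive = conf.get("myapp.bdb.archive");
    }

    @Override
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... open the local copy of the BDB named by bdbArchive,
      // read/write, emit results ...
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(MyJob.class);
    job.setJobName("bdb-stage");
    // Small configuration values set here are visible to every task
    job.set("myapp.bdb.archive", "/data/stage1/bdb.zip");
    job.setMapperClass(MyMapper.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}

Note that JobConf is only suitable for small configuration values; anything sizeable should be shipped via HDFS or the DistributedCache instead.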

The concurrency requirements of the shared read/write data will affect the storage choice. In general you get better performance by keeping as much data as possible local to the mapper and then distributing it (e.g. storing it on HDFS) at the end of the mapper job. If you need all mappers to share the same data at once, then a technology like memcached seems like a good approach.
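
As a rough sketch of the memcached approach, assuming the spymemcached client library (the host, port, and key names below are placeholders):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SharedDataExample {
  public static void main(String[] args) throws Exception {
    // Connect to a memcached node that all mappers can reach
    MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("memcache-host", 11211));

    // Publish a value with a 3600s expiry; any mapper can read it back
    client.set("myapp/stage1/model-version", 3600, "v42");
    String version = (String) client.get("myapp/stage1/model-version");
    System.out.println("shared value: " + version);

    client.shutdown();
  }
}

The trade-off is a network round trip per lookup, so this only pays off when the mappers genuinely need a shared, mutable view of the data.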

Chandraprakash Bhagtani wrote:
If you really want to share read/write data you can use a memcached server or
a file-based database like Tokyo Cabinet or BDB


