Hi Andrzej,

Let me describe my scenario. I had 160 GB of input data to process according to some business logic, which relied on 1.2 GB of metadata held as in-memory hash maps. The problem was that Hadoop doesn't provide any shared storage among parallel tasks, so I had to keep that 1.2 GB of metadata in memory in every running task. This forced me to run fewer parallel tasks.
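Roughly, each task was doing something like the sketch below (not my actual code; the metadata path and class names are made up, and it assumes the Hadoop 0.20 mapreduce API). The point is that every task JVM builds its own copy of the hash map in setup(), so heap usage multiplies with the number of concurrent tasks per node.

// Hypothetical sketch of the original in-memory approach.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryMetadataMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> metadata = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // Every task reads the full ~1.2GB metadata file from HDFS into its own heap.
        Path metaPath = new Path("/metadata/lookup.tsv");   // hypothetical path
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(metaPath)));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                metadata.put(parts[0], parts[1]);
            }
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Business logic looks up metadata for each input record.
        String meta = metadata.get(value.toString());
        if (meta != null) {
            context.write(value, new Text(meta));
        }
    }
}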
As a solution to this problem I first used Memcached and then Tokyo Cabinet to hold only the metadata. The input and output data were stored in HDFS. While processing, any task (map or reduce) could fetch metadata from Tokyo Cabinet. I was not running Tokyo Tyrant; I used the TC Java API (a rough sketch of that lookup is included after the quoted message below).

On Mon, Oct 5, 2009 at 7:44 PM, Andrzej Jan Taramina <andr...@chaeron.com> wrote:

> Chandraprakash:
>
> Thanks for the info! Great stuff, but it's led to a few more questions...
>
> > I had run a big mapred job (160GB data) on a small cluster of 7 nodes. I
> > had started 15 Memcached server instances on 7 nodes and I noticed that
> > a single memcached server was processing 1 million requests per second
> > (in my case), however it was definitely 3-4 times slower than the
> > in-memory approach. I had to increase the limit of open file descriptors
> > for that.
>
> Was the input for the mapred job coming from Tokyo Cabinet, or were you
> just writing the results of the mapred to TC?
>
> If you were using TC for input to mapred, how did you do the Input Splits?
> Did you write a custom splitter for Tokyo Cabinet?
>
> > Since Memcached was not performing up to my expectations I used
> > Tokyo Cabinet (a file-based database) and its performance was close to
> > the in-memory approach.
>
> Were you using Tokyo Cabinet over the network, that is, using Tokyo Tyrant?
> Or were you running and accessing a local TC process?
>
> Thanks for shedding some light on these additional questions...
>
> --
> Andrzej Taramina
> Chaeron Corporation: Enterprise System Solutions
> http://www.chaeron.com

--
Thanks & Regards,
Chandra Prakash Bhagtani,
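P.S. Here is roughly what the Tokyo Cabinet version of the mapper looked like. This is only a sketch, not my production code: the database path is hypothetical (in practice the .tch file has to be available locally on every node, e.g. distributed before the job runs), error handling is minimal, and it again assumes the Hadoop 0.20 mapreduce API plus the TC Java binding (tokyocabinet.HDB).

// Hypothetical sketch: metadata lookups go to a local, read-only Tokyo Cabinet
// hash database instead of a per-task in-memory HashMap.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import tokyocabinet.HDB;

public class TokyoCabinetMetadataMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private HDB metadata;

    @Override
    protected void setup(Context context) throws IOException {
        // Open the metadata database read-only; tasks on the same node share
        // the one file through the OS page cache instead of duplicating it on heap.
        metadata = new HDB();
        if (!metadata.open("/local/path/metadata.tch", HDB.OREADER)) {   // hypothetical path
            throw new IOException("could not open metadata db, ecode=" + metadata.ecode());
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Fetch the metadata for this record from Tokyo Cabinet.
        String meta = metadata.get(value.toString());
        if (meta != null) {
            context.write(value, new Text(meta));
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (metadata != null) {
            metadata.close();
        }
    }
}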