I like this, Shi! Very clever!

On Wed, Jun 15, 2011 at 4:36 PM, Shi Yu <[email protected]> wrote:
> Suppose you are looking up a value V for a key K, and V is required by an
> upcoming process. Suppose the data in the upcoming process has the form
>
>   R1 K1 K2 K3,
>
> where R1 is the record number and K1 to K3 are the keys occurring in the
> record, which means in the lookup case you would query for V1, V2, and V3.
>
> Using an inner join you could attach all the V values for a single record
> and prepare the data like
>
>   R1 K1 K2 K3 V1 V2 V3
>
> so that each record has the complete information for the next process. You
> pay in storage for the efficiency. Even taking into account the time
> required for preparing the data, it is still faster than the lookup
> approach.
>
> I have also tried TokyoCabinet; you need to compile and install some
> extensions to get it working, and sometimes getting things and APIs to work
> can be painful. If you don't need to update the lookup table, installing
> TC, Memcached, or MongoDB locally on each node would be the most efficient
> solution because all the lookups are local.
>
> On 6/15/2011 5:56 PM, Ian Upright wrote:
>> If the data set doesn't fit in working memory, but is still of a
>> reasonable size (let's say a few hundred gigabytes), then I'd probably use
>> something like this:
>>
>> http://fallabs.com/tokyocabinet/
>>
>> From reading the Hadoop docs (which I'm very new to), I might use
>> DistributedCache to replicate that database around. My impression is that
>> this would be among the most efficient things one could do.
>>
>> However, for my particular application, even using TokyoCabinet introduces
>> too much inefficiency, and plain old in-memory lookups are by far the most
>> efficient. (Not to mention that some of the lookups I'm doing are
>> specialized trees that can't be done with TokyoCabinet or any typical DB,
>> but that's beside the point.)
>>
>> I'm having trouble understanding your method of using more data and HDFS,
>> and how it could possibly be more efficient than the approach above.
>>
>> How does increasing the data size minimize the lookups?
>>
>> Ian
>>
>>> I had the same problem before: a big lookup table too large to load into
>>> memory.
>>>
>>> I tried and compared the following approaches: an in-memory MySQL DB, a
>>> dedicated central Memcached server, a dedicated central MongoDB server,
>>> and a local DB model (each node has its own MongoDB server).
>>>
>>> The local DB model is the most efficient one. I believe the dedicated
>>> server approach could be improved if the number of servers were increased
>>> and the load distributed; I only tried a single server.
>>>
>>> But later I dropped the lookup-table approach altogether. Instead, I
>>> attached the table information to the data in HDFS (which can be thought
>>> of as an inner-join step), which significantly increases the size of the
>>> data sets but avoids the bottleneck of table lookups. There is a
>>> trade-off: with no table lookups, the data to process is large (TB
>>> scale), whereas a lookup table could save 90% of the data storage.
>>>
>>> According to our experiments on a 30-node cluster, attaching the
>>> information in HDFS is even 20% faster than the local DB model. When
>>> attaching information in HDFS, it is also easier to ping-pong the
>>> Map/Reduce configuration to further improve efficiency.
>>>
>>> Shi
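For anyone wanting to try Shi's approach, here is a rough sketch of the
preparation step as a map-only Hadoop job (0.20-era API) that uses
DistributedCache to put a copy of the lookup table on every node and writes
out the denormalized "R1 K1 K2 K3 V1 V2 V3" records. The file names and
record formats are made up for illustration, and it assumes the per-node
copy of the table fits in a HashMap; if it doesn't, the HashMap would be
replaced by a handle to a local TokyoCabinet/MongoDB store, as Shi describes.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PrepareRecords {

  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
      // Each node reads its local copy of the lookup table ("K<TAB>V" lines)
      // that was shipped to it via DistributedCache.
      Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t", 2);
        lookup.put(kv[0], kv[1]);
      }
      in.close();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // Input record: "R1 K1 K2 K3"; output: "R1 K1 K2 K3 V1 V2 V3".
      String[] tokens = record.toString().split(" ");
      StringBuilder out = new StringBuilder(record.toString());
      for (int i = 1; i < tokens.length; i++) {   // tokens[0] is the record id
        out.append(' ').append(lookup.get(tokens[i]));
      }
      context.write(new Text(out.toString()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical HDFS path of the lookup table to replicate to each node.
    DistributedCache.addCacheFile(new URI("/cache/lookup.tsv"), conf);
    Job job = new Job(conf, "prepare-records");
    job.setJarByClass(PrepareRecords.class);
    job.setMapperClass(JoinMapper.class);
    job.setNumReduceTasks(0);               // map-only: just rewrite the records
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once this job has run, downstream jobs read the enriched records and never
hit a lookup service, which is exactly the storage-for-lookup-time trade Shi
mentions.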
>>> On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:
>>>> Is the lookup table constant across each of the tasks? You could try
>>>> putting it into memcached:
>>>>
>>>> http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
>>>>
>>>> Matt
>>>>
>>>> -----Original Message-----
>>>> From: Ian Upright [mailto:[email protected]]
>>>> Sent: Wednesday, June 15, 2011 3:42 PM
>>>> To: [email protected]
>>>> Subject: large memory tasks
>>>>
>>>> Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
>>>> something.
>>>>
>>>> Let's say I have a task that requires 16 GB of memory in order to
>>>> execute. Let's say, hypothetically, it's some sort of big lookup table
>>>> that needs that kind of memory.
>>>>
>>>> I could have 8 cores run the task in parallel (multithreaded), and all 8
>>>> cores can share that 16 GB lookup table.
>>>>
>>>> On another machine, I could have 4 cores run the same task, and they
>>>> still share that same 16 GB lookup table.
>>>>
>>>> Now, with my understanding of Hadoop, each task has its own memory.
>>>>
>>>> So if I have 4 tasks that run on one machine and 8 tasks on another,
>>>> then the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB
>>>> machine. But really, let's say I only have two machines, one with 4
>>>> cores and one with 8, each machine having only 24 GB.
>>>>
>>>> How can the work be evenly distributed among these machines? Am I
>>>> missing something? What other ways can this be configured so that this
>>>> works properly?
>>>>
>>>> Thanks, Ian
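On Ian's original question about sharing one 16 GB table across all the
cores of a node: one option (rough sketch below, not something I have
benchmarked) is to combine MultithreadedMapper with JVM reuse, so that a
single task JVM runs several map threads that all share one static copy of
the table. The table representation, load path, and record format below are
placeholders; the real structure would be whatever specialized tree Ian
actually uses.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedTableJob {

  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    // One copy per task JVM, shared by every map thread in that JVM. A
    // HashMap stands in for whatever specialized structure holds the table.
    private static Map<String, String> table;

    @Override
    protected void setup(Context context) throws IOException {
      synchronized (LookupMapper.class) {
        if (table == null) {
          table = loadTable("/local/path/lookup.tsv");  // hypothetical local file
        }
      }
    }

    private static Map<String, String> loadTable(String path) throws IOException {
      Map<String, String> m = new HashMap<String, String>();
      BufferedReader in = new BufferedReader(new FileReader(path));
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t", 2);
        m.put(kv[0], kv[1]);
      }
      in.close();
      return m;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Read-only lookups are safe to share across the map threads.
      String key = line.toString().split("\t")[0];
      String value = table.get(key);
      if (value != null) {
        context.write(new Text(key), new Text(value));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reuse task JVMs so the table is loaded once per JVM and kept across
    // the tasks that JVM runs, instead of being reloaded for every task.
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
    Job job = new Job(conf, "shared-lookup");
    job.setJarByClass(SharedTableJob.class);
    // MultithreadedMapper runs several map threads inside one task JVM,
    // so all of them see the same static table.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, LookupMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The number of map threads can be set to match the core count of each
machine, so the 4-core and 8-core boxes each keep a single copy of the table
while still using all their cores.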
