I like this, Shi!  Very clever!

On Wed, Jun 15, 2011 at 4:36 PM, Shi Yu <[email protected]> wrote:

> Suppose you are looking up a value V for a key K, and V is required by an
> upcoming process. Suppose the data in that process has the form
>
> R1  K1 K2 K3,
>
> where R1 is the record number and K1 to K3 are the keys occurring in the
> record, so in the look-up case you would query for V1, V2, and V3.
>
> Using an inner join, you could attach all the V values for a single record
> and prepare the data as
>
> R1 K1 K2 K3 V1 V2 V3
>
> so that each record carries the complete information for the next process.
> You pay in storage for the efficiency. Even taking into account the time
> required to prepare the data, it is still faster than the look-up
> approach.
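Shi's join step, sketched in plain Python. This is only a toy illustration of the idea, not his actual pipeline; the names `lookup` and `records` and the sample data are made up:

```python
# Hypothetical lookup table: key -> value.
lookup = {"K1": "V1", "K2": "V2", "K3": "V3"}

# Hypothetical input records: (record_id, [keys occurring in the record]).
records = [("R1", ["K1", "K2", "K3"])]

def denormalize(records, lookup):
    """Inner join: attach the value for every key, so each output
    record carries complete information for the next process."""
    for rid, keys in records:
        if all(k in lookup for k in keys):  # inner-join semantics
            yield [rid] + keys + [lookup[k] for k in keys]

joined = list(denormalize(records, lookup))
# joined[0] == ["R1", "K1", "K2", "K3", "V1", "V2", "V3"]
```

Each output record is wider (keys *and* values), which is exactly the storage-for-speed trade Shi describes.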
>
> I have also tried Tokyo Cabinet; you need to compile and install some
> extensions to get it working, and getting the APIs to work can sometimes be
> painful. If you don't need to update the lookup table, installing TC,
> Memcached, or MongoDB locally on each node would be the most efficient
> solution, because all the look-ups are local.
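The node-local, read-only lookup pattern Shi describes can be sketched with Python's built-in sqlite3 standing in for whichever store (TC, Memcached, MongoDB) is actually installed on the node; the schema and data here are invented for the sketch:

```python
import sqlite3

# Stand-in for a node-local, read-only store. In a real deployment this
# would be a file on the node's local disk, built once before the job;
# ":memory:" is used here only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)",
                 [("K1", "V1"), ("K2", "V2")])

def local_lookup(key):
    """Each lookup hits the node-local store: no network round trip."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None
```

The point is where the store lives, not which store it is: every task's lookups stay on-box.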
>
>
> On 6/15/2011 5:56 PM, Ian Upright wrote:
>
>> If the data set doesn't fit in working memory, but is still of a
>> reasonable size (let's say a few hundred gigabytes), then I'd probably use
>> something like this:
>>
>> http://fallabs.com/tokyocabinet/
>>
>>  From reading the Hadoop docs (which I'm very new to), I might use
>> DistributedCache to replicate that database around.  My impression is
>> that this would be among the most efficient things one could do.
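For what it's worth, the DistributedCache pattern boils down to: ship the lookup file alongside each task, load it once at task start, then resolve every record in memory. A Streaming-flavoured Python sketch; the file name `lookup.tsv` and the tab-separated format are assumptions, not anything the docs mandate:

```python
def load_lookup(path):
    """Load the side file that DistributedCache (or Streaming's -files
    option) places in the task's working directory: one tab-separated
    key<TAB>value pair per line (format assumed for this sketch)."""
    table = {}
    with open(path) as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            table[k] = v
    return table

def mapper(lines, table):
    """Streaming-style mapper: the table is loaded once per task, so
    every record is resolved in memory with no per-record I/O."""
    for line in lines:
        key = line.strip()
        if key in table:
            yield key + "\t" + table[key]

# In a real streaming job this would run roughly as:
#   table = load_lookup("lookup.tsv")  # shipped via -files / DistributedCache
#   for out in mapper(sys.stdin, table): print(out)
```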
>>
>> However, for my particular application, even using Tokyo Cabinet
>> introduces too much inefficiency, and plain old in-memory lookups are by
>> far the most efficient.  (Not to mention that some of the lookups I'm
>> doing use specialized trees that can't be done with Tokyo Cabinet or any
>> typical DB, but that's beside the point.)
>>
>> I'm having trouble understanding your more efficient method of using more
>> data and HDFS, and how it could possibly be more efficient than, say, the
>> approach above.
>>
>> How does increasing the data size minimize the lookups?
>>
>> Ian
>>
>>> I had the same problem before: a big lookup table, too large to load in
>>> memory.
>>>
>>> I tried and compared the following approaches: an in-memory MySQL DB, a
>>> dedicated central Memcached server, a dedicated central MongoDB server,
>>> and a local DB model (each node has its own MongoDB server).
>>>
>>> The local DB model is the most efficient one.  I believe the dedicated
>>> server approach could improve if the number of servers were increased
>>> and distributed; I only tried a single server.
>>>
>>> But later I dropped the lookup-table approach. Instead, I attached the
>>> table information in HDFS (which could be considered an inner-join DB
>>> process). This significantly increases the size of the data sets but
>>> avoids the bottleneck of table lookups. There is a trade-off: with no
>>> table lookups, the data to process is massive (TB scale), whereas a
>>> lookup table could save 90% of the data storage.
>>>
>>> According to our experiments on a 30-node cluster, attaching the
>>> information in HDFS is even 20% faster than the local DB model. When
>>> attaching information in HDFS, it is also easier to ping-pong the
>>> Map/Reduce configuration to further improve efficiency.
>>>
>>> Shi
>>>
>>> On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:
>>>
>>>> Is the lookup table constant across each of the tasks? You could try
>>>> putting it into memcached:
>>>>
>>>> http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
>>>>
>>>> Matt
>>>>
>>>> -----Original Message-----
>>>> From: Ian Upright [mailto:[email protected]]
>>>> Sent: Wednesday, June 15, 2011 3:42 PM
>>>> To: [email protected]
>>>> Subject: large memory tasks
>>>>
>>>> Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
>>>> something.
>>>>
>>>> Let's say I have a task that requires 16 GB of memory in order to
>>>> execute. Let's say, hypothetically, it's some sort of big lookup table
>>>> that needs that kind of memory.
>>>>
>>>> I could have 8 cores run the task in parallel (multithreaded), and all
>>>> 8 cores can share that 16 GB lookup table.
>>>>
>>>> On another machine, I could have 4 cores run the same task, and they
>>>> still share that same 16 GB lookup table.
>>>>
>>>> Now, with my understanding of Hadoop, each task has its own memory.
>>>>
>>>> So if I have 4 tasks that run on one machine and 8 tasks on another,
>>>> then the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB
>>>> machine. But really, let's say I only have two machines, one with 4
>>>> cores and one with 8, each machine having only 24 GB.
>>>>
>>>> How can the work be evenly distributed among these machines?  Am I
>>>> missing something?  How else can this be configured so that it works
>>>> properly?
>>>>
>>>> Thanks, Ian
>
>
