Hi All, I am trying to figure out a good solution for the following scenario.
1. I have a 2T file (let's call it A) filled with key/value pairs, stored in HDFS with the default 64M block size. In A, each key is less than 1K and each value is about 20M.
2. Occasionally, I run an analysis using a different type of data (usually less than 10G; let's call it B) and do lookup-table-like operations against the values in A. B resides in HDFS as well.
3. The analysis requires loading only a small number of values from A (usually fewer than 1000) into memory for fast lookup against the data in B. B finds those values by looking up their keys in A.

Is there an efficient way to do this? I was thinking that if I could identify the locality of the blocks that contain those few values, I might be able to push B onto the few nodes that hold them (a rough sketch of that idea is in the P.S. below). Since I only need to do this occasionally, maintaining a distributed database such as HBase can't be justified.

Many thanks.
Cao
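
P.S. To make the locality idea a bit more concrete, below is a rough, untested sketch of what I had in mind. It assumes I already know the byte offset and length of each needed value inside A (say, from a small side index I would have to build first; that index is an assumption on my part). The sketch just asks, via the FileSystem API, which hosts hold the blocks covering a given value, so B could then be pushed to those nodes:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ValueLocality {
    // Print the datanodes hosting the block(s) that cover the byte range
    // [offset, offset + length) of file A. offset/length for each needed
    // value are assumed to come from a small side index built beforehand.
    public static void printHosts(Configuration conf, Path a, long offset, long length)
            throws Exception {
        FileSystem fs = a.getFileSystem(conf);
        FileStatus status = fs.getFileStatus(a);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, offset, length);
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                    + " lives on " + Arrays.toString(block.getHosts()));
        }
    }
}

Does this look like a reasonable direction, or is there a more standard way to do occasional keyed lookups against a file like A?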
