On Thu, Jul 2, 2009 at 7:49 AM, Marcus Herou <[email protected]> wrote:
> Anyway what do you mean that if you HAVE to load it into a db? How the
> heck are my users gonna access our search engine without accessing a
> populated index or DB?

It is very common for hadoop to process data in files and leave results in files. Since hadoop is not suitable for interactive processes, it matters little how the data is kept in the middle of the process.

If your use-case requires that normal users see the results in a database, then exporting final results to a database may be necessary. If so, then a final bulk load is by far the best solution for this.

If your use-case requires that normal users be able to search the data using text-like searches, then a much better option is to do a final map-reduce job to spread your data across many shards that are each collected in a reducer. Each reducer can then build a Lucene index to be served using something like Katta. This allows you to maintain bulk-style operations until the very end and does not require any database operations at all. (There is a rough sketch of the sharding step at the end of this message.)

If you need to use data *from* a database, then it is usually a much better option to dump the database to HDFS and either use the side-data mechanism to share the data to all mappers or reducers, or just write a map-reduce program to move the data to the right place. It is almost never a good idea to have map-reduce programs make random queries to a database. If they even probe 1% of your database, you are probably much better off just using map-reduce to move the data to the right place instead. (The side-data trick is also sketched at the end of this message.)

If you have a combined case where data from a map-reduce should be in the database and then used by a subsequent map-reduce program, you should consider optimizing away the database dump and leaving the data in hadoop for the next map-reduce step. It is just fine to load it into the database each time, but don't waste your effort by dumping it back out again.

Finally, if you were to repurpose those 50 DB shards, you would be able to have a whomping fast hadoop cluster.

> I've been thinking about this in many use-cases and perhaps I am thinking
> totally wrong, but I tend to not be able to implement a shared-nothing arch
> all over the chain; it is simply not possible. Sometimes you need a DB,

What is a database? Is it just a bunch of records? If so, you can store it in a file. Do you really think you need random access? Try doing a join in map-reduce instead (there is a rough join sketch at the end of this message, too).

> sometimes you need an index, sometimes a distributed cache, sometimes a file
> on local FS/NFS/glusterFS (yes I prefer classical mount points over HDFS due
> to the non-wrapper characteristics of HDFS and its speed, random access IO
> that is).

If you are thinking about random access I/O, then you aren't generally thinking about map-reduce. The *whole* point is that you don't need most random accesses in a batch setting.

Take the classic terabyte update problem. You have 1 TB of data in 100-byte records. You need to update 1% of them.

Option 1 (traditional, random access): Assume 10 ms seek, 10 ms rotation, 100 MB/s transfer for your data. You have 10^12 / 10^2 = 10^10 records. You need to update 10^8 of them. Each update costs 10 ms seek, 10 ms rotation to read, 10 ms rotation to write = 30 ms. 10^8 of these require 3 Ms = 35 days.

Option 2 (map-reduce, sequential transfer): Read the entire database sequentially and write the entire database back, substituting the changed records. Assuming the same disk as before and assuming that we read the data in 1 GB chunks, we have 1000 reads and 1000 writes to do. Each one costs 1 seek and then 10^9 B / 10^8 B/s = 10 seconds to read and the same to write. Total time is 20 ks = 5.5 hours.

You can argue that option 2 is doing 100x more work than it should, but you have to account for the fact that it still runs more than 100 times faster.

This example is a cartoon, but it is surprisingly realistic. Whenever you say random access, I think that you are paying four orders of magnitude more in costs than you should. (The last sketch at the end of this message just redoes this arithmetic.)
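
Here is the sharding step as a rough sketch. The class and key names are made up, and this partitioner is really just what Hadoop's default HashPartitioner already does; the point is only that with one reduce task per shard, every document lands in exactly one shard, and each reducer can then feed its partition into a local Lucene IndexWriter and hand the finished index directory to Katta.

// Illustrative only: route each document to one of numShards reducers so
// that every reducer collects exactly one index shard.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShardPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text docId, Text doc, int numShards) {
    // Mask off the sign bit so the shard number is never negative.
    return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
  }
}

Set the number of reduce tasks to the number of shards you want and the routing takes care of itself.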
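
The side-data trick looks roughly like this with the DistributedCache. The path and the record layout are invented for the example; the real point is that the dumped table is shipped to every task once and read from local disk, so no task ever queries the database.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> products = new HashMap<String, String>();

  // Driver side: ship the dumped table with the job.
  public static void addSideData(Configuration conf) throws Exception {
    // Hypothetical HDFS path holding a tab-separated dump of the table.
    DistributedCache.addCacheFile(new URI("/data/db-dump/products.tsv"), conf);
  }

  // Task side: every mapper gets a local copy of the file and loads it once.
  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      String[] f = line.split("\t", 2);              // id TAB description
      products.put(f[0], f[1]);
    }
    in.close();
  }

  // Look the record up in memory instead of querying the database per record.
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] f = value.toString().split("\t", 2);    // productId TAB rest
    String description = products.get(f[0]);
    if (description != null) {
      context.write(new Text(f[0]), new Text(description + "\t" + f[1]));
    }
  }
}

This only works when the dump fits comfortably in memory. If it does not, do the work as a full map-reduce join, as in the next sketch.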
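
And here is the kind of join I mean, again as a sketch rather than production code, with made-up file names and a made-up users/orders layout. The mapper tags each record with the input it came from and keys it by the join field; the reducer then sees both sides of a key together and emits the joined records. There is no random access anywhere.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tag each record with the file it came from so the reducer can tell
  // the two sides of the join apart.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      String tag = file.startsWith("users") ? "U" : "O";
      String[] fields = value.toString().split("\t", 2);   // joinKey TAB rest
      context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
    }
  }

  // For each join key, buffer the two sides and emit their cross product.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> users = new ArrayList<String>();
      List<String> orders = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("U".equals(parts[0])) { users.add(parts[1]); } else { orders.add(parts[1]); }
      }
      for (String u : users) {
        for (String o : orders) {
          context.write(key, new Text(u + "\t" + o));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // dir with users.txt and orders.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}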
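
Finally, in case anyone wants to poke at the numbers, here is the arithmetic from the terabyte example as a tiny program. The assumptions are exactly the ones above (10 ms seek, 10 ms rotation, 100 MB/s transfer, 1 TB of 100-byte records, 1% updated); plug in your own disk numbers if you like.

// Back-of-the-envelope comparison of random updates against a full
// sequential rewrite, using the assumptions stated in the message above.
public class UpdateCostEstimate {
  public static void main(String[] args) {
    double seek = 0.010, rotation = 0.010;     // seconds
    double transfer = 100e6;                   // bytes per second
    double dataBytes = 1e12, recordBytes = 100, updateFraction = 0.01;

    // Option 1: each updated record pays one seek, one rotation to read
    // and one rotation to write.
    double updates = updateFraction * dataBytes / recordBytes;     // 10^8
    double randomSeconds = updates * (seek + 2 * rotation);
    System.out.printf("random access:      %.0f s = %.1f days%n",
        randomSeconds, randomSeconds / 86400);

    // Option 2: read and rewrite the whole terabyte in 1 GB chunks.
    double chunkBytes = 1e9;
    double chunks = dataBytes / chunkBytes;                        // 1000
    double sequentialSeconds = 2 * chunks * (seek + chunkBytes / transfer);
    System.out.printf("sequential rewrite: %.0f s = %.1f hours%n",
        sequentialSeconds, sequentialSeconds / 3600);
  }
}

Both come out about where I said: roughly 3 Ms for the random version against roughly 20 ks for the sequential rewrite.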

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
