That's awesome information, Marcus. I am working on a project that needs a similar architectural solution (although unlike you I can't broadcast the details), so that was very useful. One thing I can say, though, is that mine is in no way a competitor, being in a different area.

If I could find out more, that would be even better. For example, how do you do PageRank? I think I have seen a PageRank algorithm in MapReduce somewhere (Google playfully revealing the secret), and surely Pregel promises that code in about 15 lines.
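For what it's worth, what I have in mind is roughly one iteration of the textbook formulation per MapReduce job, along the lines of the sketch below (old mapred API; the tab-separated line format, class names and the fixed 0.85 damping are just my assumptions, and dangling pages are ignored):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One PageRank iteration. Assumed line format: pageId<TAB>rank<TAB>out1,out2,...
public class PageRankIteration {

  public static class PRMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String[] parts = line.toString().split("\t");
      String page = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String outlinks = parts.length > 2 ? parts[2] : "";
      // Pass the link structure through so the reducer can rebuild the line.
      out.collect(new Text(page), new Text("LINKS\t" + outlinks));
      if (outlinks.length() > 0) {
        String[] targets = outlinks.split(",");
        // Spread this page's current rank evenly over its outlinks.
        for (String target : targets) {
          out.collect(new Text(target), new Text(Double.toString(rank / targets.length)));
        }
      }
    }
  }

  public static class PRReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85;
    public void reduce(Text page, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      String outlinks = "";
      double sum = 0.0;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("LINKS\t")) {
          outlinks = v.substring("LINKS\t".length());
        } else {
          sum += Double.parseDouble(v);
        }
      }
      double newRank = (1.0 - DAMPING) + DAMPING * sum;
      // Emit the same line format so the output can feed the next iteration.
      out.collect(page, new Text(newRank + "\t" + outlinks));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PageRankIteration.class);
    conf.setJobName("pagerank-iteration");
    conf.setMapperClass(PRMapper.class);
    conf.setReducerClass(PRReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // previous iteration's output
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // next iteration's input
    JobClient.runJob(conf);
  }
}

A small driver would loop this job, feeding each iteration's output directory in as the next iteration's input, so it ends up rather more than Pregel's promised 15 lines.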
Cheers,
Mark

On Thu, Jul 2, 2009 at 3:09 PM, Marcus Herou <[email protected]> wrote:

> Hi.
>
> Guys, whoa, slow down! We are on the same side and agree on most things :)
>
> We are using Hadoop in these cases.
>
> BlogSearch:
> 1. Prospect crawling: a job exports a few hundred thousand URLs from a DB into a flat file and then lets a threaded MapRunnable go crazy crawling on that one and save the fetched data and response to GlusterFS during runtime. At the end of each URL's crawl the state is then moved from "fetchable" to "parsable" and mapped back into the DB. This is done during runtime but should be done afterwards.
> 2. Prospect parsing: again export the URLs from the DB in question into Hadoop, where we process them and perform n-gram analysis, spam detection, blog detection etc., and if everything goes well mark the URL as a valid blog and update the state accordingly - again during runtime, should be done afterwards.
> 3. Blog crawling: fetch the feed URLs, basically the same as case 1 but here we fetch the feed URLs instead.
> 4. Blog parsing: parse the feed URL into a SyndFeed object (Rome project) and process the SyndFeed and all its SyndEntries and persist them into a DB during runtime. This could be improved by emitting the entries instead of saving them to a DB at runtime. Each new entry notifies an indexer queue.
> 5. Indexing: the indexer polls the queue, loads the ids into a threaded MapRunnable and goes crazy updating SOLR.
>
> It is basically an extension and refinement of Nutch where we have refined the plugin architecture to use Spring, created wrappers around some existing plugins and changed the storage solution from the CrawlDB and LinkDB to a database instead.
>
> Log file analysis:
> We are, among other things, a blog advertising company, so statistics is key. We use Hadoop for 95% of the processing of the incoming access logs (well, not access logs but very alike) and then emit the data into MonetDB, which has great speed (but bad robustness) for grouping, counting etc.
>
> The end users are then using SOLR for searching and our sharded DB solution to fetch the item in question: the blog entry, the blog itself and so on. Perhaps we will use HBase, but after testing last year I ruled HBase out due to performance since I got more bang for the buck writing my own arch, sorry Stack...
>
> The architecture is quite good but just needs some tuning, which you pointed out. Whenever you can emit a record to the OutputCollector instead of updating the DB, you should.
>
> We now have 90 million blog entries in the DB (in 4 months) and have prospected over a billion URLs, so we are doing something right I hope.
>
> The originating question seems to have gone out of scope haha.
>
> Cheers lads
>
> //Marcus
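Your point about emitting a record to the OutputCollector instead of updating the DB is the one I keep coming back to. Just to make it concrete for anyone else reading along, the difference is roughly the reducer below (old mapred API; the key/value types and the "blog entry stats" example are made up): the job writes plain tab-separated part files, and a bulk loader (MySQL's LOAD DATA INFILE, MonetDB's COPY INTO, or whatever the target DB offers) imports them after the job has finished, so the cluster never holds thousands of live DB connections.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: sums per-entry counts and emits one tab-separated
// line per blog entry instead of doing a JDBC insert from inside the task.
public class EntryStatsReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, Text> {

  public void reduce(Text entryId, Iterator<LongWritable> counts,
                     OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    long total = 0;
    while (counts.hasNext()) {
      total += counts.next().get();
    }
    // TextOutputFormat writes "entryId<TAB>total"; the part-* files are
    // bulk-loaded into the DB in one go after the job completes.
    out.collect(entryId, new Text(Long.toString(total)));
  }
}

The live DB then only sees one bulk load per job instead of per-record traffic from every task.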
> On Thu, Jul 2, 2009 at 5:08 PM, Uri Shani <[email protected]> wrote:
>
> > Hi Marcus,
> > If you need a database to serve as an in-between to your customer, then you can do a massive load to the DB of the end results - as I suggested.
> > You can avoid that and work with files only. Then you need to serve your customers via an API to your HDFS files. Each query will start a M/R job to get the results. Don't know the response time - yet it is a viable approach to search.
> > There are also the Hadoop DBs: HBase and Hive. Pig may also have SQL-like things in the future, and there is the JAQL project http://www.jaql.org/.
> >
> > - Uri
> >
> > From: Marcus Herou <[email protected]>
> > To: [email protected]
> > Date: 02/07/2009 05:50 PM
> > Subject: Re: Parallell maps
> >
> > Hi.
> >
> > Yep, I reckon this. As I said, I have created a sharded DB solution where all data is currently spread across 50 shards. I totally agree that one should try to emit data to the OutputCollector, but when one has the data in memory I do not want to throw it away and re-read it into memory again later on, after the job is done, by chained Hadoop jobs... In our stats system we do exactly this: almost no DB all over the complex chain, just reducing and counting like crazy :)
> >
> > There is another pro to following your example, and that is that you risk less during the actual job since you do not have to create a DB pool of thousands of connections. And the jobs do not affect the live production DB = good.
> >
> > Anyway, what do you mean by "if you HAVE to load it into a DB"? How the heck are my users gonna access our search engine without accessing a populated index or DB?
> > I've been thinking about this in many use cases and perhaps I am thinking totally wrong, but I tend to not be able to implement a shared-nothing arch all over the chain; it is simply not possible. Sometimes you need a DB, sometimes you need an index, sometimes a distributed cache, sometimes a file on local FS/NFS/GlusterFS (yes, I prefer classical mount points over HDFS due to the non-wrapper characteristics of HDFS and its speed, random access IO that is).
> >
> > I will put a mental note to adapt and be more hadoopish :)
> >
> > Cheers
> >
> > //Marcus
> >
> > On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <[email protected]> wrote:
> >
> > > Hi,
> > > Whenever you try to access a DB in parallel you need to design it for that. This means, for instance, that you ensure that each of the parallel tasks inserts into a distinct "partition" or table in the database to avoid the conflicts and failures. Hadoop does the same in its reduce tasks - each reduce gets ALL the records that are needed to do its function, so there is a hash mapping keys to these tasks. The same principle you need to follow when linking DB partitions to your Hadoop tasks.
> > >
> > > In general, think about how not to use DB inserts from Hadoop. Rather, create your results in files. At the end of the process you can - if this is what you HAVE to do - massively load all the records into the database using efficient loading utilities.
> > >
> > > If you need the DB to communicate among your tasks, meaning that you need the inserts to be readily available for other threads to select, then it is obviously the wrong medium for such sharing and you need to look at other solutions to share consistent data among Hadoop tasks. For instance, ZooKeeper, etc.
> > >
> > > Regards,
> > > - Uri
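The "hash mapping keys to these tasks" point above is worth spelling out: if you really have to write to a sharded DB from the job, you can make the reduce side line up with the shards by plugging the shard layer's own hash into a Partitioner and running exactly one reduce task per shard, so each reducer only ever talks to one shard. A rough sketch (old mapred API; the hash and names are just placeholders for whatever the shard layer actually uses):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes every key to the reduce task that matches its DB shard.
// Run the job with conf.setNumReduceTasks(numberOfShards) and
// conf.setPartitionerClass(ShardPartitioner.class).
public class ShardPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, LongWritable value, int numReduceTasks) {
    // Use the same function the sharded DB layer uses to pick a shard,
    // e.g. hash(blogId) % 50, so reducer N only ever inserts into shard N.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

This is essentially what the default HashPartitioner does already; the point is just to use the same hash as the DB layer and match the number of reducers to the number of shards.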
> > > From: Marcus Herou <[email protected]>
> > > To: common-user <[email protected]>
> > > Date: 02/07/2009 12:13 PM
> > > Subject: Parallell maps
> > >
> > > Hi.
> > >
> > > I've noticed that Hadoop spawns parallel copies of the same task on different hosts. I've understood that this is to improve the performance of the job by prioritizing fast-running tasks. However, since we connect to databases in our jobs, this leads to conflicts when inserting, updating and deleting data (duplicated keys etc.). Yes, I know I should consider Hadoop a "shared nothing" architecture, but I really must connect to databases in the jobs. I've created a sharded DB solution which scales well, or I would be doomed...
> > >
> > > Any hints on how to disable this feature or how to reduce the impact of it?
> > >
> > > Cheers
> > >
> > > /Marcus
> > >
> > > --
> > > Marcus Herou CTO and co-founder Tailsweep AB
> > > +46702561312
> > > [email protected]
> > > http://www.tailsweep.com/
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > [email protected]
> > http://www.tailsweep.com/
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/
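PS. Since the question that started the thread ended up quoted all the way down here: those parallel copies of the same task are Hadoop's speculative execution, and it can be switched off per job when the tasks write to a DB. A minimal snippet with the old JobConf API (these are the pre-0.21 property names):

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculation {
  public static void disableSpeculativeExecution(JobConf conf) {
    // Don't launch backup attempts of slow-looking map tasks...
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    // ...nor of slow-looking reduce tasks, so a DB write never happens twice
    // just because one attempt was slow. Failed tasks are still retried,
    // so the writes should ideally be idempotent anyway.
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
  }
}

The same two properties can also be set cluster-wide in the site configuration, but per job seems saner when only the DB-writing jobs need it.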
