And of course we use Hadoop for calculating PageRank :) If anyone is interested in how we do that, just ask.
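Short version, for the curious: one iteration of PageRank maps nicely onto a single MapReduce pass, looped until the ranks settle. The sketch below is not our actual code (the class names and the "page <TAB> rank <TAB> outlinks" line format are just illustrative), but it shows the rough shape of such a job with the old org.apache.hadoop.mapred API:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // One PageRank iteration. Illustrative input/output line format:
    //   pageUrl <TAB> rank <TAB> comma-separated outlinks
    public class PageRankIteration {

      public static class PRMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          String[] fields = value.toString().split("\t");
          String page = fields[0];
          double rank = Double.parseDouble(fields[1]);
          String outlinks = fields.length > 2 ? fields[2] : "";
          String[] links = outlinks.isEmpty() ? new String[0] : outlinks.split(",");

          // Pass the adjacency list through so the reducer can rebuild the graph.
          out.collect(new Text(page), new Text("LINKS\t" + outlinks));

          // Spread this page's current rank evenly over its outlinks.
          for (String link : links) {
            out.collect(new Text(link), new Text(Double.toString(rank / links.length)));
          }
        }
      }

      public static class PRReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text page, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
          double sum = 0.0;
          String outlinks = "";
          while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("LINKS\t")) {
              outlinks = v.substring("LINKS\t".length());
            } else {
              sum += Double.parseDouble(v);
            }
          }
          // Simplified damping (d = 0.85); a real job also handles dangling pages.
          double rank = 0.15 + 0.85 * sum;
          out.collect(page, new Text(rank + "\t" + outlinks));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(PageRankIteration.class);
        conf.setJobName("pagerank-iteration");
        conf.setMapperClass(PRMapper.class);
        conf.setReducerClass(PRReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, args[0]);    // previous iteration's output
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                          // the convergence loop lives outside the job
      }
    }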
I see now that the log file stuff should have been placed below the SOLR stuff, in case I confused anyone. Finally: we will probably use Hive in the end, since our statistics architecture looks very much like statements formalized in a bunch of Hadoop jobs :) And Hive is so flexible! But there is so little time, and this solution works very well, so... don't fix what isn't broken?

/M

On Thu, Jul 2, 2009 at 10:09 PM, Marcus Herou <[email protected]> wrote:

> Hi.
>
> Guys, whoa, slow down! We are on the same side and agree on most things :)
>
> We are using Hadoop in these cases.
>
> BlogSearch:
> 1. Prospect Crawling: a job exports a few hundred thousand URLs from a DB into a flat file and then lets a threaded MapRunnable go crazy crawling them, saving the fetched data and responses to GlusterFS during runtime. At the end of each URL's crawl its state is moved from "fetchable" to "parsable" and mapped back into the DB. This is done during runtime but should be done afterwards.
> 2. Prospect Parsing: again, export the URLs in question from the DB into Hadoop, where we process them and perform n-gram analysis, spam detection, blog detection etc., and if everything goes well mark the URL as a valid blog and update its state accordingly, again during runtime; this too should be done afterwards.
> 3. Blog Crawling: basically the same as case 1, but here we fetch the feed URLs instead.
> 4. Blog Parsing: parse the feed URL into a SyndFeed object (ROME project), process the SyndFeed and all its SyndEntries, and persist them into a DB during runtime. This could be improved by emitting the entries instead of saving them to a DB at runtime. Each new entry notifies an indexer queue.
> 5. Indexing: the indexer polls the queue, loads the IDs into a threaded MapRunnable and goes crazy updating SOLR.
>
> It is basically an extension and refinement of Nutch, where we have refined the plugin architecture to use Spring, created wrappers around some existing plugins and changed the storage solution from the CrawlDB and LinkDB to a database instead.
>
> Log file analysis:
> We are, among other things, a blog advertising company, so statistics are key. We use Hadoop for 95% of the processing of the incoming access logs (well, not access logs, but very similar) and then emit the data into MonetDB, which has great speed (but poor robustness) for grouping, counting etc.
>
> The end users then use SOLR for searching and our sharded DB solution to fetch the item in question: the blog entry, the blog itself and so on. Perhaps we will use HBase, but after testing it last year I ruled HBase out on performance grounds, since I got more bang for the buck writing my own architecture, sorry Stack...
>
> The architecture is quite good but just needs some tuning, which you pointed out. Whenever you can emit a record to the OutputCollector instead of updating the DB, you should.
>
> We now have 90 million blog entries in the DB (in 4 months) and have prospected over a billion URLs, so we are doing something right, I hope.
>
> The originating question seems to have gone out of scope, haha.
>
> Cheers lads
>
> //Marcus
>
> On Thu, Jul 2, 2009 at 5:08 PM, Uri Shani <[email protected]> wrote:
>
>> Hi Marcus,
>> If you need a database to serve as an in-between to your customer, then you can do a massive load to the DB of the end results, as I suggested.
>> You can avoid that and work with files only. Then you need to serve your customers via an API to your HDFS files.
>> Each query will start an M/R job to get the results.
>> I don't know the response time, yet it is a viable approach to search.
>> There are also the Hadoop DBs: HBase and Hive. Pig may also have SQL-like things in the future, and there is the JAQL project http://www.jaql.org/.
>>
>> - Uri
>>
>> From: Marcus Herou <[email protected]>
>> To: [email protected]
>> Date: 02/07/2009 05:50 PM
>> Subject: Re: Parallell maps
>>
>> Hi.
>>
>> Yep, I recognize this. As I said, I have created a sharded DB solution where all data is currently spread across 50 shards. I totally agree that one should try to emit data to the OutputCollector, but when one has the data in memory I do not want to throw it away and re-read it into memory again later on, after the job is done, in chained Hadoop jobs... In our stats system we do exactly this: almost no DB anywhere in the complex chain, just reducing and counting like crazy :)
>>
>> There is another pro to following your example, and that is that you risk less during the actual job, since you do not have to create a DB pool of thousands of connections. And the jobs do not affect the live production DB = good.
>>
>> Anyway, what do you mean by "if you HAVE to load it into a DB"? How the heck are my users going to access our search engine without accessing a populated index or DB?
>> I've been thinking about this in many use cases and perhaps I am thinking totally wrong, but I tend not to be able to implement a shared-nothing architecture all the way through the chain; it is simply not possible. Sometimes you need a DB, sometimes an index, sometimes a distributed cache, sometimes a file on local FS/NFS/GlusterFS (yes, I prefer classical mount points over HDFS due to the non-wrapper characteristics of HDFS and its speed, random-access IO that is).
>>
>> I will put a mental note to adapt and be more hadoopish :)
>>
>> Cheers
>>
>> //Marcus
>>
>> On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <[email protected]> wrote:
>>
>> > Hi,
>> > Whenever you try to access a DB in parallel you need to design it for that. This means, for instance, that you ensure that each of the parallel tasks inserts into a distinct "partition" or table in the database to avoid the conflicts and failures. Hadoop does the same in its reduce tasks: each reduce gets ALL the records that are needed to do its function, so there is a hash mapping keys to these tasks. You need to follow the same principle when linking DB partitions to your Hadoop tasks.
>> >
>> > In general, think about how not to use DB inserts from Hadoop. Rather, create your results in files. At the end of the process you can, if this is what you HAVE to do, massively load all the records into the database using efficient loading utilities.
>> >
>> > If you need the DB to communicate among your tasks, meaning that you need the inserts to be readily available for other threads to select, then it is obviously the wrong medium for such sharing and you need to look at other solutions to share consistent data among Hadoop tasks. For instance, ZooKeeper, etc.
>> >
>> > Regards,
>> > - Uri
>> >
>> > From: Marcus Herou <[email protected]>
>> > To: common-user <[email protected]>
>> > Date: 02/07/2009 12:13 PM
>> > Subject: Parallell maps
>> >
>> > Hi.
>> >
>> > I've noticed that Hadoop spawns parallel copies of the same task on different hosts.
>> > I've understood that this is meant to improve the performance of the job by prioritizing fast-running tasks. However, since our jobs connect to databases, this leads to conflicts when inserting, updating and deleting data (duplicate keys etc.). Yes, I know I should consider Hadoop a "Shared Nothing" architecture, but I really must connect to databases in the jobs. I've created a sharded DB solution which scales as well, or I would be doomed...
>> >
>> > Any hints on how to disable this feature, or how to reduce its impact?
>> >
>> > Cheers
>> >
>> > /Marcus
>> >
>> > --
>> > Marcus Herou CTO and co-founder Tailsweep AB
>> > +46702561312
>> > [email protected]
>> > http://www.tailsweep.com/
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> [email protected]
>> http://www.tailsweep.com/
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
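For reference, the duplicate task attempts asked about at the bottom of the thread come from speculative execution, which can be switched off per job. A minimal sketch with the old-API JobConf follows (the class name is illustrative; the same effect can be had by setting the two mapred.*.tasks.speculative.execution properties to false in the job configuration):

    import org.apache.hadoop.mapred.JobConf;

    // Configuration helper for jobs whose tasks have side effects (e.g. database
    // writes): with speculative execution off, Hadoop never launches a second,
    // duplicate attempt of a task that is merely running slowly.
    public class NonSpeculativeJobConf {

      public static JobConf create(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        conf.setMapSpeculativeExecution(false);    // mapred.map.tasks.speculative.execution
        conf.setReduceSpeculativeExecution(false); // mapred.reduce.tasks.speculative.execution
        return conf;
      }
    }

Note that this only removes the speculative duplicates; a task attempt that fails and is retried will still replay its writes, so idempotent inserts (or writing files and bulk-loading them afterwards, as suggested in the thread) remain worthwhile.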

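And a minimal illustration of the "create your results in files, then load them massively" pattern suggested in the thread. The reducer, key names and table name below are made up for the example; the point is simply that reduce() emits rows via the OutputCollector instead of opening a JDBC connection per task:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Emits tab-separated rows instead of INSERTing from inside the task. The
    // job's part-* output files can then be bulk-loaded in one go, e.g. with
    // MySQL's LOAD DATA INFILE into a (hypothetical) blog_entry_stats table.
    public class EmitInsteadOfInsertReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

      public void reduce(Text blogEntryId, Iterator<Text> values,
                         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        long count = 0;
        while (values.hasNext()) {
          values.next();
          count++;
        }
        // TextOutputFormat writes "key<TAB>value" lines, ready for a bulk loader.
        out.collect(blogEntryId, new Text(Long.toString(count)));
      }
    }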