And of course we use Hadoop for calculating PageRank :) If anyone is
interested in how we do that, please just ask.
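
In case anyone is curious right away, here is a minimal sketch of what one
iteration boils down to with the old mapred API (the tab-separated
"url \t rank \t outlinks" line layout and the class names are just
assumptions for the example, not our exact format):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// One simplified PageRank iteration: the mapper spreads each page's rank over
// its outlinks, the reducer sums the contributions and applies the damping
// factor. Dangling pages simply leak their rank in this simplified version.
public class PageRankIteration {

  public static class PRMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      // input line: "url \t rank \t out1,out2,..."
      String[] parts = line.toString().split("\t");
      String page = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String[] links = parts.length > 2 ? parts[2].split(",") : new String[0];
      // pass the link structure through so the reducer can rebuild the line
      out.collect(new Text(page),
                  new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));
      // give each outlink an equal share of this page's current rank
      for (String target : links) {
        out.collect(new Text(target), new Text(Double.toString(rank / links.length)));
      }
    }
  }

  public static class PRReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text page, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter rep) throws IOException {
      double sum = 0.0;
      String links = "";
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("LINKS\t")) {
          links = v.substring(6);           // keep the outlinks for the next iteration
        } else {
          sum += Double.parseDouble(v);
        }
      }
      double newRank = 0.15 + 0.85 * sum;   // damping factor 0.85
      out.collect(page, new Text(newRank + "\t" + links));
    }
  }
}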

I see now that the log file part should have been placed below the SOLR
part - sorry if I confused anyone.

Finally: we will probably use Hive in the end, since our statistics
architecture looks very much like statements formalized as a bunch of
Hadoop jobs :) And Hive is so flexible! But there is so little time, and
this solution works very well, so... don't fix what isn't broken?

/M

On Thu, Jul 2, 2009 at 10:09 PM, Marcus Herou <[email protected]> wrote:

> Hi.
>
> Guys, whoa, slow down! We are on the same side and agree on most things :)
>
> We are using Hadoop in the following cases.
>
> BlogSearch:
> 1. Prospect Crawling: a job exports a few hundred thousand URLs from a DB
> into a flat file and then lets a threaded MapRunnable go crazy crawling them
> (a rough sketch of this pattern follows below the list), saving the fetched
> data and responses to GlusterFS at runtime. At the end of each URL's crawl,
> its state is moved from "fetchable" to "parsable" and written back to the
> DB. This happens at runtime but should be done afterwards.
> 2. Prospect Parsing: again, export the URLs in question from the DB into
> Hadoop, where we process them with n-gram analysis, spam detection, blog
> detection etc. and, if everything goes well, mark the URL as a valid blog
> and update its state accordingly - again at runtime, when it should be done
> afterwards.
> 3. Blog Crawling: basically the same as case 1, but here we fetch the feed
> URLs instead.
> 4. Blog Parsing: parse each feed URL into a SyndFeed object (ROME project),
> process the SyndFeed and all its SyndEntries, and persist them into a DB at
> runtime. This could be improved by emitting the entries instead of saving
> them to a DB at runtime. Each new entry notifies an indexer queue.
> 5. Indexing: the indexer polls the queue, loads the IDs into a threaded
> MapRunnable and goes crazy updating SOLR.
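>
> To make steps 1 and 3 concrete (step 5 uses the same threaded pattern
> against SOLR), here is a minimal sketch of the threaded MapRunnable idea.
> Class, property and helper names are made up for the example, and the real
> HTTP fetch plus the GlusterFS/DB bookkeeping are left out:
>
> import java.io.IOException;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapRunnable;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.RecordReader;
> import org.apache.hadoop.mapred.Reporter;
>
> // Reads one URL per line from the exported flat file and fetches them with a
> // pool of threads, emitting (url, content) instead of touching the DB mid-job.
> public class FetchMapRunner implements MapRunnable<LongWritable, Text, Text, Text> {
>
>   private int threads;
>
>   public void configure(JobConf job) {
>     threads = job.getInt("fetch.threads", 20);   // fetchers per map task (assumed property)
>   }
>
>   public void run(RecordReader<LongWritable, Text> input,
>                   final OutputCollector<Text, Text> output,
>                   final Reporter reporter) throws IOException {
>     ExecutorService pool = Executors.newFixedThreadPool(threads);
>     LongWritable key = input.createKey();
>     Text value = input.createValue();
>     while (input.next(key, value)) {
>       final String url = value.toString();
>       pool.execute(new Runnable() {
>         public void run() {
>           try {
>             String content = fetch(url);         // HTTP fetch, left out here
>             synchronized (output) {              // OutputCollector is not thread safe
>               output.collect(new Text(url), new Text(content));
>             }
>             reporter.progress();
>           } catch (Exception e) {
>             // in the real job we record failed URLs; ignored in this sketch
>           }
>         }
>       });
>     }
>     pool.shutdown();
>     try {
>       pool.awaitTermination(1, TimeUnit.HOURS);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     }
>   }
>
>   private String fetch(String url) throws IOException {
>     return "";                                   // placeholder for the real fetcher
>   }
> }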
>
> It is basically an extension and refinement of Nutch, where we have
> reworked the plugin architecture to use Spring, created wrappers around
> some existing plugins, and changed the storage solution from the CrawlDB
> and LinkDB to a database instead.
>
> Log file analysis:
> We are, among other things, a blog advertising company, so statistics are
> key. We use Hadoop for 95% of the processing of the incoming access logs
> (well, not quite access logs, but very similar) and then emit the data into
> MonetDB, which has great speed (but poor robustness) for grouping, counting
> etc.
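>
> As a flavour of what those jobs look like, here is a minimal sketch of one
> such aggregation job (the tab-separated record layout of "timestamp, blogId,
> eventType, ..." and the class names are assumptions for the example); the
> reducer output is a plain delimited file that we then bulk load into MonetDB:
>
> import java.io.IOException;
> import java.util.Iterator;
>
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reducer;
> import org.apache.hadoop.mapred.Reporter;
>
> // Counts events per (day, blog, event type); one output row per combination.
> public class EventCounts {
>
>   public static class LogMapper extends MapReduceBase
>       implements Mapper<LongWritable, Text, Text, LongWritable> {
>     private static final LongWritable ONE = new LongWritable(1);
>
>     public void map(LongWritable offset, Text line,
>                     OutputCollector<Text, LongWritable> out, Reporter rep)
>         throws IOException {
>       String[] f = line.toString().split("\t");
>       String day = f[0].substring(0, 10);               // yyyy-MM-dd part of the timestamp
>       out.collect(new Text(day + "\t" + f[1] + "\t" + f[2]), ONE);
>     }
>   }
>
>   public static class LogReducer extends MapReduceBase
>       implements Reducer<Text, LongWritable, Text, LongWritable> {
>
>     public void reduce(Text key, Iterator<LongWritable> values,
>                        OutputCollector<Text, LongWritable> out, Reporter rep)
>         throws IOException {
>       long sum = 0;
>       while (values.hasNext()) {
>         sum += values.next().get();
>       }
>       out.collect(key, new LongWritable(sum));          // "day \t blogId \t event \t count"
>     }
>   }
> }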
>
> The end users then use SOLR for searching and our sharded DB solution to
> fetch the item in question: the blog entry, the blog itself and so on.
> Perhaps we will use HBase, but after testing it last year I ruled it out on
> performance grounds, since I got more bang for the buck writing my own
> architecture - sorry, Stack...
>
> The architecture is quite good but just needs some of the tuning you
> pointed out: whenever you can emit a record to the OutputCollector instead
> of updating the DB, you should.
>
> We now have 90 million blog entries in the DB (after 4 months) and have
> prospected over a billion URLs, so we are doing something right, I hope.
>
> The original question seems to have gone out of scope, haha.
>
> Cheers lads
>
> //Marcus
>
> On Thu, Jul 2, 2009 at 5:08 PM, Uri Shani <[email protected]> wrote:
>
>> Hi Marcus,
>> If you need a database to serve as an in-between for your customers, then
>> you can do a massive load of the end results into the DB - as I suggested.
>> You can also avoid that and work with files only. Then you need to serve
>> your customers via an API on top of your HDFS files, and each query will
>> start an M/R job to get the results.
>> I don't know the response time - yet it is a viable approach to search.
>> There are also the Hadoop DBs: HBase and Hive. Pig may also have SQL-like
>> things in the future, and there is the JAQL project http://www.jaql.org/.
>>
>> - Uri
>>
>>
>>
>> From: Marcus Herou <[email protected]>
>> To: [email protected]
>> Date: 02/07/2009 05:50 PM
>> Subject: Re: Parallell maps
>>
>>
>>
>> Hi.
>>
>> Yep, I reckon this. As I said, I have created a sharded DB solution where
>> all data is currently spread across 50 shards. I totally agree that one
>> should try to emit data to the OutputCollector, but when I already have
>> the data in memory I do not want to throw it away and re-read it into
>> memory later on in chained Hadoop jobs after the job is done... In our
>> stats system we do exactly this: almost no DB anywhere in the complex
>> chain, just reducing and counting like crazy :)
>>
>> There is another pro to following your example: you risk less during the
>> actual job, since you do not have to create a DB pool of thousands of
>> connections. And the jobs do not affect the live production DB = good.
>>
>> Anyway, what do you mean by "if you HAVE to load it into a DB"? How the
>> heck are my users going to access our search engine without a populated
>> index or DB?
>> I've been thinking about this for many use cases and perhaps I am thinking
>> totally wrong, but I find I cannot implement a shared-nothing architecture
>> across the whole chain; it is simply not possible. Sometimes you need a DB,
>> sometimes you need an index, sometimes a distributed cache, sometimes a
>> file on a local FS/NFS/GlusterFS (yes, I prefer classical mount points over
>> HDFS due to the non-wrapper characteristics of HDFS and its speed - random
>> access IO, that is).
>>
>> I will make a mental note to adapt and be more Hadoopish :)
>>
>> Cheers
>>
>> //Marcus
>>
>> On Thu, Jul 2, 2009 at 4:27 PM, Uri Shani <[email protected]> wrote:
>>
>> > Hi,
>> > Whenever you access a DB in parallel, you need to design for it. This
>> > means, for instance, ensuring that each of the parallel tasks inserts
>> > into a distinct "partition" or table in the database, to avoid conflicts
>> > and failures. Hadoop does the same in its reduce tasks - each reduce gets
>> > ALL the records needed to do its work, via a hash mapping keys to tasks.
>> > You need to follow the same principle when linking DB partitions to your
>> > Hadoop tasks.
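>> >
>> > A minimal illustration of that principle with the old mapred API (the
>> > hash here is just a stand-in for whatever function the DB layer uses to
>> > pick a partition/shard):
>> >
>> > import org.apache.hadoop.io.Text;
>> > import org.apache.hadoop.mapred.JobConf;
>> > import org.apache.hadoop.mapred.Partitioner;
>> >
>> > // Routes every record destined for DB partition N to reduce task N, so
>> > // only that one reducer ever writes to that partition.
>> > public class ShardPartitioner implements Partitioner<Text, Text> {
>> >   public void configure(JobConf job) {}
>> >
>> >   public int getPartition(Text key, Text value, int numReduceTasks) {
>> >     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
>> >   }
>> > }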
>> >
>> > In general, think about how to avoid DB inserts from Hadoop. Rather,
>> > create your results in files.
>> > At the end of the process, you can - if this is what you HAVE to do -
>> > bulk load all the records into the database using efficient loading
>> > utilities.
>> >
>> > If you need the DB to communicate among your tasks, meaning that you
>> > need the inserts to be readily available for other threads to select,
>> > then it is obviously the wrong medium for such sharing, and you need to
>> > look at other solutions for sharing consistent data among Hadoop tasks -
>> > for instance, ZooKeeper, etc.
>> >
>> > Regards,
>> > - Uri
>> >
>> >
>> >
>> > From: Marcus Herou <[email protected]>
>> > To: common-user <[email protected]>
>> > Date: 02/07/2009 12:13 PM
>> > Subject: Parallell maps
>> >
>> >
>> >
>> > Hi.
>> >
>> > I've noticed that Hadoop spawns parallel copies of the same task on
>> > different hosts. I understand that this is done to improve job
>> > performance by letting the fastest copy of a task win. However, since
>> > our jobs connect to databases, this leads to conflicts when inserting,
>> > updating and deleting data (duplicate keys etc.). Yes, I know I should
>> > treat Hadoop as a "shared nothing" architecture, but I really must
>> > connect to databases in the jobs. I've created a sharded DB solution
>> > which scales well, or I would be doomed...
>> >
>> > Any hints on how to disable this feature, or how to reduce its impact?
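>> >
>> > (For reference: the feature in question is speculative execution, and it
>> > can be switched off per job. A minimal sketch with the old JobConf API,
>> > wrapped in a made-up helper class for the example:)
>> >
>> > import org.apache.hadoop.mapred.JobConf;
>> >
>> > public class NoSpeculation {
>> >   public static JobConf conf(Class<?> jobClass) {
>> >     JobConf conf = new JobConf(jobClass);
>> >     conf.setMapSpeculativeExecution(false);     // no backup copies of map tasks
>> >     conf.setReduceSpeculativeExecution(false);  // no backup copies of reduce tasks
>> >     // equivalent properties: mapred.map.tasks.speculative.execution and
>> >     // mapred.reduce.tasks.speculative.execution
>> >     return conf;
>> >   }
>> > }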
>> >
>> > Cheers
>> >
>> > /Marcus
>> >
>> > --
>> > Marcus Herou CTO and co-founder Tailsweep AB
>> > +46702561312
>> > [email protected]
>> > http://www.tailsweep.com/
>> >
>> >
>> >
>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> [email protected]
>> http://www.tailsweep.com/
>>
>>
>>
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [email protected]
> http://www.tailsweep.com/
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
