Re: TU Berlin Winter of Code Project

Ken Krugler Sat, 07 Nov 2009 16:23:52 -0800

Hi Max (& Ted),

On Nov 6, 2009, at 11:57am, Ted Dunning wrote:

The question that I don't see addressed is whether you choose to use a fully streaming approach as is done in Bixo or whether you will use a document repository approach as is more common in most search engines.

I think the issue here isn't about streaming vs. document repository - all systems have elements of both, it's just that...

a. Bixo exposes this more explicitly, by focusing on the workflow aspects of web mining.

But Nutch also has sequences of map-reduce tasks that are run during a crawl (e.g. filter URLs, group them, then fetch & parse).
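
As a rough illustration of that kind of staged crawl, here is a plain Python sketch. It is not Nutch or Bixo code: the function names, the grouping-by-host choice, and the stubbed fetcher are all assumptions made up for the example. Each function stands in for what would be one map-reduce job.

```python
from collections import defaultdict
from urllib.parse import urlparse


def filter_urls(urls):
    """Stage 1: keep only http(s) URLs and drop duplicates."""
    seen = set()
    kept = []
    for url in urls:
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


def group_by_host(urls):
    """Stage 2: group URLs by host (an assumed grouping key)."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


def fetch_and_parse(groups, fetcher):
    """Stage 3: fetch each URL and keep a trivial 'parse' (a word count)."""
    results = {}
    for host, urls in groups.items():
        for url in urls:
            content = fetcher(url)
            results[url] = len(content.split())
    return results


def fake_fetcher(url):
    # Hypothetical stand-in for a real HTTP fetch.
    return "hello from " + url


seeds = [
    "http://a.example/1",
    "ftp://a.example/x",
    "http://b.example/",
    "http://a.example/1",
]
parsed = fetch_and_parse(group_by_host(filter_urls(seeds)), fake_fetcher)
```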

b. Bixo doesn't have a baked in URL database, or file-system scheme for saving content.

If you look at the example SimpleCrawlTool class in Bixo, for example, you'll see that it (similar to Nutch) is using a SequenceFile to store the URL state, and sequence files in sub-directories for fetched content & parse results.

But Bixo just does the simple thing of propagating the URL state forward into successive crawl directories, versus updating a single URL database. Having a URL DB is what you'd want for large-scale web crawling.
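
The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dicts stand in for crawl directories and for a URL database, and the status strings "UNFETCHED" and "FETCHED" are hypothetical names, not Bixo's.

```python
def propagate_forward(previous_state, fetched):
    """Build a new state for the next crawl directory; the old one is left as-is."""
    next_state = dict(previous_state)  # copy, so the previous loop's state survives
    for url in fetched:
        next_state[url] = "FETCHED"
    return next_state


def update_in_place(url_db, fetched):
    """Mutate one long-lived URL store instead of writing a new copy."""
    for url in fetched:
        url_db[url] = "FETCHED"
    return url_db


# Loop 0 state, as it might sit in the first crawl directory.
loop0 = {"http://a.example/": "UNFETCHED", "http://b.example/": "UNFETCHED"}

# Propagating forward: loop1 is a new state, loop0 is unchanged.
loop1 = propagate_forward(loop0, ["http://a.example/"])

# Single URL DB: the same store is updated on every loop.
url_db = dict(loop0)
update_in_place(url_db, ["http://a.example/"])
```

Both end up with the same current state; the first rewrites the whole state on every loop, which is the cost that grows with the size of the crawl.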

If you wanted to configure Bixo to use HBase to store the URL state and fetched/parsed content, you'd use an HBase tap (in Cascading-speak) versus the Hfs tap.
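
The idea of swapping the tap while leaving the workflow alone can be sketched as follows. This is plain Python and deliberately not Cascading's API: the two classes, their read/write methods, and run_workflow are hypothetical names invented for the example, and both stores are in-memory stand-ins.

```python
class InMemoryFileTap:
    """Hypothetical stand-in for file-based storage: an append-only record list."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))

    def read(self):
        return list(self.records)


class InMemoryTableTap:
    """Hypothetical stand-in for a keyed table: a later write replaces the row."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value

    def read(self):
        return sorted(self.rows.items())


def run_workflow(source, sink):
    """The workflow only relies on read() and write(), not on the storage type."""
    for url, status in source.read():
        sink.write(url, "FETCHED" if status == "UNFETCHED" else status)


source = InMemoryFileTap()
source.write("http://a.example/", "UNFETCHED")

file_sink = InMemoryFileTap()
table_sink = InMemoryTableTap()
run_workflow(source, file_sink)   # file-style storage
run_workflow(source, table_sink)  # table-style storage, same workflow
```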

HBase is reputedly ready enough to serve as a document repository. Using such an approach would be very helpful for the incremental nature of web crawls.

I'd gotten the same input from Andrew Purtell, who's been able to stream lots of crawl data into HBase, after a bit of fiddling with configuration settings and also some patching on the writer side of things.

As for pre-processing and feature extraction, both could be implemented as Cascading operations (that wind up mapping to Hadoop tasks).

As Ted noted, actually doing the named entity extraction and feature extraction will be the real challenge.


See this talk for an example of doing web mining using Bixo - 
http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit

-- Ken

On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll <[email protected]> wrote:
This is obviously only a first draft of what we think would be a suited overall architecture
--
Ted Dunning, CTO
DeepDyve


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
