Hi Max (& Ted),
On Nov 6, 2009, at 11:57am, Ted Dunning wrote:
> The question that I don't see addressed is whether you choose to use a
> fully streaming approach as is done in Bixo or whether you will use a
> document repository approach as is more common in most search engines.
I think the issue here isn't about streaming vs. document repository -
all systems have elements of both, it's just that...
a. Bixo exposes this more explicitly, by focusing on the workflow
aspects of web mining.
But Nutch also has sequences of map-reduce tasks that are run during a
crawl (e.g. filter URLs, group them, then fetch & parse).
b. Bixo doesn't have a baked-in URL database, or file-system scheme
for saving content.
If you look at the SimpleCrawlTool example class in Bixo, you'll see
that it (similar to Nutch) uses a SequenceFile to store the URL state,
and sequence files in sub-directories for fetched content & parse
results.
But Bixo just does the simple thing of propagating the URL state
forward into successive crawl directories, versus updating a single
URL database. Having a URL DB is what you'd want for large-scale web
crawling.
If you wanted to configure Bixo to use HBase to store the URL state
and fetched/parsed content, you'd use an HBase tap (in
Cascading-speak) versus the Hfs tap.
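To make the difference between propagating URL state forward and
updating a single URL database concrete, here is a rough sketch in
plain Python. It is not Bixo, Nutch or Cascading code: the function
names, the status strings and the stand-in fetcher are all made up for
illustration, and dicts stand in for SequenceFiles and the URL DB.

```python
# Hypothetical sketch: plain dicts stand in for SequenceFiles and a URL
# database. None of these names come from Bixo or Nutch.

def fetch(url):
    """Stand-in fetcher: pretend every fetch succeeds."""
    return "FETCHED"


def crawl_loop_propagating(seed_state, loops):
    """Each loop writes a complete new copy of the URL state."""
    crawl_dirs = [dict(seed_state)]      # "directory" 0 holds the seed state
    for _ in range(loops):
        previous = crawl_dirs[-1]
        current = {}
        for url, status in previous.items():
            current[url] = fetch(url) if status == "UNFETCHED" else status
        crawl_dirs.append(current)       # older directories stay untouched
    return crawl_dirs


def crawl_loop_single_db(url_db, loops):
    """Each loop updates one shared URL database in place."""
    for _ in range(loops):
        for url, status in list(url_db.items()):
            if status == "UNFETCHED":
                url_db[url] = fetch(url)
    return url_db
```

The first version keeps every loop's state around, at the cost of
rewriting the whole URL set each time. The second touches only the
entries that change, which is why it suits a large crawl better.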
> HBase is reputedly ready enough to serve as a document repository.
> Using such an approach would be very helpful for the incremental
> nature of web crawls.
I'd gotten the same input from Andrew Purtell, who's been able to
stream lots of crawl data into HBase, after a bit of fiddling with
configuration settings and also some patching on the writer side of
things.
As for pre-processing and feature extraction, both could be
implemented as Cascading operations (which wind up mapping to Hadoop
tasks).
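As a rough illustration of chaining pre-processing and feature
extraction as separate operations, here is a plain-Python sketch. It
does not use the Cascading API; the function names, the tag-stripping
regex and the choice of term counts as "features" are assumptions made
for the example only.

```python
import re
from collections import Counter


def preprocess(docs):
    """Strip markup and lowercase each (url, html) record."""
    for url, html in docs:
        text = re.sub(r"<[^>]+>", " ", html)      # crude tag removal
        yield url, " ".join(text.lower().split())


def extract_features(docs):
    """Turn each (url, text) record into (url, term counts)."""
    for url, text in docs:
        yield url, Counter(re.findall(r"[a-z]+", text))


def run_pipeline(docs):
    """Chain the two operations, one record at a time."""
    return dict(extract_features(preprocess(docs)))
```

Each stage consumes and emits records, so a stage can be swapped out
(say, for a real named entity extractor) without touching the others.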
As Ted noted, actually doing the named entity extraction and feature
extraction will be the real challenge.
See this talk for an example of doing web mining using Bixo -
http://www.slideshare.net/sh1mmer/the-bixo-web-mining-toolkit
-- Ken
> On Fri, Nov 6, 2009 at 11:47 AM, Grant Ingersoll
> <[email protected]> wrote:
> This is obviously only a first draft of what we think would be a
> suitable overall architecture
> --
> Ted Dunning, CTO
> DeepDyve
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g