The goal is to avoid entering 100,000 regexes in crawl-urlfilter.xml and checking ALL of these regexes against each URL. Any comments?
Sure seems like just some hash lookup table could handle it. I am having a hard time seeing when you really need a regex and a fixed list wouldn't do. Especially
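For what it's worth, here is a minimal sketch of the hash-lookup idea in Java: load the allowed hosts into a HashSet once, then accept or reject each URL with a single O(1) lookup instead of 100,000 regex matches. The class name and host-list format are made up for illustration (this is not Matt's plugin), though it follows the return-URL-on-accept, return-null-on-reject convention that Nutch URL filters use.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

/**
 * Hash-based URL filter sketch: one HashSet lookup per URL
 * instead of matching every URL against a long regex list.
 */
public class HostSetUrlFilter {
    private final Set<String> allowedHosts = new HashSet<String>();

    /** Loads one host per line; blank lines and #-comments are skipped. */
    public HostSetUrlFilter(String hostListFile) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(hostListFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim().toLowerCase();
                if (line.length() > 0 && !line.startsWith("#"))
                    allowedHosts.add(line);
            }
        } finally {
            in.close();
        }
    }

    /** Returns the URL unchanged if its host is on the list, else null. */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            return allowedHosts.contains(host) ? urlString : null;
        } catch (MalformedURLException e) {
            return null; // unparseable URLs are rejected
        }
    }
}

A plain HashSet only handles exact host matches, of course; anything that needs prefix or wildcard matching is where a regex (or a trie) would still earn its keep.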
Matt,
This is great! It would be very useful to Nutch developers if your code could be shared. I'm sure quite a few applications will benefit from it, because it fills a gap between whole-web crawling and single-site (or a handful of sites) crawling. I'll be interested in adapting your plugin.
I'm going to file a request in JIRA now. -AJ
--- Matt Kangas [EMAIL PROTECTED] wrote:
Great! Is there a ticket in JIRA requesting this
feature? If not, we
should file one and get a few votes in favor of it.
AFAIK, that's the
process for getting new features into Nutch.
I assume that in most NDFS-based configurations the production search system will not run directly out of NDFS. Rather, indexes will be created offline for a deployment (i.e., merging things to create an index per search node), then copied out of NDFS to the local filesystem on a production search node.
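A minimal sketch of that deploy step, written against the Hadoop FileSystem API (the descendant of NDFS; class names in the NDFS era differed, so treat these as stand-ins, and the paths are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Pulls a merged index out of the distributed filesystem onto a
 * search node's local disk, so the searcher never reads from DFS.
 */
public class IndexDeployer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf);

        Path merged = new Path("/indexes/merged");   // index built offline in DFS
        Path local  = new Path("/var/nutch/index");  // local dir the searcher reads

        // Copy the index out of DFS; queries are then served from
        // the local filesystem, not over the network.
        dfs.copyToLocalFile(merged, local);
    }
}

Serving queries from local disk avoids a network round trip on every index read, which is the point of copying the index out rather than searching it in place.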
I'm pretty new to Nutch, but in reading through the mailing lists and other papers, I don't think I've really seen any discussion on using NDFS with respect to automating an end-to-end workflow for data that is going to be searched (fetch-index-merge-search).
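For what it's worth, one rough way to automate such a cycle is to drive the Nutch command-line tools in sequence from a small program. The subcommand names and arguments below follow the 0.8-style tools but vary across Nutch versions, and the segment path is a placeholder, so treat this as a sketch rather than a verified recipe:

import java.util.Arrays;

/**
 * Sketch of automating one fetch-index round by shelling out to the
 * Nutch CLI. Paths and the segment name are placeholders; a real
 * script would capture the segment name that "generate" prints.
 */
public class CrawlCycle {

    private static void run(String... cmd) throws Exception {
        System.out.println("running: " + Arrays.toString(cmd));
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0)
            throw new RuntimeException("step failed: " + Arrays.toString(cmd));
    }

    public static void main(String[] args) throws Exception {
        String seg = "crawl/segments/20060101000000"; // placeholder segment

        run("bin/nutch", "generate", "crawl/crawldb", "crawl/segments");
        run("bin/nutch", "fetch", seg);
        run("bin/nutch", "updatedb", "crawl/crawldb", seg);
        run("bin/nutch", "invertlinks", "crawl/linkdb", seg);
        run("bin/nutch", "index", "crawl/indexes", "crawl/crawldb",
            "crawl/linkdb", seg);
    }
}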
The few crawler designs I'm familiar with