Re: Automating workflow using ndfs

2005-09-02 Thread Earl Cahill
The goal is to avoid entering 100,000 regexes in crawl-urlfilter.xml and checking ALL these regexes for each URL. Any comments? Sure seems like just some hash lookup table could handle it. I am having a hard time seeing when you really need a regex and a fixed list wouldn't do. Especially
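
The tradeoff here is worth spelling out: crawl-urlfilter.xml is scanned pattern by pattern, so filtering cost grows with the number of regexes, while a hash set answers a host-membership question in constant time no matter how long the list is. A minimal sketch of the fixed-list idea, assuming the filter only needs to key on hostnames (the class below is illustrative, not Nutch's actual URLFilter plugin interface):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

/** Illustrative host allow-list filter: one hash lookup per URL
 *  instead of a scan over 100,000 regexes. Not Nutch's URLFilter API. */
public class HostSetFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    public HostSetFilter(Iterable<String> hosts) {
        for (String h : hosts) {
            allowedHosts.add(h.toLowerCase());
        }
    }

    /** Returns the URL unchanged if its host is on the list, else null
     *  (mirroring the accept/reject convention of a URL filter). */
    public String filter(String urlString) {
        try {
            String host = new URL(urlString).getHost().toLowerCase();
            return allowedHosts.contains(host) ? urlString : null;
        } catch (MalformedURLException e) {
            return null; // unparseable URLs are rejected
        }
    }
}

With 100,000 entries this stays a single hash probe per URL; regexes only earn their keep when you genuinely need pattern structure (wildcard paths, query parameters), which is exactly the question the post raises.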

Re: Automating workflow using ndfs

2005-09-02 Thread AJ Chen
Matt, This is great! It would be very useful to Nutch developers if your code could be shared. I'm sure quite a few applications will benefit from it, because it fills a gap between whole-web crawling and single-site (or a handful of sites) crawling. I'll be interested in adapting your plugin

Re: Automating workflow using ndfs

2005-09-02 Thread Anjun Chen
I'm going to make a request in JIRA now. -AJ --- Matt Kangas [EMAIL PROTECTED] wrote: Great! Is there a ticket in JIRA requesting this feature? If not, we should file one and get a few votes in favor of it. AFAIK, that's the process for getting new features into Nutch. On Sep 2,

Re: Automating workflow using ndfs

2005-08-31 Thread Doug Cutting
I assume that in most NDFS-based configurations the production search system will not serve directly out of NDFS. Rather, indexes will be created offline for a deployment (i.e., merged to create one index per search node), then copied out of NDFS to the local filesystem on a production search
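
For concreteness, here is roughly what that deployment step looks like in code. NDFS is the ancestor of what became the Hadoop filesystem layer, so this sketch stands in the later org.apache.hadoop.fs.FileSystem API for it (an assumption on my part; the 2005-era NutchFileSystem API differed), with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Deployment step sketched in the thread: pull a merged index out of
 *  the distributed filesystem onto a search node's local disk. */
public class IndexDeployer {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf); // distributed FS (NDFS's descendant)

        // Paths are illustrative: a merged per-node index in the DFS,
        // and the local directory the search server reads from.
        Path merged = new Path("/indexes/node-0/merged");
        Path local  = new Path("/opt/search/index");

        // Copy the index out of the DFS; the searcher then opens it
        // from the local filesystem, not over the network.
        dfs.copyToLocalFile(merged, local);
    }
}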

Automating workflow using ndfs

2005-08-28 Thread Jay Lorenzo
I'm pretty new to Nutch, but in reading through the mailing lists and other papers, I don't think I've really seen any discussion on using NDFS with respect to automating the end-to-end workflow for data that is going to be searched (fetch-index-merge-search). The few crawler designs I'm familiar with
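
To make the end-to-end workflow concrete, here is a bare-bones driver in the spirit of the post. The bin/nutch subcommand names and segment paths are placeholders (the exact command set varies by Nutch version); the point is the shape of the automation: run each stage to completion and abort the pipeline on failure:

import java.io.IOException;
import java.util.List;

/** Minimal sketch of a fetch-index-merge pipeline driver.
 *  Stage commands are hypothetical; each stage must exit
 *  successfully before the next one starts. */
public class CrawlPipeline {
    static void run(List<String> command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command)
                .inheritIO()   // forward stage output to this console
                .start();
        if (p.waitFor() != 0) {
            throw new RuntimeException("Stage failed: " + command);
        }
    }

    public static void main(String[] args) throws Exception {
        run(List.of("bin/nutch", "fetch", "segments/20050828"));
        run(List.of("bin/nutch", "index", "segments/20050828"));
        run(List.of("bin/nutch", "merge", "indexes/merged", "segments"));
        // Final step: push the merged index out to the search nodes,
        // e.g., the NDFS-to-local copy discussed earlier in the thread.
    }
}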