hi...

i might be a little out of my league, but here goes...

i'm in need of an app to crawl through sections of sites and return
pieces of information. i'm not looking to do any indexing, just to return
raw html/text...

however, i need the ability to set certain criteria that define which
pages actually get returned...

a given crawling process would normally start at some URL and iteratively
fetch the files underneath it. nutch does this, as well as providing some
additional functionality.

i need more functionality....

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (include/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl
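
as a rough illustration of what i mean (this is just a stdlib python sketch,
not nutch code -- the page markup and the ignored field name are made up):

```python
# Parse the forms on a page, drop individual fields, and build the
# request that would submit the rest -- the form handling described above.
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormParser(HTMLParser):
    """Collects each <form> with its method, action, and named inputs."""
    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"method": attrs.get("method", "get").lower(),
                             "action": attrs.get("action", ""),
                             "fields": {}}
        elif tag == "input" and self._current is not None:
            name = attrs.get("name")
            if name:
                self._current["fields"][name] = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def build_submission(form, ignore=()):
    """Drop the ignored fields, then return (method, action, encoded body)."""
    fields = {k: v for k, v in form["fields"].items() if k not in ignore}
    return form["method"], form["action"], urlencode(fields)

page = """<form method="post" action="/search">
            <input name="q" value="nutch">
            <input name="tracking_id" value="xyz">
          </form>"""
parser = FormParser()
parser.feed(page)
method, action, body = build_submission(parser.forms[0],
                                        ignore={"tracking_id"})
```

the point being: once the forms are tracked per page, deciding which
fields to submit (or whether to submit at all) becomes a policy hook.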

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison
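
the link rules above could look something like this sketch (the patterns,
URLs, and rule names are all assumptions for illustration, not nutch's
actual URLFilter plugin API):

```python
# Include/exclude links by regex or by an explicit URL list, and filter
# on the query string separately -- the link handling described above.
import re
from urllib.parse import urlsplit

INCLUDE = [re.compile(r"^https?://example\.com/docs/")]
EXCLUDE = [re.compile(r"\.(?:jpg|png|pdf)$")]
BLOCKED_URLS = {"https://example.com/docs/private"}
QUERY_EXCLUDE = [re.compile(r"(?:^|&)sessionid=")]

def accept(url):
    """True if the crawler should follow this URL."""
    if url in BLOCKED_URLS:
        return False
    if not any(p.search(url) for p in INCLUDE):
        return False
    if any(p.search(url) for p in EXCLUDE):
        return False
    # query-string data is checked on its own, per the requirement above
    query = urlsplit(url).query
    if any(p.search(query) for p in QUERY_EXCLUDE):
        return False
    return True
```

splitting the URL-level rules from the query-string rules keeps the two
concerns independently configurable per page/URL.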

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
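
running several extractors over one page could be as simple as this
sketch (xml.etree only supports a limited XPath subset and needs
well-formed markup; a real crawler would use a proper HTML parser --
the sample page and expressions here are made up):

```python
# Run multiple xpath and regex extractors against a single page and
# collect all their matches -- the data extraction described above.
import re
import xml.etree.ElementTree as ET

def extract(page, xpaths=(), regexes=()):
    """Run every xpath and regex against the page; return all matches."""
    results = {}
    root = ET.fromstring(page)
    for xp in xpaths:
        results[xp] = [el.text for el in root.findall(xp)]
    for rx in regexes:
        results[rx] = re.findall(rx, page)
    return results

page = """<html><body>
            <h1>Nutch notes</h1>
            <p class="price">$10</p>
            <p class="price">$25</p>
          </body></html>"""
out = extract(page,
              xpaths=[".//p[@class='price']"],
              regexes=[r"\$\d+"])
```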


this kind of functionality would let nutch be fairly selective about
which pages of a site it crawls and what information it extracts...

any thoughts/comments/ideas/etc. regarding this process?

if i shouldn't use nutch, are there any suggestions as to what app i should
use?

thanks

-bruce
