Bruce,

I had a similar problem a year ago... I needed very specific crawling and
data mining for a particular business case. I decided to use a database, and
(thanks to the Nutch developers!) I was able to rewrite everything within a
week.

My first approach was to modify the parse-html plugin so that it writes the
URL path and query parameters directly to a database, along with specific
'tokens' such as product name, price, etc.
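The database write itself is out of scope here, but the first step - splitting a crawled URL into its path and query parameters - can be sketched in plain Java. The class and method names below are mine for illustration, not Nutch's:

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: split a crawled URL into the path and query
// parameters that the modified parse-html plugin would store.
public class UrlParts {

    // Parse "a=1&b=2" into an ordered map; values are kept raw (undecoded).
    public static Map<String, String> queryParams(URI url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = url.getRawQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        URI url = URI.create("http://shop.example.com/catalog/item?id=42&cat=books");
        System.out.println(url.getPath());    // /catalog/item
        System.out.println(queryParams(url)); // {id=42, cat=books}
    }
}
```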

What I found:
- the performance of the database (such as Oracle or PostgreSQL) is the main
bottleneck
- I needed to 'mine' everything in memory and minimize file read/write
operations (minimize HDD I/O, and use pure Java)

Some ideas that may be useful:
- use statistics (how many anchors with similar text point to the same page
over a period of time) to infer the 'category' of the information, such as
'product category', 'subcategory', or 'manufacturer'
- define a 'dynamic' crawl, e.g. re-crawl 'frequently-queried' pages more
often
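The anchor-statistics idea above amounts to a frequency count over (anchor text, target page) pairs. A minimal sketch, with a data model of my own invention:

```java
import java.util.HashMap;
import java.util.Map;

// Count how often a given anchor text points at a given target page.
// A high count for a (text, page) pair suggests the text is a stable
// label such as a manufacturer or product category.
public class AnchorStats {
    private final Map<String, Integer> counts = new HashMap<>();

    private static String key(String anchorText, String targetUrl) {
        return anchorText.toLowerCase().trim() + " -> " + targetUrl;
    }

    public void record(String anchorText, String targetUrl) {
        counts.merge(key(anchorText, targetUrl), 1, Integer::sum);
    }

    public int count(String anchorText, String targetUrl) {
        return counts.getOrDefault(key(anchorText, targetUrl), 0);
    }

    public static void main(String[] args) {
        AnchorStats stats = new AnchorStats();
        stats.record("Canon", "http://shop.example.com/brand/canon");
        stats.record("canon ", "http://shop.example.com/brand/canon");
        System.out.println(stats.count("CANON", "http://shop.example.com/brand/canon")); // 2
    }
}
```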

I think the existing Nutch is very 'generic', with a lot of plugins such as
parse-mpg, parse-pdf, etc. It repeats the logic/functionality of Google...
'Anchor text is the true subject of a page!' - hence Google bombing...

So, if what you want is a 'data mining' engine, I believe just creating an
additional Nutch plugin is not enough: you have to define additional classes
('Outlink' has no notion of query parameters, etc.), and you need to define
a datastore - the existing WebDB interface is not enough. You would
effectively need to rewrite Nutch, and there are no suitable 'extension
points' for that...


If you only need HTML crawling/mining, focus on that...


-----Original Message-----
From: bruce 
Sent: Saturday, June 24, 2006 2:40 AM
To: [email protected]
Cc: [EMAIL PROTECTED]
Subject: RE: nutch - functionality..


hi fuad,

it looks like you're looking at what i'm trying to do as though it's for a
search engine... it's not...

i'm looking to create a crawler to extract specific information. as such, i
need to emulate some of the function of a crawler. i also need to implement
other functionality that's apparently not in the usual spider/crawler
function. being able to selectively iterate/follow through forms (GET/POST)
in a recursive manner is a requirement. as is being able to selectively
define which form elements i'm going to use when i do the crawling....

of course this approach is only possible because i have prior knowledge of
the structure of the site before crawling it...

-bruce



-----Original Message-----
From: Fuad Efendi 
Sent: Friday, June 23, 2006 8:28 PM
To: [email protected]
Subject: RE: nutch - functionality..


Nutch is plugin-based, similar to Eclipse.
You can extend Nutch's functionality; browse the src/plugin/parse-html
source folder as a sample. You can modify the Java code so that it handles
'POST' from forms (Outlink class instances). (I am only familiar with
v0.7.1; the new version of Nutch is significantly richer.)
The parse-html plugin is the easiest starting point...

I don't see any reason why a search engine should return pages found via
POSTed forms; the <A href="...">PageFound</A> link on a Nutch search-results
end-user screen has no way to express a POST.

There is only one case where it matters: the response may provide new
Outlink instances, such as the response from the 'Search' page of an
E-Commerce site... And most probably such 'second-level' outlinks are
reachable via GET; a sample would be a 'Search' page with POST on any
E-Commerce site...
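Following a form via POST mostly comes down to URL-encoding the selected field values into an application/x-www-form-urlencoded request body. A minimal sketch in plain Java (no Nutch classes involved; actually sending the body, e.g. via HttpURLConnection, is omitted):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Encode selected form fields into an application/x-www-form-urlencoded
// body, the format a POSTed HTML form sends.
public class FormBody {

    public static String encode(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("q", "digital camera");
        fields.put("sort", "price");
        System.out.println(encode(fields)); // q=digital+camera&sort=price
    }
}
```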



-----Original Message-----
From: bruce
Subject: nutch - functionality..


hi...

i might be a little out of my league.. but here goes...

i'm in need of an app to crawl through sections of sites, and to return
pieces of information. i'm not looking to do any indexing, just returning
raw html/text...

however, i need the ability to set certain criteria to help define the
actual pages that get returned...

a given crawling process would normally start at some URL and iteratively
fetch files underneath that URL. nutch does this, as well as providing some
additional functionality.

i need more functionality....

in particular, i'd like to be able to modify the way nutch handles forms,
and links/queries on a given page.

i'd like to be able to:

for forms:
 allow the app to handle POST/GET forms
 allow the app to select (implement/ignore) given
  elements within a form
 track the FORM(s) for a given URL/page/level of the crawl

for links:
 allow the app to either include/exclude a given link
  for a given page/URL via regex parsing or list of
  URLs
 allow the app to handle querystring data, ie
  to include/exclude the URL+Query based on regex
  parsing or simple text comparison
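The include/exclude rules for links described above map naturally onto java.util.regex. A hedged sketch, in the style of Nutch's own regex-urlfilter rules (the rule format and class names here are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// First-match-wins include/exclude filter applied to the full URL
// (path plus query string), in the style of Nutch's regex-urlfilter.
public class LinkFilter {
    private static class Rule {
        final boolean include;
        final Pattern pattern;
        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // sign is '+' to include matching URLs, '-' to exclude them.
    public void addRule(char sign, String regex) {
        rules.add(new Rule(sign == '+', Pattern.compile(regex)));
    }

    // Returns the decision of the first matching rule; URLs that
    // match no rule are excluded.
    public boolean accept(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) return r.include;
        }
        return false;
    }

    public static void main(String[] args) {
        LinkFilter f = new LinkFilter();
        f.addRule('-', "sessionid=");  // drop session-tracking links
        f.addRule('+', "/catalog/");   // keep catalog pages
        System.out.println(f.accept("http://x/catalog/item?id=1"));          // true
        System.out.println(f.accept("http://x/catalog/item?sessionid=abc")); // false
    }
}
```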

data extraction:
 ability to do xpath/regex extraction based on the DOM
 permit multiple xpath/regex functions to be run on a
  given page
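For the xpath extraction step, the JDK's built-in DOM and XPath support is enough to prototype against well-formed markup (real-world HTML usually needs a tolerant parser first, such as the one the parse-html plugin wraps). A minimal sketch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Evaluate an XPath expression against a well-formed document and
// return the string value of the result.
public class Extract {
    public static String xpath(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xp = XPathFactory.newInstance().newXPath();
            return xp.evaluate(expr, doc);
        } catch (Exception e) {
            throw new RuntimeException("extraction failed", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><span class='price'>19.99</span></body></html>";
        System.out.println(xpath(page, "//span[@class='price']/text()")); // 19.99
    }
}
```

multiple expressions can simply be run against the same parsed document in a loop, which covers the "multiple xpath/regex functions per page" requirement.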


this kind of functionality would allow the 'nutch' function to be relatively
selective regarding the ability to crawl through a site and extract the
required information....

any thoughts/comments/ideas/etc.. regarding this process.

if i shouldn't use nutch, are there any suggestions as to what app i should
use.

thanks

-bruce








_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
