Bruce, I had a similar problem a year ago... I needed very specific crawling and data mining. I decided to use a database, and I was able to rewrite everything within a week (thanks to the Nutch developers!) to implement a very specific business case.

My first approach was to modify the parse-html plugin; it writes the 'path' and 'query params' directly to a database, along with some specific 'tokens' such as product name, price, etc. What I found:
- The performance of the database (such as Oracle or PostgreSQL) is the main bottleneck.
- I need to 'mine' everything in memory and minimize file read/write operations (minimize HDD I/O, and use pure Java).

I had some (maybe useful) ideas:
- Using statistics (how many anchors with similar text point to the same page over a period of time), define the 'category' of info, such as 'product category', 'subcategory', 'manufacturer'.
- Define a 'dynamic' crawl, e.g. frequently re-crawl 'frequently-queried' pages.

I think existing Nutch is very 'generic', with a lot of plugins such as 'parse-mpg', 'parse-pdf', ... It repeats the logic/functionality of Google... 'Anchor text is the true subject of a page!' - Google Bombing...

So, if it is a 'Data Mining' engine, I believe just creating an additional plugin for Nutch is not enough: you have to define additional classes ('Outlink' does not have 'query parameters' functionality, etc.), and you need to define a datastore (the existing WebDB interface is not enough). You will need to rewrite Nutch... And there are no suitable 'extension points'... If you need just HTML crawl/mining - focus on it...

-----Original Message-----
From: bruce
Sent: Saturday, June 24, 2006 2:40 AM
To: [email protected]
Cc: [EMAIL PROTECTED]
Subject: RE: nutch - functionality..

hi fuad,

it looks like you're looking at what i'm trying to do as though it's for a search engine... it's not... i'm looking to create a crawler to extract specific information. as such, i need to emulate some of the functions of a crawler. i also need to implement other functionality that's apparently not in the usual spider/crawler function. being able to selectively iterate/follow through forms (GET/POST) in a recursive manner is a requirement.
as is being able to selectively define which form elements i'm going to use when i do the crawling.... of course, this approach is only possible because i have casual knowledge of the structure of the site prior to crawling it...

-bruce

-----Original Message-----
From: Fuad Efendi
Sent: Friday, June 23, 2006 8:28 PM
To: [email protected]
Subject: RE: nutch - functionality..

Nutch is plugin-based, similar to Eclipse. You can extend Nutch functionality; just browse the src/plugin/parse-html source folder as a sample. You can modify the Java code so that it will handle 'POST' from forms (Outlink class instances). (I am well familiar with v0.7.1; the new version of Nutch is significantly richer.) The easiest starting point is the parse-html plugin...

I don't see any reason why a search engine should return a list of pages found including POSTs of forms, and the <A href="...">PageFound</A> link (on a Nutch search results end-user screen) does not have POST functionality. Only one case: the response may provide new Outlink instances, such as the response from the 'Search' pages of e-commerce sites... And most probably such 'second-level' outlinks are reachable via GET; a sample is the 'Search' page with POST on any e-commerce site...

-----Original Message-----
From: bruce
Subject: nutch - functionality..

hi...

i might be a little out of my league.. but here goes... i'm in need of an app to crawl through sections of sites, and to return pieces of information. i'm not looking to do any indexing, just returning raw html/text... however, i need the ability to set certain criteria to help define the actual pages that get returned... a given crawling process would normally start at some URL, and iteratively fetch files underneath that URL. nutch does this as well as providing some additional functionality. i need more functionality.... in particular, i'd like to be able to modify the way nutch handles forms, and links/queries on a given page.
i'd like to be able to:

for forms:
- allow the app to handle POST/GET forms
- allow the app to select (implement/ignore) given elements within a form
- track the FORM(s) for a given URL/page/level of the crawl

for links:
- allow the app to either include/exclude a given link for a given page/URL via regex parsing or a list of URLs
- allow the app to handle querystring data, ie to include/exclude the URL+query based on regex parsing or simple text comparison

data extraction:
- ability to do xpath/regex extraction based on the DOM
- permit multiple xpath/regex functions to be run on a given page

this kind of functionality would allow the 'nutch' function to be relatively selective regarding the ability to crawl through a site and extract the required information....

any thoughts/comments/ideas/etc.. regarding this process? if i shouldn't use nutch, are there any suggestions as to what app i should use?

thanks

-bruce

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
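Bruce's "for links" criteria above boil down to regex-based include/exclude filtering over the URL plus its query string. The following is a minimal standalone Java sketch of that idea; the class and method names (`UrlQueryFilter`, `accept`, `filter`) are hypothetical and are not part of Nutch's actual plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of regex include/exclude filtering over URL + query string.
// Not the real Nutch URL-filtering code; names are invented for illustration.
public class UrlQueryFilter {
    private final Pattern include;
    private final Pattern exclude;

    public UrlQueryFilter(String includeRegex, String excludeRegex) {
        this.include = Pattern.compile(includeRegex);
        this.exclude = Pattern.compile(excludeRegex);
    }

    /** Accept a URL (querystring included) only if it matches the include
     *  pattern and does not match the exclude pattern. */
    public boolean accept(String urlWithQuery) {
        return include.matcher(urlWithQuery).find()
            && !exclude.matcher(urlWithQuery).find();
    }

    /** Keep only the URLs that pass accept(). */
    public List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String u : urls) {
            if (accept(u)) {
                kept.add(u);
            }
        }
        return kept;
    }
}
```

For example, `new UrlQueryFilter("/product", "sessionid=")` would keep product pages while dropping any URL whose query string carries a session token, which is the kind of include/exclude-by-regex control Bruce asks for.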

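Fuad's point about 'second-level' outlinks from e-commerce search forms can also be sketched: pull each GET form's action and its input fields out of the HTML and synthesize crawlable URLs from them. This is a toy regex-based illustration only, assuming well-formed, double-quoted attributes; the class name `FormOutlinkExtractor` is invented, and real Nutch parsing goes through the parse-html plugin's DOM, not regexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's Outlink/parse-html code: turn each
// <form method="get" action="..."> plus its <input name=... value=...>
// fields into a GET URL that a crawler could follow as an outlink.
public class FormOutlinkExtractor {

    private static final Pattern FORM = Pattern.compile(
        "<form[^>]*method=\"get\"[^>]*action=\"([^\"]+)\"[^>]*>(.*?)</form>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern INPUT = Pattern.compile(
        "<input[^>]*name=\"([^\"]+)\"[^>]*value=\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Build one URL per GET form: action + inputs as query parameters. */
    public static List<String> extract(String html) {
        List<String> outlinks = new ArrayList<>();
        Matcher form = FORM.matcher(html);
        while (form.find()) {
            StringBuilder url = new StringBuilder(form.group(1));
            Matcher input = INPUT.matcher(form.group(2));
            char sep = '?';
            while (input.find()) {
                url.append(sep).append(input.group(1))
                   .append('=').append(input.group(2));
                sep = '&';
            }
            outlinks.add(url.toString());
        }
        return outlinks;
    }
}
```

A POST form would need the parameters carried separately from the URL (which is exactly why, as the thread notes, Nutch's Outlink class is not enough on its own), but for GET forms the synthesized URL is directly fetchable.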