Since we have such a strange plugin structure (DI? IoC?), and many utility classes with a single UNIX shell script to run everything...

1. Separate concerns. Clearly.
   - Crawl
   - Parse
   - Generate URL List
   - Crawl
   - ...
   (The interfaces of WebDB should be clearer, so we can use databases, etc.)

1a. Data Mining (finding new language constructs)

2. Automate Classification
   - Anchor text is the true subject of a page
   - A page contains anchors
   - Anchor text is the class of the referenced pages
   Sample: the page "Network Cards" has referenced pages of its own. The page
   "Computer Hardware" has a link with the anchor text "Network Cards", so the
   page behind that link is classed as "Network Cards".

3. Data Mining (???)
   - String Tokenization
   - Sentence
   - Human Language
   - AJAX, Red Rouge, Opteron, Break Barrel, Caviar, The Jacobian Conjecture, ...
   - different language constructs for different sites?

Nothing "Agile". A lot has changed in trunk, such as 'Link' and 'WebDB'; it simplifies...

Thanks

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
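
To make point 2 (anchor text as the class of the referenced page) a bit more concrete, here is a rough sketch of the idea. It is not Nutch code: it uses only the Python standard library, the names `AnchorCollector` and `classify_by_anchor_text` are made up for illustration, and the example.com URLs are placeholders. It simply collects, for every link target, the anchor texts that point at it.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects (absolute target URL, anchor text) pairs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Only text inside an open <a href=...> counts as anchor text.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._text).split())
            if text:
                self.links.append((self._href, text))
            self._href = None


def classify_by_anchor_text(pages):
    """Map each referenced URL to the set of anchor texts pointing at it.

    `pages` is a dict of {page URL: HTML source} (an assumed input shape).
    """
    classes = defaultdict(set)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.links:
            classes[target].add(text)
    return dict(classes)


# The "Computer Hardware" page links to another page as "Network Cards".
pages = {
    "http://example.com/hardware.html":
        '<h1>Computer Hardware</h1><a href="/nics.html">Network Cards</a>',
}
labels = classify_by_anchor_text(pages)
print(labels)
```

A real version would have to cope with many different anchor texts for the same target (for example by counting them and keeping the most frequent ones), but the data flow stays the same: the label comes from the linking page, not from the page being classified.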
