Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

------------------------------------------------------------------------------
- We're planning to have a "Web Crawler Developer" !MeetUp at this year's ApacheCon US in Oakland.
+ We're planning to have a "Web Crawler Developer" !MeetUp at this year's [http://www.us.apachecon.com/c/acus2009/ ApacheCon US] in Oakland.

  Tentative plan is for Thursday evening, November 5th. The actual schedule for !MeetUps is [http://wiki.apache.org/apachecon/ApacheMeetupsUs09 here].

@@ -11, +11 @@
   * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
   * robots.txt processing - current problems with existing implementations
   * Avoiding crawler traps - link farms, honeypots, etc.
-  * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
+  * Parsing content - home grown, Neko/!TagSoup, Tika, screen scraping
   * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
   * Testing challenges - is it possible to unit test a crawler?
   * Fuzzy classification - mime-type, charset, language.