Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by KenKrugler:
http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

------------------------------------------------------------------------------
  Below are some potential topics for discussion - feel free to add/comment.

- * Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
+ * Potential synergies between crawler projects - e.g. sharing robots.txt processing code.
- * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite.
+ * How to avoid end-user abuse - webmasters sometimes block crawlers because users configure it to be impolite.
- * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
+ * Politeness vs. efficiency - various options for how to be considered polite, while still crawling quickly.
- * robots.txt processing - current problems with existing implementations
+ * robots.txt processing - current problems with existing implementations
- * Avoiding crawler traps - link farms, honeypots, etc.
+ * Avoiding crawler traps - link farms, honeypots, etc.
- * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
+ * Parsing content - home grown, Neko/TagSoup, Tika, screen scraping
- * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
+ * Search infrastructure - options for serving up crawl results (Nutch, Solr, Katta, others?)
- * Testing challenges - is it possible to unit test a crawler?
+ * Testing challenges - is it possible to unit test a crawler?
- * Fuzzy classification - mime-type, charset, language.
+ * Fuzzy classification - mime-type, charset, language.
- * The future of Nutch, Droids, Heritrix, Bixo, etc.
+ * The future of Nutch, Droids, Heritrix, Bixo, etc.
- * Optimizing for types of crawling - intranet, focused, whole web.
+ * Optimizing for types of crawling - intranet, focused, whole web.