Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "ApacheConUs2009MeetUp" page has been changed by KenKrugler. http://wiki.apache.org/nutch/ApacheConUs2009MeetUp?action=diff&rev1=5&rev2=6 -------------------------------------------------- - We were planning to have a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. + We had a "Web Crawler Developer" !MeetUp at this year's [[http://www.us.apachecon.com/c/acus2009/|ApacheCon US]] in Oakland. - Unfortunately the only time slot where people would be around was Thursday night, which wound up conflicting with the Hadoop !MeetUp. + It wound up being an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. - So we're going to have an !UnMeetUp (!MeetDown?) on Wednesday, November 4th from 11am - 1pm. Location is TBD, hopefully we can get some space at the event but might be a lunch meeting :) + == Attendees == + + * Andrzej Bialeki - Apache Nutch + * Thorsten xxx - Apache Droids + * Michael Stack - Formerly with Heritrix, now HBase + * Ken Krugler - Bixo + + == Topics == + + === Roadmaps === + + Nutch - become more component based. + Droids - get more people involved. + + === Sharable Components === + + * robots.txt parsing + * URL normalization + * URL filtering + * Page cleansing + * General purpose + * Specialized + * Sub-page parsing (portlets) + * AJAX-ish page interactions + * Document parsing (via Tika) + * HttpClient (configuration) + * Text similarity + * Mime/charset/language detection + + === Tika === + + * Needs help to become really usable + * Would benefit from large test corpus + * Could do comparison with Nutch parser + * Needs option for direct DOM querying (screen scraping tasks) + * Handles mime & charset detection now (some issues) + * Could be extended to include language detection (wrap other impl) + + === URL Normalization === + + * Includes both domain (www.x.com == x.com), path, and query portions of URL + * Often site-specific rules + * Option to derive rules using URLs to similar documents. + + === AJAX-ish Page Interaction === + + * Not applicable for broad/general crawling + * Can be very important for specific web sites + * Use Selenium or headless Mozilla + + === Component API Issues === + + * Want to avoid using an API that's tied too closely to any implementation. + * One option is to have simple (e.g. URL param) API that takes meta-data. + * Similar to Tika passing in of meta-data. + + === Hosting Options === + + * As part of Nutch - but easy to get lost in Nutch codebase, and can be associated too closely with Nutch. + * As part of Droids - but Droids is both a framework (queue-based) and set of components. + * New sub-project under Lucene TLP - but overhead to set up/maintain, and then confusion between it and Droids. + * Google code - seems like a good short-term solution, to judge level of interest and help shake out issues. + + == Next Steps == + + * Get input from Gordon re Heritrix. Stack to follow up with him. Ideally he'd add his comments to this page. + * Get input from Thorsten on Google code option. If OK as starting point, then Andrzej to set up. + * Make decision about build system (and then move on to code formatting debate :)) + * I'm going to propose ant + maven ant tasks for dependency management. I'm using this with Bixo, and so far it's been pretty good. + * Start contributing code + * Ken will put in robots.txt parser. + + == Original Discussion Topic List == Below are some potential topics for discussion - feel free to add/comment.