I would like to propose adding support for IBM Lotus Domino web sites. These sites are not uncommon, especially on U.S. Government and Fortune 1000 intranets.
Problem: Although the Domino application server produces content as HTML (by default), that content is often not easily indexed by crawlers because it is dynamic and frequently reachable only through queries rather than HREFs. In addition, Domino URLs often carry parameters using the "?" syntax, and crawlers tend to ignore such URLs because they are traditionally used for CGI access. A further issue is that Domino sites often do not contain complete meta tag information, both because internal searching by author, keywords, etc. is done via the Domino engine and because Domino developers are often remiss in taking the extra steps required to produce the meta tags.
Proposed solution: Produce a crawler that accesses a particular "view" (a Domino list) that site developers would create for use by Nutch (or any other engine that wanted to use it). The view would provide the meta information, title, and link for each document (page) that they wanted indexed. The name of this view could be communicated to the crawler in the robots file. This entry would tell the crawler not only where the view is located, but also that the site is a Domino site and should be processed as one (current versions of Domino do not add any identifying information to the HTML they produce, for security reasons). If the crawler discovers this entry, it would process the site by accessing the specified view as XML, parsing the XML, iterating through each entry, retrieving the metadata and the content, and storing the information in the search index. This would be the piece I would undertake to produce (although I have not yet looked at the current system code to see what is involved; hopefully it would not be beyond my capacity).
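To make the idea more concrete, here is a rough sketch in Java of the two crawler-side steps: finding the view in the robots file and reading the entries out of the view's XML. It is only an illustration under several assumptions and is not based on the existing Nutch code. The robots directive name "Domino-Index-View" and the column names "Title" and "Keywords" are made up for the example, and the viewentry/entrydata element names are what I understand Domino returns when a view is requested as XML (the ?ReadViewEntries URL command), which should be checked against a real server.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DominoViewSketch {

    /** Hypothetical robots file directive; the proposal does not fix a name. */
    static final String DIRECTIVE = "Domino-Index-View:";

    /** Returns the view path named in the robots file, or null if the directive is absent. */
    static String findViewPath(String robotsTxt) {
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // Case-insensitive prefix match on the directive name.
            if (trimmed.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                return trimmed.substring(DIRECTIVE.length()).trim();
            }
        }
        return null;
    }

    /**
     * Parses view XML into one map per entry. Each map holds the entry's
     * "unid" attribute plus one key per column (the entrydata "name" attribute).
     * The element names are assumed, not confirmed against a live server.
     */
    static List<Map<String, String>> parseView(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The XML comes from a remote site, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<Map<String, String>> entries = new ArrayList<>();
            NodeList viewEntries = doc.getElementsByTagName("viewentry");
            for (int i = 0; i < viewEntries.getLength(); i++) {
                Element entry = (Element) viewEntries.item(i);
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("unid", entry.getAttribute("unid"));
                NodeList columns = entry.getElementsByTagName("entrydata");
                for (int j = 0; j < columns.getLength(); j++) {
                    Element column = (Element) columns.item(j);
                    fields.put(column.getAttribute("name"), column.getTextContent().trim());
                }
                entries.add(fields);
            }
            return entries;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse view XML", e);
        }
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDomino-Index-View: /site.nsf/NutchIndex\n";
        String xml = "<viewentries><viewentry unid=\"A1\">"
                + "<entrydata name=\"Title\"><text>Home</text></entrydata>"
                + "<entrydata name=\"Keywords\"><text>intro</text></entrydata>"
                + "</viewentry></viewentries>";
        System.out.println(findViewPath(robots));
        System.out.println(parseView(xml));
    }
}
```

In a real crawler each parsed entry would then be used to fetch the document itself and to hand its title and meta information to the indexer; that part depends on the Nutch internals and is left out here.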
I am not sure what the next steps would be, so if this is of interest, I would appreciate someone saying "OK, so now do XXX". I have subscribed to the developer list.
Regards,
Scott McIntosh
Technical Director
ICF Consulting
[EMAIL PROTECTED]
http://www.icfconsulting.com
