> > Chris Mattmann wrote:
> > Complete agree here: RSS makes a lot of sense since it's really a
> > standard for representing a "list of items" and their respective
> > metadata. This list of items is basically the main product of the
> > search engine, i.e., it's list of results. I think that if by default
> > search.jsp produced RSS output, rather than HTML, Nutch would also be
> > more attractive as an API to plug in to, and Nutch could be one of the
> > standard components in some system of systems architecture.
>
> Can someone draft a specification of what the Nutch RSS output should
> look like? I think this should be based on what search.jsp currently
> produces. The output should not be internationalized, rather that can
> be done by stylesheets.
>
> In particular, what's needed beyond RSS and A9's OpenSearch extensions?
> Under <channel/> we need navigation urls, for next-page,
> show-all-hits, clustering, etc. Under <item/> we need urls for cache,
> explain, and more-from-site. Is there more?
>
> Once we have some agreement about what should be returned, then we
> need a volunteer to implement it!
>
> Doug
I went through search.jsp, tried to replicate its data and all of its functionality, and came up with the draft below. Does it look sufficient (and correct)?

There are a few possible issues that I can see with it:

1) Should "itemsPerPage" be the number of items requested or the number returned?

2) MSN Search's RSS format (http://search.msn.com/results.aspx?q=query&format=rss) includes a pubDate. This may be the date MSN last retrieved that URL (it isn't documented anywhere, but it is the only interpretation that seems to make sense). Does Nutch store that information, and is it a good idea to include it?

3) What namespace URL should the "nutch" namespace use?

4) I thought it might be a good idea to pass a "format" parameter so that other formats (e.g. Atom) could be supported in the future. Again, MSN does the same thing (format=rss or format=xml). Interestingly, "format=blah" is ignored, but format=atom causes an error.

5) Yahoo's REST-based search API (http://developer.yahoo.net/web/V1/webSearch.html) uses its own XML format. I don't see many advantages in using that (or something similar) for Nutch, but I could be missing something.

6) It could be argued that some of the RSS extensions here (in particular nutch:nextPageUrl and nutch:prevPageUrl) would be useful to standardize across search engines and so should not be in the nutch namespace. If anyone thinks that is important, they could be moved.
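To make points 4 and 6 a bit more concrete, here is a rough sketch of how the paging URLs might be built. It is not taken from search.jsp: the class and method names are made up for illustration, and it assumes "start" is a zero-based offset and that a missing next or previous page is signalled by null. It also escapes the "&" separators, which is needed before the URLs can be placed inside an XML element.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper (not part of Nutch) that builds the paging URLs for the RSS output. */
final class PageUrlSketch {

    /** Builds a relative search URL using the parameter names from the draft. */
    static String searchUrl(String query, String format, int hitsPerPage, int start) {
        try {
            return "xmlsearch?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&format=" + URLEncoder.encode(format, "UTF-8")
                    + "&hitsPerPage=" + hitsPerPage
                    + "&start=" + start;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** URL of the next page, or null when the current page is the last one. */
    static String nextPageUrl(String query, String format, int hitsPerPage,
                              int start, long totalHits) {
        int next = start + hitsPerPage;
        return next < totalHits ? searchUrl(query, format, hitsPerPage, next) : null;
    }

    /** URL of the previous page, or null when already on the first page. */
    static String prevPageUrl(String query, String format, int hitsPerPage, int start) {
        if (start <= 0) {
            return null;
        }
        return searchUrl(query, format, hitsPerPage, Math.max(0, start - hitsPerPage));
    }

    /** Escapes the characters that are not allowed in XML element content. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String next = nextPageUrl("search term", "rss20", 10, 90, 250);
        if (next != null) {
            System.out.println("<nutch:nextPageUrl>" + escapeXml(next) + "</nutch:nextPageUrl>");
        }
    }
}
```

Note that URLEncoder encodes a space as "+" rather than "%20"; both forms are commonly accepted in a query string. Whether itemsPerPage should echo the requested hitsPerPage or the number of hits actually returned (point 1) is left open here.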
<?xml version="1.0"?>
<rss version="2.0"
     xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
     xmlns:nutch="http://www.nutch.org/some-url/">
  <channel>
    <title>Nutch: search term</title>
    <link>http://baseurl/xmlsearch?q=search%20term&amp;format=rss20</link>
    <description>Search results for "search term"</description>
    <language>en-us</language>
    <copyright>&#169; current year, Nutch.</copyright>
    <openSearch:totalResults>hits.getTotal()</openSearch:totalResults>
    <openSearch:startIndex>start from number</openSearch:startIndex>
    <!-- Defaults to 10; no maximum is currently defined. Perhaps this should use logic like
         (clusteringAvailable && clustering.equals("yes") ? hitsToCluster : hitsPerPage). -->
    <openSearch:itemsPerPage>request.getParameter("hitsPerPage")</openSearch:itemsPerPage>
    <nutch:searchTerm>search term</nutch:searchTerm>
    <!-- Defaults to 2. -->
    <nutch:hitsPerSite>2</nutch:hitsPerSite>
    <!-- yes/no; defaults to no. -->
    <nutch:clustering>yes/no</nutch:clustering>
    <nutch:nextPageUrl>xmlsearch?q=search%20&amp;format=rss20&amp;hitsPerPage=10&amp;start=101&amp;clustering=yes&amp;hitsPerSite=2</nutch:nextPageUrl>
    <nutch:prevPageUrl>xmlsearch?q=search%20&amp;format=rss20&amp;hitsPerPage=10&amp;start=91&amp;clustering=yes&amp;hitsPerSite=2</nutch:prevPageUrl>
    <item>
      <title>detail.getValue("title") or url if title null or blank</title>
      <link>detail.getValue("url")</link>
      <description>summary</description>
      <nutch:cachedUrl>http://baseurl/cached.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:cachedUrl>
      <nutch:explainUrl>http://baseurl/explain.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:explainUrl>
      <nutch:anchorsUrl>http://baseurl/anchors.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:anchorsUrl>
      <nutch:moreFromSiteUrl>http://baseurl/xmlsearch?q=URLEncoder.encode("site:" + hit.getSite() + " " + queryString)&amp;format=rss20&amp;hitsPerPage=10&amp;start=91&amp;clustering=yes&amp;hitsPerSite=0</nutch:moreFromSiteUrl>
    </item>
  </channel>
</rss>

Regards
Nick Lothian

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
