> 
> Chris Mattmann wrote:
> > Complete agree here: RSS makes a lot of sense since it's really a
> > standard for representing a "list of items" and their respective
> > metadata. This list of items is basically the main product of the
> > search engine, i.e., it's list of results. I think that if by default
> > search.jsp produced RSS output, rather than HTML, Nutch would also be
> > more attractive as an API to plug in to, and Nutch could be one of the
> > standard components in some system of systems architecture.
> 
> Can someone draft a specification of what the Nutch RSS output should
> look like?  I think this should be based on what search.jsp currently
> produces.  The output should not be internationalized, rather that can
> be done by stylesheets.
> 
> In particular, what's needed beyond RSS and A9's OpenSearch extensions?
> Under <channel/> we need navigation urls, for next-page, show-all-hits,
> clustering, etc.  Under <item/> we need urls for cache, explain, and
> more-from-site.  Is there more?
> 
> Once we have some agreement about what should be returned, 
> then we need a volunteer to implement it!
> 
> Doug
> 

I went through search.jsp, tried to replicate its data and all of its
functionality, and came up with the following. Does it look sufficient
(and correct)?

There are a few possible issues that I can see with it:

1) Should "itemsPerPage" be the number of items requested or the number
returned?

2) MSN search's RSS format
(http://search.msn.com/results.aspx?q=query&format=rss) includes a
pubDate. This may be the date MSN last retrieved that URL; it isn't
documented anywhere, but that is the only reading that seems to make
sense. Does Nutch store that information, and is it a good idea to
include it?

3) What namespace URL should the "nutch" namespace use?

4) I thought it might be a good idea to pass a "format" parameter so
that it could support other formats (e.g. Atom) in the future. Again, MSN
does the same thing (format=rss or format=xml). Interestingly,
"format=blah" is ignored, but format=atom causes an error.

5) Yahoo's REST based search API
(http://developer.yahoo.net/web/V1/webSearch.html) uses its own XML
format. I don't see many advantages in using that (or something similar)
for Nutch, but I could be missing something.

6) It could be argued that some of the RSS extensions here (in
particular nutch:nextPageUrl and nutch:prevPageUrl) would be useful to
standardize across search engines and so should not be in the nutch
namespace. If anyone thinks that is important then they could be moved.
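
To make issues 1 and 4 a bit more concrete, here is a minimal sketch of how the request parameters might be read. It is only an illustration, not existing Nutch code: the class SearchParams, its method names, and the choice to reject unknown "format" values (rather than ignore them as MSN does) are all assumptions. The itemsPerPage logic follows the clustering expression suggested in the draft.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: reads the search request parameters.
final class SearchParams {
    // Only RSS 2.0 is supported in this sketch; Atom could be added later.
    enum Format { RSS20 }

    static final int DEFAULT_HITS_PER_PAGE = 10;

    // Requested page size; falls back to the default when the parameter
    // is missing, not a number, or not positive.
    static int hitsPerPage(Map<String, String> params) {
        String raw = params.get("hitsPerPage");
        if (raw == null) {
            return DEFAULT_HITS_PER_PAGE;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            return n > 0 ? n : DEFAULT_HITS_PER_PAGE;
        } catch (NumberFormatException e) {
            return DEFAULT_HITS_PER_PAGE;
        }
    }

    // Value for openSearch:itemsPerPage, read as "items requested":
    // hitsToCluster when clustering is active, otherwise hitsPerPage.
    static int itemsPerPage(boolean clusteringAvailable, String clustering,
                            int hitsToCluster, int hitsPerPage) {
        return (clusteringAvailable && "yes".equals(clustering))
                ? hitsToCluster : hitsPerPage;
    }

    // A missing format defaults to RSS 2.0; an unknown one is rejected
    // (assumption: failing loudly is preferable to silently ignoring it).
    static Format format(Map<String, String> params) {
        String raw = params.get("format");
        if (raw == null || raw.isEmpty()) {
            return Format.RSS20;
        }
        switch (raw.toLowerCase(Locale.ROOT)) {
            case "rss20":
                return Format.RSS20;
            default:
                throw new IllegalArgumentException("Unsupported format: " + raw);
        }
    }
}
```

If "itemsPerPage" should instead mean the number of items actually returned, the caller would pass the size of the result page rather than the requested value.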

<?xml version="1.0"?>
<rss version="2.0"
     xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
     xmlns:nutch="http://www.nutch.org/some-url/">
  <channel>
    <title>Nutch: search term</title>
    <link>http://baseurl/xmlsearch?q=search%20term&amp;format=rss20</link>
    <description>Search results for "search term"</description>
    <language>en-us</language>
    <copyright>&amp;copy; current year, Nutch.</copyright>

    <openSearch:totalResults>hits.getTotal()</openSearch:totalResults>
    <openSearch:startIndex>start from number</openSearch:startIndex>
    <!-- Default to 10. No max currently defined. Perhaps should use logic like
         (clusteringAvailable && clustering.equals("yes") ? hitsToCluster : hitsPerPage) -->
    <openSearch:itemsPerPage>request.getParameter("hitsPerPage")</openSearch:itemsPerPage>

    <nutch:searchTerm>search term</nutch:searchTerm>
    <!-- default to 2 -->
    <nutch:hitsPerSite>2</nutch:hitsPerSite>
    <!-- default to no -->
    <nutch:clustering>yes/no</nutch:clustering>
    <nutch:nextPageUrl>xmlsearch?q=search%20term&amp;format=rss20&amp;hitsPerPage=10&amp;start=101&amp;clustering=yes&amp;hitsPerSite=2</nutch:nextPageUrl>
    <nutch:prevPageUrl>xmlsearch?q=search%20term&amp;format=rss20&amp;hitsPerPage=10&amp;start=91&amp;clustering=yes&amp;hitsPerSite=2</nutch:prevPageUrl>

    <item>
      <title>detail.getValue("title") or url if title null or blank</title>
      <link>detail.getValue("url")</link>
      <description>summary</description>
      <nutch:cachedUrl>http://baseurl/cached.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:cachedUrl>
      <nutch:explainUrl>http://baseurl/explain.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:explainUrl>
      <nutch:anchorsUrl>http://baseurl/anchors.jsp?idx=hit.getIndexNo()&amp;id=hit.getIndexDocNo()</nutch:anchorsUrl>
      <nutch:moreFromSiteUrl>http://baseurl/xmlsearch?q=URLEncoder.encode("site:" + hit.getSite() + " " + queryString)&amp;format=rss20&amp;hitsPerPage=10&amp;start=91&amp;clustering=yes&amp;hitsPerSite=0</nutch:moreFromSiteUrl>
    </item>
  </channel>
</rss>
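
For the navigation elements, a rough sketch of how nutch:nextPageUrl and nutch:prevPageUrl could be generated is below. It is not existing Nutch code: the class PagingUrls and its methods are hypothetical, the clustering and hitsPerSite parameters are left out for brevity, and clamping the previous page at start=0 is my assumption. The main point is that the "&" separators in the URLs need to be escaped as "&amp;" once they are placed inside XML.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: builds paging URLs for the feed.
final class PagingUrls {
    // Relative search URL with raw '&' separators (not yet XML-safe).
    static String searchUrl(String query, int start, int hitsPerPage) {
        // URLEncoder does form encoding, so a space becomes '+' rather
        // than "%20"; both decode to a space in a query string.
        // Note: the (String, Charset) overload requires Java 10 or later.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "xmlsearch?q=" + q
                + "&format=rss20"
                + "&hitsPerPage=" + hitsPerPage
                + "&start=" + start;
    }

    // Escapes the characters that are unsafe in XML element content.
    // '&' must be replaced first so the other entities are not re-escaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Element pointing at the page that starts hitsPerPage hits later.
    static String nextPageElement(String query, int start, int hitsPerPage) {
        String url = searchUrl(query, start + hitsPerPage, hitsPerPage);
        return "<nutch:nextPageUrl>" + escapeXml(url) + "</nutch:nextPageUrl>";
    }

    // Element pointing at the previous page; the start offset is clamped
    // at 0 (assumption: offsets are zero-based and never negative).
    static String prevPageElement(String query, int start, int hitsPerPage) {
        int prevStart = Math.max(0, start - hitsPerPage);
        String url = searchUrl(query, prevStart, hitsPerPage);
        return "<nutch:prevPageUrl>" + escapeXml(url) + "</nutch:prevPageUrl>";
    }
}
```

A caller would presumably omit the prevPageUrl element on the first page and the nextPageUrl element on the last one, which this sketch does not handle.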

Regards
  Nick Lothian


_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
