Andrzej,

Thanks for the feedback. I've pulled in your
recommendation but I'm still not getting it right. I'm a
newbie when it comes to trying this out myself :)

Here is my code:

<%@ page
contentType="text/xml; charset=UTF-8"
pageEncoding="UTF-8"

  import="javax.servlet.*"
  import="javax.servlet.http.*"
  import="java.io.*"
  import="java.util.*"
  import="java.net.*"

  import="org.apache.nutch.html.Entities"
  import="org.apache.nutch.searcher.*"
  import="org.apache.nutch.plugin.*"
  import="org.apache.nutch.clustering.*"
  import="org.apache.nutch.util.NutchConf"


%><%

NutchBean bean = NutchBean.get(application);

// set the character encoding to use when interpreting request values
request.setCharacterEncoding("UTF-8");

bean.LOG.info("OpenSearch query request from " + request.getRemoteAddr());

// get query from request
String queryString = request.getParameter("query");
if (queryString == null) queryString = "";

// first hit to display
int start = 0;
int startPage = 0;
String startString = request.getParameter("start");
if (startString != null) start = Integer.parseInt(startString);

// number of hits to display
int hitsPerPage = 10;
String hitsString = request.getParameter("hitsPerPage");
if (hitsString != null) hitsPerPage = Integer.parseInt(hitsString);

// max hits per site
int hitsPerSite = 2;
String hitsPerSiteString = request.getParameter("hitsPerSite");
if (hitsPerSiteString != null) hitsPerSite = Integer.parseInt(hitsPerSiteString);

Query query = Query.parse(queryString);
bean.LOG.info("OpenSearch query: " + queryString);

// perform query

Hits hits = bean.search(query, start + hitsPerPage, hitsPerSite);

// Last hit in the page (inclusive index)
int end = start + hitsPerPage - 1;
if (end > hits.getLength() - 1) end = hits.getLength() - 1;

// Total length in the page
int length = 0;

// use <= so that a page holding exactly one hit (start == end) is still shown
if (start <= end)
    length = end - start + 1;

bean.LOG.info("total hits: " + hits.getTotal());

%><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" version="2.0">
<%
  // To prevent the character encoding declared with the 'contentType' page
  // directive from being overridden by JSTL (apache i18n), we freeze it
  // by flushing the output buffer.
  // see http://java.sun.com/developer/technicalArticles/Intl/MultilingualJSP/
  out.flush();
%>
<channel>
    <title>Mozdex.com: Open Search Engine</title>
    <link>http://www.mozdex.com/open.jsp?query=<%=Entities.encode(queryString)%></link>
    <description>Search for <%=Entities.encode(queryString)%> via Mozdex.com</description>
    <language>en-us</language>
    <copyright>Copyright(c) 2005 Small Productions</copyright>
    <openSearch:totalResults><%=new Long(hits.getTotal())%></openSearch:totalResults>
    <openSearch:startIndex><%=new Long(start)%></openSearch:startIndex>
    <openSearch:itemsPerPage><%=hitsPerPage%></openSearch:itemsPerPage>
<%
if (length > 0) {

    Hit[] show = hits.getHits(start, length);
    HitDetails[] details = bean.getDetails(show);
    String[] summaries = bean.getSummary(details, query);

    // display the hits
    for (int i = 0; i < length; i++) {

        Hit hit = show[i];
        HitDetails detail = details[i];
        String title = detail.getValue("title");
        String url = detail.getValue("url");
        // String summary = summaries[i].replaceAll("([\t\n\r]|&nbsp;){2,}", " ");
        String summxml = new String(summaries[i].getBytes(), "UTF-8");

        // use url for docs w/o title
        if (title == null || title.equals("")) title = url;
%>
        <item>
            <title><![CDATA[<%=title%>]]></title>
            <link><![CDATA[<%=url%>]]></link>
            <guid isPermaLink="true"><![CDATA[<%=url%>]]></guid>
            <description><![CDATA[<%=summxml%>]]></description>
        </item>
<%
    }
}
%>

</channel>
</rss>
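
On the question of keeping control characters out of the summaries: the following is only a minimal sketch, not Nutch code. It assumes the problem characters are ones that the XML 1.0 `Char` production forbids, and that the text ends up inside a CDATA section as in the JSP above. The class and method names (`XmlSanitizer`, `stripInvalidXmlChars`, `escapeForCdata`) are hypothetical.

```java
public final class XmlSanitizer {
    private XmlSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 forbids; null becomes "". */
    public static String stripInvalidXmlChars(String in) {
        if (in == null) return "";
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) out.appendCodePoint(cp);
            // advance by one or two chars depending on the code point
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Splits "]]>" so the text cannot close an enclosing CDATA section early. */
    public static String escapeForCdata(String in) {
        return in.replace("]]>", "]]]]><![CDATA[>");
    }
}
```

In the page above this would presumably be applied as `escapeForCdata(stripInvalidXmlChars(summaries[i]))` before the value is written into `<description>`. It only removes characters the parser rejects; it does not repair text that was decoded with the wrong charset in the first place.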

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
> 
> > Oh yeah, does anyone have any tips on cleaning up the SUMMARIES so any
> > lingering code, cntrl characters or non XML valid characters don't come
> > through?  The following search causes it to barf:
> >
> > http://www.mozdex.com/open.jsp?query=opensearch
>
> Well, the problem with this particular offending page comes from the
> fact that the original HTML content had a different encoding than
> expected, so some non-latin characters ended up as control characters
> after invalid re-encoding.
>
> But if you ignore this for a moment, the XML error comes from the fact
> that this offending character falls outside the declared encoding, which
> is Latin1.
>
> Is there any particular reason why you use ISO-8859-1 instead of UTF-8?
> I think you need to use the latter in order to properly present
> international content. And then, you need to encode the data that you
> put in the response so that it follows the UTF-8 encoding - whether
> through your servlet container, or by simply calling
> String.getBytes("UTF-8") and writing these to the output...
> 
> -- 
> Best regards,
> Andrzej Bialecki
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
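
To make the `String.getBytes("UTF-8")` suggestion concrete, here is a small sketch under stated assumptions: a Java 7+ runtime (for `StandardCharsets`) and a hypothetical helper class `Utf8Demo`. It shows encoding a string to UTF-8 bytes for a binary output stream. It also shows why wrapping those bytes back into a `String`, as `new String(s.getBytes(), "UTF-8")` does, is at best a no-op and at worst corrupts the text when the platform default charset is not UTF-8.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class Utf8Demo {
    private Utf8Demo() {}

    /** Encodes text to UTF-8 bytes, independent of the platform default charset. */
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Writes text to a binary stream as UTF-8. */
    public static void writeUtf8(OutputStream out, String text) throws IOException {
        out.write(toUtf8(text));
    }

    /**
     * Mimics new String(s.getBytes(), "UTF-8") on a JVM whose default
     * charset is platformDefault (passed explicitly so the effect is repeatable).
     */
    public static String reDecodeAsUtf8(String s, Charset platformDefault) {
        return new String(s.getBytes(platformDefault), StandardCharsets.UTF_8);
    }
}
```

With a UTF-8 default the round trip returns the input unchanged, so it cannot fix anything; with an ISO-8859-1 default a character such as U+00E9 no longer survives it. A `String` is already Unicode internally, so the encoding step only matters at the point where bytes are written to the response.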