Andrzej,
Thanks for the feedback. I've pulled in your
recommendation but I'm still not getting it right. I'm a
newb when it comes to trying this out myself :)
Here is my code:
<%@ page
contentType="text/xml; charset=UTF-8"
pageEncoding="UTF-8"
import="javax.servlet.*"
import="javax.servlet.http.*"
import="java.io.*"
import="java.util.*"
import="java.net.*"
import="org.apache.nutch.html.Entities"
import="org.apache.nutch.searcher.*"
import="org.apache.nutch.plugin.*"
import="org.apache.nutch.clustering.*"
import="org.apache.nutch.util.NutchConf"
%><%
NutchBean bean = NutchBean.get(application);
// set the character encoding to use when interpreting request values
request.setCharacterEncoding("UTF-8");
bean.LOG.info("OpenSearch query request from " + request.getRemoteAddr());
// get query from request
String queryString = request.getParameter("query");
if (queryString == null) queryString = "";
// first hit to display
int start = 0;
int startPage = 0;
String startString = request.getParameter("start");
if (startString != null) start =
Integer.parseInt(startString);
// number of hits to display
int hitsPerPage = 10;
String hitsString =
request.getParameter("hitsPerPage");
if (hitsString != null) hitsPerPage =
Integer.parseInt(hitsString);
// max hits per site
int hitsPerSite = 2;
String hitsPerSiteString =
request.getParameter("hitsPerSite");
if (hitsPerSiteString != null) hitsPerSite =
Integer.parseInt(hitsPerSiteString);
Query query = Query.parse(queryString);
bean.LOG.info("OpenSearch query: " + queryString);
// perform query
Hits hits = bean.search(query, start + hitsPerPage,
hitsPerSite);
// Last hit in the page
int end = start + hitsPerPage - 1;
if (end > hits.getLength() - 1) end = hits.getLength()
- 1;
// Total length in the page
int length = 0;
// start == end means exactly one hit on this page, so use <=
if (start <= end)
length = end - start + 1;
bean.LOG.info("total hits: " + hits.getTotal());
%><?xml version="1.0" encoding="UTF-8"?>
<rss
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
version="2.0">
<%
// To prevent the character encoding declared with the 'contentType' page
// directive from being overridden by JSTL (apache i18n), we freeze it
// by flushing the output buffer.
// see http://java.sun.com/developer/technicalArticles/Intl/MultilingualJSP/
out.flush();
%>
<channel>
<title>Mozdex.com: Open Search Engine</title>
<link>http://www.mozdex.com/open.jsp?query=<%=Entities.encode(queryString)%></link>
<description>Search for <%=Entities.encode(queryString)%> via Mozdex.com</description>
<language>en-us</language>
<copyright>Copyright(c) 2005 Small Productions</copyright>
<openSearch:totalResults><%=new Long(hits.getTotal())%></openSearch:totalResults>
<openSearch:startIndex><%=new Long(start)%></openSearch:startIndex>
<openSearch:itemsPerPage><%=hitsPerPage%></openSearch:itemsPerPage>
<%
if (length > 0) {
Hit[] show = hits.getHits(start, length);
HitDetails[] details = bean.getDetails(show);
String[] summaries = bean.getSummary(details,
query);
// display the hits
for (int i = 0; i < length; i++) {
Hit hit = show[i];
HitDetails detail = details[i];
String title = detail.getValue("title");
String url = detail.getValue("url");
// String summary = summaries[i].replaceAll("([\t\n\r]| ){2,}", " ");
String summxml = new String(summaries[i].getBytes(),"UTF-8");
// use url for docs w/o title
if (title == null || title.equals("")) title = url;
%>
<item>
<title><![CDATA[<%=title%>]]></title>
<link><![CDATA[<%=url%>]]></link>
<guid isPermaLink="true"><![CDATA[<%=url%>]]></guid>
<description><![CDATA[<%=summxml%>]]></description>
</item>
<%
}
}
%>
</channel>
</rss>
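
A note on the summxml line above: Java strings are already Unicode, so
new String(s.getBytes(), "UTF-8") does not convert anything to UTF-8. It
encodes with the platform default charset and then decodes those bytes as
UTF-8, which mangles non-ASCII text whenever the default is not UTF-8. A
minimal sketch of the difference, assuming Java 7 or later for
StandardCharsets; EncodingDemo, roundTrip and toUtf8 are made-up names for
illustration and are not part of Nutch:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
final class EncodingDemo {
    private EncodingDemo() {}

    // Mimics new String(s.getBytes(), "UTF-8"), with the platform default
    // charset passed in explicitly so the effect is reproducible.
    static String roundTrip(String s, Charset platformDefault) {
        return new String(s.getBytes(platformDefault), StandardCharsets.UTF_8);
    }

    // Encoding belongs at the output boundary: turn the chars into UTF-8
    // bytes only when they are written to the response.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With a Latin1 default, roundTrip does not return a string containing
U+00E9 unchanged, while toUtf8 yields the two bytes 0xC3 0xA9. Since the
page directive already declares charset=UTF-8, printing summaries[i]
directly should be enough on the output side.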
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
>
> > Oh yeah, does anyone have any tips on cleaning up the SUMMARIES so any
> > lingering code, cntrl characters or non XML valid characters don't come
> > through? The following search causes it to barf:
> >
> > http://www.mozdex.com/open.jsp?query=opensearch
>
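
For stripping characters that are not legal in XML 1.0 from the summaries,
something along these lines might work. A minimal sketch, assuming Java 5
or later; XmlClean and its methods are hypothetical names, not Nutch API.
It keeps only code points allowed by the XML 1.0 Char production and splits
any "]]>" so the text stays safe inside a CDATA section:

```java
// Hypothetical helper, not part of Nutch.
final class XmlClean {
    private XmlClean() {}

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops characters XML 1.0 forbids (control characters, lone
    // surrogates, U+FFFE/U+FFFF) and splits "]]>" across two CDATA sections.
    static String clean(String s) {
        if (s == null) return "";
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isXmlChar(cp)) sb.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return sb.toString().replace("]]>", "]]]]><![CDATA[>");
    }
}
```

In the page this would be used as
<description><![CDATA[<%=XmlClean.clean(summaries[i])%>]]></description>.
It only removes illegal characters; it does not repair text that was
decoded with the wrong encoding in the first place.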
> Well, the problem with this particular offending page comes from the
> fact that the original HTML content had a different encoding than
> expected, so some non-latin characters ended up as control characters
> after invalid re-encoding.
>
> But if you ignore this for a moment, the XML error comes from the fact
> that this offending character falls outside the declared encoding, which
> is Latin1.
>
> Is there any particular reason why you use ISO-8859-1 instead of UTF-8?
> I think you need to use the latter in order to properly present
> international content. And then, you need to encode the data that you
> put in the response so that it follows the UTF-8 encoding - whether
> through your servlet container, or by simply calling
> String.getBytes("UTF-8") and writing these to the output...
>
> --
> Best regards,
> Andrzej Bialecki
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
> _______________________________________________
> Nutch-general mailing list
> https://lists.sourceforge.net/lists/listinfo/nutch-general