Andrzej, thanks for the feedback. I've pulled in your recommendation but I'm still not getting it right. I'm a newbie when it comes to trying this out myself :)
Here is my code:

<%@ page
  contentType="text/xml; charset=UTF-8"
  pageEncoding="UTF-8"
  import="javax.servlet.*"
  import="javax.servlet.http.*"
  import="java.io.*"
  import="java.util.*"
  import="java.net.*"
  import="org.apache.nutch.html.Entities"
  import="org.apache.nutch.searcher.*"
  import="org.apache.nutch.plugin.*"
  import="org.apache.nutch.clustering.*"
  import="org.apache.nutch.util.NutchConf"
%><%
  NutchBean bean = NutchBean.get(application);

  // set the character encoding to use when interpreting request values
  request.setCharacterEncoding("UTF-8");

  bean.LOG.info("OpenSearch query request from " + request.getRemoteAddr());

  // get query from request
  String queryString = request.getParameter("query");
  if (queryString == null)
    queryString = "";

  // first hit to display
  int start = 0;
  int startPage = 0;
  String startString = request.getParameter("start");
  if (startString != null)
    start = Integer.parseInt(startString);

  // number of hits to display
  int hitsPerPage = 10;
  String hitsString = request.getParameter("hitsPerPage");
  if (hitsString != null)
    hitsPerPage = Integer.parseInt(hitsString);

  // max hits per site
  int hitsPerSite = 2;
  String hitsPerSiteString = request.getParameter("hitsPerSite");
  if (hitsPerSiteString != null)
    hitsPerSite = Integer.parseInt(hitsPerSiteString);

  Query query = Query.parse(queryString);
  bean.LOG.info("OpenSearch query: " + queryString);

  // perform query
  Hits hits = bean.search(query, start + hitsPerPage, hitsPerSite);

  // Last hit in the page
  int end = start + hitsPerPage - 1;
  if (end > hits.getLength() - 1)
    end = hits.getLength() - 1;

  // Total length in the page
  int length = 0;
  if (start < end)
    length = end - start + 1;

  bean.LOG.info("total hits: " + hits.getTotal());
%><?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/" version="2.0">
<%
  // To prevent the character encoding declared with 'contentType' page
  // directive from being overridden by JSTL (apache i18n), we freeze it
  // by flushing the output buffer.
  // see http://java.sun.com/developer/technicalArticles/Intl/MultilingualJSP/
  out.flush();
%>
<channel>
  <title>Mozdex.com: Open Search Engine</title>
  <link>http://www.mozdex.com/open.jsp?query=<%=Entities.encode(queryString)%></link>
  <description>Search for <%=Entities.encode(queryString)%> via Mozdex.com</description>
  <language>en-us</language>
  <copyright>Copyright(c) 2005 Small Productions</copyright>
  <openSearch:totalResults><%=new Long(hits.getTotal())%></openSearch:totalResults>
  <openSearch:startIndex><%=new Long(start)%></openSearch:startIndex>
  <openSearch:itemsPerPage><%=hitsPerPage%></openSearch:itemsPerPage>
<%
  if (length > 0) {
    Hit[] show = hits.getHits(start, length);
    HitDetails[] details = bean.getDetails(show);
    String[] summaries = bean.getSummary(details, query);

    // display the hits
    for (int i = 0; i < length; i++) {
      Hit hit = show[i];
      HitDetails detail = details[i];
      String title = detail.getValue("title");
      String url = detail.getValue("url");
      // String summary = summaries[i].replaceAll("([\t\n\r]| ){2,}", " ");
      String summxml = new String(summaries[i].getBytes(),"UTF-8");

      // use url for docs w/o title
      if (title == null || title.equals(""))
        title = url;
%>
  <item>
    <title><![CDATA[<%=title%>]]></title>
    <link><![CDATA[<%=url%>]]></link>
    <guid isPermaLink="true"><![CDATA[<%=url%>]]></guid>
    <description><![CDATA[<%=summxml%>]]></description>
  </item>
<%
    }
  }
%>
</channel>
</rss>

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Byron Miller wrote:
>
> > Oh yeah, does anyone have any tips on cleaning up the SUMMARIES so any
> > lingering code, cntrl characters or non XML valid characters don't come
> > through?
> > The following search causes it to barf:
> >
> > http://www.mozdex.com/open.jsp?query=opensearch
>
> Well, the problem with this particular offending page comes from the
> fact that the original HTML content had a different encoding than
> expected, so some non-latin characters ended up as control characters
> after invalid re-encoding.
>
> But if you ignore this for a moment, the XML error comes from the fact
> that this offending character falls outside the declared encoding, which
> is Latin1.
>
> Is there any particular reason why you use ISO-8859-1 instead of UTF-8?
> I think you need to use the latter in order to properly present
> international content. And then, you need to encode the data that you
> put in the response so that it follows the UTF-8 encoding - whether
> through your servlet container, or by simply calling
> String.getBytes("UTF-8") and writing these to the output...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
> _______________________________________________
> Nutch-general mailing list
> Nutch-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nutch-general
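
On the question of cleaning up the summaries: a minimal sketch of one possible approach, assuming a reasonably modern JDK, is to drop every character that the XML 1.0 Char production does not allow before the text goes into the feed, and to encode to UTF-8 with an explicit charset. The class and method names below (XmlSanitizer, stripInvalidXmlChars, escapeCdata, toUtf8) are made up for illustration and are not part of Nutch. This only removes characters that are illegal in XML; it does not repair text that was already decoded with the wrong encoding at fetch time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
final class XmlSanitizer {

    private XmlSanitizer() {}

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with every code point that XML 1.0 forbids removed.
    // Unpaired surrogates fall in 0xD800-0xDFFF and are dropped as well.
    public static String stripInvalidXmlChars(String in) {
        if (in == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Splits any "]]>" so the text can be placed inside a CDATA section
    // without terminating it early.
    public static String escapeCdata(String in) {
        return in.replace("]]>", "]]]]><![CDATA[>");
    }

    // Encodes with an explicit charset instead of the platform default.
    public static byte[] toUtf8(String in) {
        return in.getBytes(StandardCharsets.UTF_8);
    }
}
```

In the JSP above, this could be applied to each summary before it is written, for example XmlSanitizer.escapeCdata(XmlSanitizer.stripInvalidXmlChars(summaries[i])). Note also that new String(s.getBytes(), "UTF-8") encodes with the platform default charset and then decodes those bytes as UTF-8, so it only round-trips cleanly when the default charset happens to be UTF-8; a Java String is already Unicode, and the encoding step belongs where the response bytes are written.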