Hi,

I'm trying to create indexes on a custom Cocoon 2.0.1 webapp. Currently my
/search/ directory is a copy of the example provided, but when I try to
create an index using create-index.xsp I get the following debug message:

DEBUG   (2002-02-11) 23:06.55:344   [core] (/search/create) 
Ajp13Processor[8009][1]/SimpleLuceneXMLIndexerImpl: Ignoring 
http://foo.internal.luminas.co.uk:80/welcome.html?cocoon-view=content (text/xml; 
charset=iso-8859-1)

As far as I can tell, the Lucene indexer is ignoring the page because it
has a content type of "text/xml; charset=iso-8859-1" and not just
"text/xml". I've hacked Cocoon to accept a content type and an encoding,
but it doesn't seem like the right way to do it.
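
To illustrate what I think is happening, here is a minimal stand-alone
sketch (not the actual Cocoon code). The class name and the contents of
the list are assumptions on my part; the only thing taken from the
indexer is the exact-match test allowedContentType.contains(contentType).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo; not part of Cocoon.
class ContentTypeMatchDemo {
    // Assumption: the indexer's list holds bare media types such as "text/xml".
    static final List<String> ALLOWED = Arrays.asList("text/xml");

    // Exact string comparison, as in allowedContentType.contains(contentType).
    static boolean isAllowed(String contentType) {
        return contentType != null && ALLOWED.contains(contentType);
    }

    public static void main(String[] args) {
        // Bare media type: matches the list entry.
        System.out.println(isAllowed("text/xml"));
        // Same media type with a charset parameter: no exact match.
        System.out.println(isAllowed("text/xml; charset=iso-8859-1"));
    }
}
```

If that assumption holds, any page served with a charset parameter would
be skipped, which matches the "Ignoring" message above.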

So, questions:

1) Is there any way to tell Cocoon not to append the charset to the
Content-Type header?

2) If there isn't, it looks like the Lucene bits will need modifying to
either add to the allowedContentType or to do more parsing of the
contentType string. Any thoughts on the best way to do this? I've attached
patches that do this, but it's probably not the most elegant solution...
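
For comparison, a minimal sketch of the "more parsing" option, written as
a stand-alone helper rather than as a change to Cocoon. The class and
method names are hypothetical; it simply splits the header at the first
";" and looks for a charset parameter.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative and not part of Cocoon.
class ContentTypeParser {

    /** Returns the media type without parameters (e.g. "text/xml"), or null. */
    static String mediaType(String header) {
        if (header == null) {
            return null;
        }
        int semi = header.indexOf(';');
        String type = (semi < 0) ? header : header.substring(0, semi);
        // Media types are case-insensitive, so normalise to lower case.
        type = type.trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? null : type;
    }

    /** Returns the value of the charset parameter, or null if there is none. */
    static String charset(String header) {
        if (header == null) {
            return null;
        }
        String[] parts = header.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = "text/xml; charset=iso-8859-1";
        System.out.println(mediaType(header));
        System.out.println(charset(header));
    }
}
```

With something like this, the indexer could compare only the media type
against allowedContentType and leave that list unchanged.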


Andrew.

-- 
Andrew Savory                                Email: [EMAIL PROTECTED]
Managing Director                              Tel:  +44 (0)20 8553 6622
Luminas Internet Applications                  Fax:  +44 (0)870 28 47489
This is not an official statement or order.    Web:    www.luminas.co.uk



Index: SimpleCocoonCrawlerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/crawler/SimpleCocoonCrawlerImpl.java,v
retrieving revision 1.6
diff -u -r1.6 SimpleCocoonCrawlerImpl.java
--- SimpleCocoonCrawlerImpl.java        4 Feb 2002 12:22:21 -0000       1.6
+++ SimpleCocoonCrawlerImpl.java        12 Feb 2002 15:11:37 -0000
@@ -82,6 +82,7 @@
 import java.util.Iterator;
 import java.util.List;
 import java.util.ArrayList;
+import java.util.StringTokenizer;
 
 /**
  * A simple cocoon crawler.
@@ -485,6 +486,15 @@
             BufferedReader br = new BufferedReader(new InputStreamReader(is));
 
             String content_type = links_url_connection.getContentType();
+            // split Content-Type header into media type and parameters
+            if (content_type != null) {
+                StringTokenizer st = new StringTokenizer(content_type, ";");
+                if (st.hasMoreTokens()) {
+                    // keep only the media type, e.g. "text/xml"
+                    content_type = st.nextToken().trim();
+                }
+                // any charset parameter after the ';' is ignored here
+            }
             if (getLogger().isDebugEnabled()) {
                 getLogger().debug("Content-type: " + content_type);
             }
Index: SimpleLuceneXMLIndexerImpl.java
===================================================================
RCS file: /home/cvspublic/xml-cocoon2/src/java/org/apache/cocoon/components/search/SimpleLuceneXMLIndexerImpl.java,v
retrieving revision 1.5
diff -u -r1.5 SimpleLuceneXMLIndexerImpl.java
--- SimpleLuceneXMLIndexerImpl.java     4 Feb 2002 12:31:09 -0000       1.5
+++ SimpleLuceneXMLIndexerImpl.java     12 Feb 2002 15:12:03 -0000
@@ -64,6 +64,7 @@
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
+import java.util.StringTokenizer;
 
 import javax.xml.parsers.*;
 
@@ -179,6 +180,15 @@
                 + CONTENT_QUERY);
             URLConnection contentURLConnection = contentURL.openConnection();
             String contentType = contentURLConnection.getContentType();
+            // split Content-Type header into media type and parameters
+            if (contentType != null) {
+                StringTokenizer st = new StringTokenizer(contentType, ";");
+                if (st.hasMoreTokens()) {
+                    // keep only the media type, e.g. "text/xml"
+                    contentType = st.nextToken().trim();
+                }
+                // any charset parameter after the ';' is ignored here
+            }
             if (contentType != null &&
                     allowedContentType.contains(contentType)) {
 