Fwd: Re: 1.6.2 : JCR-2645 XML text extraction in Jackrabbit 1.x accesses external resources

Maxime Bégnis Thu, 22 Jul 2010 04:31:43 -0700

Hello,

I posted these mails on the users mailing list, though they may belong on the development one. I have had no answers so far.


Thanks.

Maxime Bégnis

-------- Original Message --------

Subject: Re: 1.6.2 : JCR-2645 XML text extraction in Jackrabbit 1.x accesses external resources

Date:     Mon, 19 Jul 2010 13:35:44 +0200
From:     Maxime Bégnis <[email protected]>
Reply-To: [email protected]

To: [email protected], Jukka Zitting <[email protected]>




Hi all,

Concerning this problem, we patched jackrabbit-text-extractors/src/main/java/org/apache/jackrabbit/extractor/XMLTextExtractor.java in version 1.6.2 by adding the following after line 88:


reader.setEntityResolver(handler);

This has worked very well for us so far. Could someone tell us whether there is a risk of this breaking something elsewhere?
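
To show in isolation what that one line changes, here is a minimal standalone sketch. It is not Jackrabbit code: the class EntityResolverSketch and its CollectingHandler are hypothetical stand-ins for XMLTextExtractor and ExtractorHandler, and it assumes the handler's resolveEntity() returns an empty InputSource. With the handler registered as the entity resolver, the parser asks it for the external DTD instead of opening the URL itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class EntityResolverSketch {

    /** Hypothetical stand-in for Jackrabbit's ExtractorHandler. */
    static final class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Collect character data, as a text extractor would.
            text.append(ch, start, length);
        }

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Answer every external entity request (including the DTD named
            // in the DOCTYPE) with an empty document, so the parser never
            // tries to fetch the system identifier.
            return new InputSource(new StringReader(""));
        }
    }

    public static String extract(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        CollectingHandler handler = new CollectingHandler();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        // The line under discussion: without it, resolveEntity() above is
        // never consulted and the parser resolves the DTD URL on its own.
        reader.setEntityResolver(handler);
        reader.parse(new InputSource(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE reference SYSTEM"
            + " \"http://docs.oasis-open.org/dita/dtd/reference.dtd\">\n"
            + "<reference><title>Hello</title></reference>";
        System.out.println(extract(xml));
    }
}
```

One thing to keep in mind with this approach: because the DTD is replaced by an empty document, any entities or default attribute values declared in the real DTD are not available during parsing.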


Attached to this mail is the patched XMLTextExtractor.java we are using.

Thanks.

Maxime Bégnis

On 08/07/2010 17:34, Maxime Bégnis wrote:

Hi all,

We upgraded our application to use the 1.6.2 version because we had
problems indexing XML files referencing an external DTD in their
DOCTYPEs (more specifically DITA files). The issue is stated to be fixed
in this version, but we still have the same problem: the following
warning is printed several times in the log:

WARN org.apache.jackrabbit.core.query.lucene.TextExtractorJob:132 -
Exception while indexing binary property: java.io.FileNotFoundException:
http://docs.oasis-open.org/dita/dtd/reference.dtd

I'm wondering if we are doing something wrong somewhere. Any help would
be appreciated.

Bug reference : JCR-2645

These are the Jackrabbit and Jackrabbit-related jars we're using. I've
put an asterisk on those we replaced with newer versions than the ones
in the Jackrabbit 1.6.2 WAR distribution:

commons-codec-1.4.jar (*)
commons-collections-3.2.jar (*)
commons-fileupload-1.2.1.jar
commons-io-1.4.jar
concurrent-1.3.4.jar
derby-10.4.jar (*)
fontbox-0.1.0.jar
jackrabbit-api-1.6.2.jar
jackrabbit-core-1.6.2.jar
jackrabbit-jcr-commons-1.6.2.jar
jackrabbit-spi-1.6.2.jar
jackrabbit-spi-commons-1.6.2.jar
jackrabbit-text-extractors-1.6.2.jar
jcr-1.0.jar
jempbox-0.2.0.jar
log4j-1.2.15.jar (*)
lucene-core-2.4.1.jar
nekohtml-1.9.7.jar
pdfbox-0.7.3.jar
poi-3.2-FINAL.jar
poi-scratchpad-3.2-FINAL.jar
slf4j-api-1.5.8.jar (*)
slf4j-log4j12-1.5.8.jar (*)
xercesImpl-2.9.1.jar (*)
xml-apis-ext-1.3.04.jar (*)

Maxime Bégnis

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.jackrabbit.extractor;

import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Text extractor for XML documents. This class extracts the text content
 * and attribute values from XML documents.
 * <p>
 * This class can handle any XML-based format
 * (<code>application/xml+something</code>), not just the base XML content
 * types reported by {@link #getContentTypes()}. However, it often makes
 * sense to use more specialized extractors that better understand the
 * specific content type.
 */
public class XMLTextExtractor extends AbstractTextExtractor {

    /**
     * Logger instance.
     */
    private static final Logger logger =
        LoggerFactory.getLogger(XMLTextExtractor.class);

    /**
     * Creates a new <code>XMLTextExtractor</code> instance.
     */
    public XMLTextExtractor() {
        super(new String[]{"text/xml", "application/xml"});
    }

    //-------------------------------------------------------< TextExtractor >

    /**
     * Returns a reader for the text content of the given XML document.
     * Returns an empty reader if the given encoding is not supported or
     * if the XML document could not be parsed.
     *
     * @param stream XML document
     * @param type XML content type
     * @param encoding character encoding, or <code>null</code>
     * @return reader for the text content of the given XML document,
     *         or an empty reader if the document could not be parsed
     * @throws IOException if the XML document stream can not be closed
     */
    public Reader extractText(InputStream stream, String type, String encoding)
            throws IOException {
        try {
            CharArrayWriter writer = new CharArrayWriter();
            ExtractorHandler handler = new ExtractorHandler(writer);

            // TODO: Use a pull parser to avoid the memory overhead
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.setEntityResolver(handler);

            // It is unspecified whether the XML parser closes the stream when
            // done parsing. To ensure that the stream gets closed just once,
            // we prevent the parser from closing it by catching the close()
            // call and explicitly close the stream in a finally block.
            InputSource source = new InputSource(new FilterInputStream(stream) {
                public void close() {
                }
            });
            if (encoding != null) {
                try {
                    Charset.forName(encoding);
                    source.setEncoding(encoding);
                } catch (Exception e) {
                    logger.warn("Unsupported encoding '{}', using default ({}) instead.",
                            new Object[]{encoding, System.getProperty("file.encoding")});
                }
            }
            reader.parse(source);

            return new CharArrayReader(writer.toCharArray());
        } catch (ParserConfigurationException e) {
            logger.warn("Failed to extract XML text content", e);
            return new StringReader("");
        } catch (SAXException e) {
            logger.warn("Failed to extract XML text content", e);
            return new StringReader("");
        } finally {
            stream.close();
        }
    }

}

<<attachment: maxime.vcf>>