Your file isn't well-formed XML. Use a UTF-8-aware editor and things should
work out fine.

In UTF-8, the basic Latin alphabet characters are encoded as one byte each,
whereas some others, for example åäö, are encoded as two bytes. The first of
these bytes signals to the parser that the following byte belongs to the same
character. The error you see occurs when the parser encounters an illegal
byte sequence. Most probably you have used an editor that produces non-UTF-8
byte sequences.
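
To make the byte counts concrete, here is a minimal sketch using only the
standard Java library. The class and method names (Utf8Bytes, utf8Length,
isValidUtf8) are made up for illustration. It also shows one way the exact
error message can arise, assuming the file actually contains ISO-8859-1
bytes: ä is the single byte 0xE4 there, which a UTF-8 decoder reads as the
first byte of a 3-byte sequence, and the following ö (0xF6) is not a legal
second byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {

    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String umlauts = "\u00e4\u00f6\u00fc";

        System.out.println(utf8Length("abc"));    // prints 3: one byte per letter
        System.out.println(utf8Length(umlauts));  // prints 6: two bytes per umlaut

        // The same text as ISO-8859-1 is the three bytes E4 F6 FC.
        byte[] latin1 = umlauts.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length);        // prints 3
        System.out.println(isValidUtf8(latin1));  // prints false
        System.out.println(isValidUtf8(umlauts.getBytes(StandardCharsets.UTF_8))); // prints true
    }
}
```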

Regards Erik

-----Original Message-----
From: [EMAIL PROTECTED]
To: dom4j-user@lists.sourceforge.net
Sent: 2005-09-22 23:29
Subject: [dom4j-user] Invalid byte 2 of 3-byte UTF-8 sequence

Hi,

I have a problem with UTF-8.
I want to add special characters, like German umlauts (äöü) or French
characters (e.g. é), to my XML file.
I can add the characters to an XML file, but when I try to read
the generated file with the SAXParser, an error occurs (Invalid byte
2 of 3-byte UTF-8 sequence); see the end of this message.
The XML file looks fine in my editor (UltraEdit), but my
System.out trace shows this: äöü for äöü.
With ISO-8859-1 everything works fine.
Where is my mistake?

I searched the dom4j FAQ, the cookbook and the internet for this problem, but
the only thing I found was this:
http://sourceforge.net/mailarchive/message.php?msg_id=10356047
It didn't help...

Thanks for any advice!

Udo Krass

My code:
-------<snip>-------
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.DocumentHelper;
import org.dom4j.Element;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import org.xml.sax.SAXException;

public class Test {

    public static void main(String[] args) {
        File theFile = new File("C:/t.xml");

        File testOutputFile = new File("C:/t.xml");
        /*try {
            parse(theFile);
        } catch (DocumentException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } */
       
        Document document;
        document = DocumentHelper.createDocument();
        Element root = document.addElement( "go" );
        document.getRootElement().add(DocumentHelper.createText("äöü"));
        
        try {
            writeToFile(document,testOutputFile,true);
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public static Document parse(File theFile) throws DocumentException {
        SAXReader reader = new SAXReader();
        Document document = null;
        document = reader.read(theFile);
        return document;
    }

    public static void writeToFile(Document theDocument, File theOutputFile,
            Boolean trace) throws IOException {
        // let's write to a file
        OutputFormat format = OutputFormat.createCompactFormat();
        format.setEncoding("UTF-8");
        format.setNewlines(true);
        format.setIndentSize(2);
        format.setTrimText(false);

        XMLWriter xmlWriter = new XMLWriter(new FileWriter(theOutputFile), format);
        xmlWriter.write(theDocument);
        xmlWriter.flush();

        xmlWriter.close();

        if (trace) {
            // print the document to System.out
            xmlWriter = new XMLWriter(System.out, format);
            format.setEncoding("UTF-8");
            xmlWriter.write(theDocument);
            xmlWriter.flush();
            xmlWriter.close();
        }
    }
}
-------<snap>-------
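
A note on the code above, offered as a likely explanation rather than a
confirmed diagnosis: new FileWriter(theOutputFile) encodes characters with
the platform's default encoding, and as far as I know
format.setEncoding("UTF-8") only changes the encoding named in the XML
declaration when XMLWriter is handed a Writer. On a system whose default is
not UTF-8, the file would then claim UTF-8 but contain other bytes. The
sketch below uses only the standard library and shows a Writer with an
explicit UTF-8 encoding. The class and method names (Utf8FileWriting,
writeUtf8) are made up for illustration, and such a Writer could be passed
to XMLWriter in place of the FileWriter.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class Utf8FileWriting {

    // Writes the text as UTF-8 bytes, independent of the platform default.
    public static void writeUtf8(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for C:/t.xml.
        File file = File.createTempFile("t", ".xml");
        file.deleteOnExit();

        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<go>\u00e4\u00f6\u00fc</go>";
        writeUtf8(file, xml);

        // Each umlaut takes one extra byte compared with its char count.
        byte[] bytes = Files.readAllBytes(file.toPath());
        System.out.println(bytes.length - xml.length()); // prints 3
    }
}
```

Writing and reading with the same explicitly named encoding avoids any
dependence on the editor or the operating system settings.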

The generated XML file:
-------<snip>-------
<?xml version="1.0" encoding="UTF-8"?>

<go>äöü</go>
-------<snap>-------

This is the System.out output when I uncomment the parse section and
read the file with the SAXParser:
-------<snip>-------
org.dom4j.DocumentException: Error on line 3 of document file:///C:/t.xml : Invalid byte 2 of 3-byte UTF-8 sequence. Nested exception: Invalid byte 2 of 3-byte UTF-8 sequence.
    at org.dom4j.io.SAXReader.read(SAXReader.java:350)
    at org.dom4j.io.SAXReader.read(SAXReader.java:222)
    at Test.parse(Test.java:51)
    at Test.main(Test.java:25)
Nested exception:
org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:236)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:215)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:386)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:316)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1810)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
    at org.dom4j.io.SAXReader.read(SAXReader.java:334)
    at org.dom4j.io.SAXReader.read(SAXReader.java:222)
    at Test.parse(Test.java:51)
    at Test.main(Test.java:25)
Nested exception: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:236)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:215)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:386)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:316)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1810)<?xml version="1.0" encoding="UTF-8"?>

<go>äöü</go>

    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
    at org.dom4j.io.SAXReader.read(SAXReader.java:334)
    at org.dom4j.io.SAXReader.read(SAXReader.java:222)
    at Test.parse(Test.java:51)
    at Test.main(Test.java:25)
-------<snap>-------







