DO NOT REPLY [Bug 36341] New: - Invalid UTF-8 error for non-ASCII meta data

bugzilla Wed, 24 Aug 2005 09:17:48 -0700

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=36341>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.


http://issues.apache.org/bugzilla/show_bug.cgi?id=36341

           Summary: Invalid UTF-8 error for non-ASCII meta data
           Product: Lenya
           Version: Trunk
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: critical
          Priority: P2
         Component: Miscellaneous
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


Josias Thoeny:


I have seen this error.
I tried to debug this some time ago, but I didn't find out much.

To reproduce it, go to the site area of the default publication. Select
e.g. the "Document Type Examples" page, copy it, and insert it
somewhere.

Here is the relevant part of the stacktrace that I'm getting:

java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:169)
        at org.apache.lenya.cms.metadata.MetaDataImpl.<init>(MetaDataImpl.java:82)
        at org.apache.lenya.cms.metadata.LenyaMetaData.<init>(LenyaMetaData.java:74)
        at org.apache.lenya.cms.metadata.MetaDataManager.getLenyaMetaData(MetaDataManager.java:80)
        at org.apache.lenya.cms.metadata.MetaDataManager.replaceMetaData(MetaDataManager.java:151)
        at org.apache.lenya.cms.repository.RepositoryManagerImpl.copy(RepositoryManagerImpl.java:40)
        ... 100 more
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
        at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
        at org.apache.lenya.cms.metadata.MetaDataImpl.getDocument(MetaDataImpl.java:260)
        at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:139)
        ... 105 more

Here is another way to get the same problem:
- Create a document with an umlaut in the navigation title (the umlaut
will be written into the sitetree).
- Perform an operation which changes the sitetree (e.g. create another
document)

I get the following stacktrace:
<snip/>
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
        at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
        at org.apache.lenya.cms.site.tree.DefaultSiteTree.<init>(DefaultSiteTree.java:83)
        ... 51 more


It seems the problem occurs when Lenya reads a non-ASCII character from a
file and saves it again using DocumentHelper/SourceUtil. If the special
character comes from a web form, it seems to be saved correctly.
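
For illustration only, here is a minimal standalone sketch (not Lenya code) of
why a file saved in a single-byte encoding can fail a strict UTF-8 read. The
class name Utf8Check and the sample string are made up, and StandardCharsets
needs Java 7 or later. In ISO-8859-1 the character "ä" is the single byte
0xE4, which a UTF-8 decoder treats as the lead byte of a 3-byte sequence; the
ASCII byte that follows is not a valid continuation byte. That is consistent
with the "Invalid byte 2 of 3-byte UTF-8 sequence" message, but it does not
prove that this is what happens in Lenya.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    /** Returns true only if the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Report malformed input instead of silently replacing it.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical navigation title containing an umlaut.
        String title = "M\u00E4rz";

        // Same text, two encodings: 4D E4 72 7A versus 4D C3 A4 72 7A.
        byte[] latin1 = title.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8:      " + isValidUtf8(utf8));
    }
}
```

The ISO-8859-1 bytes are rejected and the UTF-8 bytes are accepted, so a file
written in the wrong encoding would only fail on the next read.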

Can anyone else reproduce this?

I wonder if it might have something to do with the following code in
SourceUtil.java, around line 195:

    ....
    OutputStream oStream = source.getOutputStream();
    Writer writer = new OutputStreamWriter(oStream);
    DocumentHelper.writeDocument(document, writer);
    ....

The OutputStreamWriter is constructed without a charset, so it uses the
platform default encoding and does not respect the encoding of the source.
But that's just a guess; I'm actually not sure whether it's a reading or a
writing problem.
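
If the guess about the writer is right, one possible direction is to pass the
charset to OutputStreamWriter explicitly. The following is a standalone sketch
rather than a patch for SourceUtil.java: the class ExplicitEncodingWriter and
its write helper are hypothetical, it writes a plain string instead of calling
DocumentHelper.writeDocument, and it assumes the target file is meant to be
UTF-8. It needs Java 8 or later for UncheckedIOException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitEncodingWriter {

    /** Encodes text through an OutputStreamWriter bound to the given charset. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Closing the writer flushes any buffered characters into 'out'.
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><title>M\u00E4rz</title>";

        // Explicit charset: the bytes match the encoding declared in the XML.
        byte[] explicit = write(xml, StandardCharsets.UTF_8);

        // What new OutputStreamWriter(out) would use: the platform default.
        byte[] platform = write(xml, Charset.defaultCharset());

        System.out.println("Platform default charset: " + Charset.defaultCharset());
        System.out.println("Default output equals UTF-8 output: "
                + Arrays.equals(explicit, platform));
    }
}
```

On a JVM whose default charset is a single-byte encoding such as ISO-8859-1,
the second line prints false, meaning the file would declare UTF-8 but contain
non-UTF-8 bytes.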

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

