Re: Invalid UTF-8 error for non-ASCII meta data (was: Re: Reality check on 1.4x (long))

Josias Thoeny Wed, 24 Aug 2005 08:01:46 -0700

On Mon, 2005-08-22 at 11:44 +0200, Andreas Hartmann wrote:
> Angelo Turetta wrote:
> 
> [...]
> 
> > In the authoring page, click on 'Document type examples', and watch a 
> > wonderful 'Invalid byte 2 of 3-byte UTF-8 sequence' error. I've found 
> > the cause of the problem: every page that has non-ASCII chars in the 
> > meta information fails almost every operation.
> 
> I can't reproduce this one ...
> How about the others?


I have seen this error.
I tried to debug this some time ago, but I didn't find out much.

To reproduce it, go to the site area of the default publication. Select
e.g. the "Document Type Examples" page, copy it, and insert it
somewhere.

Here is the relevant part of the stacktrace that I'm getting:

java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:169)
        at org.apache.lenya.cms.metadata.MetaDataImpl.<init>(MetaDataImpl.java:82)
        at org.apache.lenya.cms.metadata.LenyaMetaData.<init>(LenyaMetaData.java:74)
        at org.apache.lenya.cms.metadata.MetaDataManager.getLenyaMetaData(MetaDataManager.java:80)
        at org.apache.lenya.cms.metadata.MetaDataManager.replaceMetaData(MetaDataManager.java:151)
        at org.apache.lenya.cms.repository.RepositoryManagerImpl.copy(RepositoryManagerImpl.java:40)
        ... 100 more
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
        at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
        at org.apache.lenya.cms.metadata.MetaDataImpl.getDocument(MetaDataImpl.java:260)
        at org.apache.lenya.cms.metadata.MetaDataImpl.loadValues(MetaDataImpl.java:139)
        ... 105 more

Here is another way to get the same problem:
- Create a document with an umlaut in the navigation title (the umlaut
will be written into the sitetree).
- Perform an operation which changes the sitetree (e.g. create another
document)

I get the following stacktrace:
<snip/>
Caused by: java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
        at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at org.apache.lenya.xml.DocumentHelper.readDocument(DocumentHelper.java:173)
        at org.apache.lenya.cms.cocoon.source.SourceUtil.readDOM(SourceUtil.java:161)
        at org.apache.lenya.cms.site.tree.DefaultSiteTree.<init>(DefaultSiteTree.java:83)
        ... 51 more


It seems the problem occurs when Lenya reads a non-ASCII character from
a file and saves it again using DocumentHelper/SourceUtil. If the special
character comes from a web form, it seems to be saved correctly.
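
A minimal sketch of how a read/save round trip could end in this kind of
error, assuming (this is not confirmed anywhere in this thread) that the
file gets written in a single-byte encoding such as ISO-8859-1 and is
later parsed as UTF-8. In ISO-8859-1 an "ä" is the single byte 0xE4,
which a UTF-8 decoder treats as the lead byte of a 3-byte sequence, so
the ASCII byte that follows it is rejected. The class name, the method
names and the sample title "Bär" are hypothetical; nothing here is Lenya
code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchSketch {

    // Decode bytes as UTF-8 strictly: malformed input raises an exception
    // instead of being silently replaced.
    static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Returns true if the given bytes are not valid UTF-8.
    static boolean failsAsUtf8(byte[] bytes) {
        try {
            decodeStrictUtf8(bytes);
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String title = "B\u00e4r"; // hypothetical navigation title with an umlaut

        // Assumed scenario: saved as ISO-8859-1 (bytes 42 E4 72) ...
        byte[] latin1 = title.getBytes(StandardCharsets.ISO_8859_1);
        // ... versus saved as UTF-8 (bytes 42 C3 A4 72).
        byte[] utf8 = title.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes rejected as UTF-8: " + failsAsUtf8(latin1));
        System.out.println("UTF-8 bytes rejected as UTF-8: " + failsAsUtf8(utf8));
    }
}
```

If this assumption holds, it would also fit the observation that only
files which were read and saved again are affected.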

Can anyone else reproduce this?

I wonder if it might have something to do with the following code in
SourceUtil.java, around line 195:

    ....
    OutputStream oStream = source.getOutputStream();
    Writer writer = new OutputStreamWriter(oStream);
    DocumentHelper.writeDocument(document, writer);
    ....

The OutputStreamWriter is created without an explicit charset, so it
uses the platform's default encoding and does not respect the encoding
of the source. But that's just a guess; I'm actually not sure whether
it's a reading or a writing problem.
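
For comparison, a minimal sketch of writing through an
OutputStreamWriter with an explicit charset, so that the bytes no longer
depend on the platform default. It assumes UTF-8 is the intended
encoding of the file, which is not verified here, and it does not touch
DocumentHelper.writeDocument, whose handling of the XML declaration is
not checked either. The class and method names are hypothetical, and a
ByteArrayOutputStream stands in for source.getOutputStream().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingSketch {

    // Writes text through an OutputStreamWriter that is given an explicit
    // charset instead of falling back to the platform default.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } // closing the writer flushes the encoded bytes into 'out'
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String title = "B\u00e4r"; // hypothetical value containing an umlaut

        byte[] utf8 = write(title, StandardCharsets.UTF_8);
        byte[] latin1 = write(title, StandardCharsets.ISO_8859_1);

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 length: " + utf8.length);
        System.out.println("ISO-8859-1 length: " + latin1.length);
    }
}
```

Whether this is the actual cause in SourceUtil remains a guess; the
sketch only shows that the same string produces different bytes
depending on the charset handed to the writer.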

Or is something wrong with my setup?

Josias



> 
> -- Andreas
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


