RE: xmlbs based word-cleaner encoding problems

Nico Klasens Wed, 14 Jan 2004 08:40:46 -0800

> hi
> I am using the xmlbs based wordcleaner for our latest website
> (www.variatee.nl), and I am running into some encoding issues. It seems
> that the xmlbs code relies on the file.encoding system property, because
> if this value happens to be 'ASCII', the encoding is screwed up.
> The obvious thing would be to use the XMLBS setEncoding() method, as in:
>
>          String encoding=mmbase.getEncoding();
>          xmlbs.XMLBS xmlbs =new xmlbs.XMLBS("<body>" + textStr + "</body>", xmlbsDTD);
>          xmlbs.setEncoding(encoding);
>          xmlbs.process();
>
> Unfortunately this has no effect. Regardless of the value passed to the
> setEncoding() method, the file.encoding property is used, which seems
> strange to me.
>
> Am I missing something, or should this be considered an undesired feature
> of the xmlbs code? Of course it is not a big deal to set the file.encoding
> property prior to cleaning any fields, but I think it is confusing this
> way, and the mmbase encoding configuration should take precedence over the
> Java global encoding configuration.


Hello Ernst,

This is not an xmlbs-only issue. The xmlbs code is char-oriented. The
setEncoding method of xmlbs only has an effect when you use an InputStream.
An InputStream delivers bytes, which are converted to chars using the
specified encoding. A String should already be decoded correctly, because it
is a char array.
Fields which aren't cleaned are saved correctly to the database. This goes
ok, because the bytes aren't converted to chars to do string modifications.
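
To illustrate the difference between bytes and chars described above, here
is a minimal Java sketch. It does not use xmlbs at all; the class name
DecodeDemo and the helper decode() are made up for the example. The same two
bytes are decoded once as UTF-8 and once as ISO-8859-1, and only the first
gives the intended character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {

    /** Reads all bytes from the stream and decodes them with the given charset. */
    static String decode(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};

        String right = decode(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = decode(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1 char: U+00E9
        System.out.println(wrong.length()); // 2 chars: U+00C3 U+00A9
    }
}
```

Once the bytes have been decoded with the wrong charset, the resulting
String carries the wrong chars, and a later setEncoding() call on a
char-oriented cleaner has nothing left to correct.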

I have seen some situations where the html cleaning resulted in question
marks. One of them is when a jsp reads a string with getParameter() and
request.setCharacterEncoding() has not been called with the right client
encoding. You have to add the following to the jsp with the form to make
sure it is in utf-8 (could also be iso-8859-1 or cp1252).

<%@ page language="java" contentType="text/html; charset=utf-8" %>
And sometimes
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Then a character encoding filter has to be used, so that the request
encoding is always set to utf-8 (see
org.mmbase.servlet.CharacterEncodingFilter).
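
The effect of a wrong request encoding can be reproduced outside a servlet
container. The sketch below only simulates the decoding step with
java.net.URLDecoder; it is not the code of any application server or of
CharacterEncodingFilter, and the names FormDecodingDemo and decodeParam() are
hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FormDecodingDemo {

    /** Decodes a URL-encoded form value using the given request encoding. */
    static String decodeParam(String raw, String requestEncoding) {
        try {
            return URLDecoder.decode(raw, requestEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A browser that submits "caf" + U+00E9 from a utf-8 page sends these bytes.
        String raw = "caf%C3%A9";

        String asUtf8 = decodeParam(raw, "UTF-8");        // 4 chars, ends in U+00E9
        String asLatin1 = decodeParam(raw, "ISO-8859-1"); // 5 chars, ends in U+00C3 U+00A9

        System.out.println(asUtf8.length());
        System.out.println(asLatin1.length());
    }
}
```

The form bytes are identical in both cases; only the encoding assumed on the
server side differs, which is why the filter has to set it before the first
getParameter() call.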

On an iso-8859-1/cp1252 OS the jsp and filter settings are enough to save
utf-8 data. It still fails on OSes with the ascii character encoding. I
don't know exactly why, but somewhere (in the application server or in
MMBase) the incoming bytes are already converted to the OS encoding. We
solved it on some systems with the -Dfile.encoding=utf-8 command line option.
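
A small Java sketch shows where the question marks come from when text
passes through an ASCII conversion. It is only an illustration of the
charset behaviour, not of the place in the application server or MMBase
where the conversion happens; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

final class AsciiLossDemo {

    /** Encodes to US-ASCII and decodes again, as an ASCII-only layer would. */
    static String throughAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // U+00E9 has no ASCII representation and is replaced by '?' (0x3F).
        System.out.println(throughAscii("caf\u00e9")); // prints "caf?"
    }
}
```

The replacement is lossy: once a char has become '?', no later encoding
setting can bring the original back.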

When you use the -Dfile.encoding option, read the bug reports below first.
They explain that on some OSes the option is a read-only property and won't
change the JVM encoding. Red Hat is one of these OSes, and you will get weird
results there (e.g. with String.equals), because some parts of Java use
file.encoding and others use the JVM encoding.
The system locale usually defines the system encoding. On Red Hat
LANG=en_us.en uses the ISO8859-1 encoding and LANG=en_US.UTF-8 uses the
utf-8 encoding. Fortunately, the locale setting can usually be defined per
user.
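
To check which values a given JVM actually ended up with, something like
the following can be run with and without -Dfile.encoding. It is a sketch
that assumes Java 5 or later for Charset.defaultCharset(); the class name
DefaultEncodingReport is made up.

```java
import java.nio.charset.Charset;

final class DefaultEncodingReport {

    /** Collects the encoding-related settings visible to this JVM. */
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name()
                + " LANG=" + System.getenv("LANG");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If file.encoding and the default charset disagree, the property was
probably changed without affecting the JVM, which is the situation the bug
reports below describe.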

- http://developer.java.sun.com/developer/bugParade/bugs/4397522.html
Bug ID: 4397522 Read only status of file.encoding varies by platform
- http://developer.java.sun.com/developer/bugParade/bugs/4163515.html
Bug ID: 4163515 -Dfile.encoding option doesn't affect default ByteToChar
converter
- http://developer.java.sun.com/developer/bugParade/bugs/4165411.html
Bug ID: 4165411 java.lang.System: Forbid the modification of read-only
system properties
- http://developer.java.sun.com/developer/bugParade/bugs/4175635.html
Bug ID: 4175635 default file encoding not specified

There are a lot of client/server encoding issues. When you use iso-8859-1 on
the server you will have problems with characters received from cp1252
Windows clients. Cp1252 is almost identical to iso-8859-1, and Windows will
send you cp1252 when you request iso-8859-1. In some cases you will receive
bytes that are control characters in iso-8859-1 but printable characters in
cp1252 (e.g. the curly single and double quotes from MS Word).
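
The difference is easy to see for a single byte. The sketch below decodes
0x93, which cp1252 uses for the opening curly double quote, with both
charsets. It assumes the JVM supports the charset name "windows-1252"; the
class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Demo {

    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes one byte with the given charset and returns its code point. */
    static int codePointOf(byte b, Charset charset) {
        return new String(new byte[] {b}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        byte openQuote = (byte) 0x93;
        // cp1252: U+201C LEFT DOUBLE QUOTATION MARK
        System.out.printf("cp1252:     U+%04X%n", codePointOf(openQuote, CP1252));
        // iso-8859-1: U+0093, a C1 control character
        System.out.printf("iso-8859-1: U+%04X%n",
                codePointOf(openQuote, StandardCharsets.ISO_8859_1));
    }
}
```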
When you use utf-8 on the server, other things will fail. One of them is
recognizing the mimetype of attachments and images when their content is
compared with patterns of type string (magic.xml). The bytes in the uploaded
file could be cp1252 or iso-8859-1, and the server compares them with the
utf-8 bytes of the pattern strings. I recently added a fix, so that when
this detection fails it falls back to looking up the mimetype by file
extension.
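
Why binary content and utf-8 strings don't mix can be shown with a file
signature. The sketch below is not the MMBase magic.xml implementation; the
class MagicBytesDemo and its helpers are made up. It uses the JPEG
signature FF D8 FF and shows that these bytes survive a round trip through
iso-8859-1 but not through utf-8, so comparing raw bytes is the safer
choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MagicBytesDemo {

    /** JPEG files start with the bytes FF D8 FF. */
    static final byte[] JPEG_MAGIC = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    /** Compares the start of the data with the signature, byte by byte. */
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** True if decoding and re-encoding with the charset returns the same bytes. */
    static boolean survivesRoundTrip(byte[] data, Charset charset) {
        return Arrays.equals(new String(data, charset).getBytes(charset), data);
    }

    public static void main(String[] args) {
        byte[] upload = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10};

        System.out.println(startsWith(upload, JPEG_MAGIC));                            // true
        System.out.println(survivesRoundTrip(upload, StandardCharsets.ISO_8859_1));    // true
        System.out.println(survivesRoundTrip(upload, StandardCharsets.UTF_8));         // false
    }
}
```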

So, in short, this is not really an xmlbs thing, but more an environment
thing. XMLBS can't fix it when the chars (bytes/encoding) are already wrong.

Nico Klasens
 
Finalist IT Group
Java Specialists


------------ A JSP which does some handy encoding stuff ---------------
<%@ page language="java" contentType="text/html; charset=utf-8" %>
<%@ page import="sun.io.Converters"%>
<%@ taglib uri="http://www.mmbase.org/mmbase-taglib-1.0" prefix="mm" %>

<%
        // Must be called before the first getParameter() call on this request.
        try {
                request.setCharacterEncoding("UTF-8");
        }
        catch (Exception e) {
                %>
                <%= e.toString() %>
                <%
        }
%>
        
<mm:cloud>
<html>
<head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>

File.encoding: <%=System.getProperty("file.encoding") %><br/>
OS encoding: <%= Converters.getDefaultEncodingName() %><br/>
Bytes received: <% 
        if (request.getParameter("intro") != null) {
                byte[] bytes = request.getParameter("intro").getBytes("UTF-8");
                for (int i = 0; i < bytes.length; i++) {
                        %><%= Integer.toString(bytes[i]) %> <%
                }
        }
%><br/>
request: <%= request.getParameter("intro") %><br/>

<mm:import externid="intro" />
<mm:present referid="intro">
MMBase: <mm:write referid="intro" /><br/>
</mm:present>

<form method="post">
        <input type="text" name="intro" value=""/>
        <input type="submit" />
</form>
</body>
</html>

</mm:cloud>
