[dom4j-dev] [ dom4j-Bugs-1003141 ] SAXReader.read(File file) character encoding problem

SourceForge.net Fri, 08 Oct 2004 10:15:13 -0700

Bugs item #1003141, was opened at 2004-08-04 02:50
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=116035&aid=1003141&group_id=16035


Category: None
Group: None
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Zhang Tao (robintj)
Assigned to: Maarten Coene (maartenc)
Summary: SAXReader.read(File file) character encoding problem

Initial Comment:
In the SAXReader.read(File file) function, the code is:
  return read( new InputSource(new FileReader(file)) );

But the documentation for the FileReader class says:
    Convenience class for reading character files. The constructors of
    this class assume that the default character encoding and the
    default byte-buffer size are appropriate. To specify these values
    yourself, construct an InputStreamReader on a FileInputStream.
    FileReader is meant for reading streams of characters. For reading
    streams of raw bytes, consider using a FileInputStream.

This means FileReader always uses the default character
encoding. When I use this code:
  File f = new File(fname);
  SAXReader reader = new SAXReader();
  Document doc = reader.read(f);
it cannot read Chinese characters correctly from an XML file
that uses UTF-8 encoding (my system default character
encoding is zh_CN.GBK).

But if I change the code to:
  Document doc = reader.read(new FileInputStream(f));
All is OK.

So I suggest that the SAXReader.read(File file) function be changed to:
  return read( new InputSource(new FileInputStream(file)) );
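
The difference between the two InputSource constructors can be
reproduced without dom4j. The sketch below is illustrative only: it uses
the JDK's built-in DOM parser instead of SAXReader, the class and method
names (EncodingDemo, parseFromBytes, parseFromMismatchedReader) are
hypothetical, and ISO-8859-1 stands in for a platform default encoding
that does not match the file. With a byte stream the parser can read the
encoding from the XML declaration; with a character stream the bytes
have already been decoded before the parser sees them.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of dom4j.
class EncodingDemo {
    // One Chinese character (U+4E2D) used as the element text.
    public static final String TEXT = "\u4e2d";

    // The document as it would sit on disk: UTF-8 bytes with a UTF-8 declaration.
    static final byte[] XML_BYTES =
        ("<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>" + TEXT + "</root>")
            .getBytes(StandardCharsets.UTF_8);

    // Byte stream: the parser decodes the bytes itself, guided by the declaration.
    public static String parseFromBytes() throws Exception {
        InputStream in = new ByteArrayInputStream(XML_BYTES);
        return parse(new InputSource(in));
    }

    // Character stream: the bytes are decoded by the Reader before parsing.
    // ISO-8859-1 is an assumption standing in for a mismatched default encoding.
    public static String parseFromMismatchedReader() throws Exception {
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(XML_BYTES), StandardCharsets.ISO_8859_1);
        return parse(new InputSource(reader));
    }

    // Parses the source and returns the text content of the root element.
    private static String parse(InputSource source) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("byte stream intact: " + TEXT.equals(parseFromBytes()));
        System.out.println("mismatched reader intact: "
            + TEXT.equals(parseFromMismatchedReader()));
    }
}
```

In this sketch only the byte-stream variant returns the original
character, which matches the behaviour described above for
reader.read(new FileInputStream(f)).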


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-10-08 08:40

Message:
Logged In: NO 

Hmm, I have a problem with this...

I have a string res which contains a complete document
(including the encoding=iso-8859-1 header).
If I call
  Document receive = DocumentHelper.parseText(res);
several characters inside the document are corrupted, and the
encoding is set to UTF-8 even though the string res
already declares the correct encoding.

If I write the same string to a file, the file looks fine, but
reading it back does not give a valid result:

FileWriter fw = new FileWriter( new java.io.File("C:/temp/resultat.xml"));
fw.write(res);
fw.flush();
fw.close();
SAXReader saxreader = new SAXReader();
saxreader.setXMLReaderClassName("org.apache.xerces.parsers.SAXParser");
Document receive = saxreader.read("C:/temp/resultat.xml");

It shows up as a UTF-8 encoded document, with some of the
characters corrupted.

What on earth is wrong?!
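
The report does not say how res was built or what the platform default
encoding is, so the cause cannot be determined from the text above. One
thing worth ruling out is a mismatch between the encoding named in the
XML declaration and the encoding used to write the file, because
FileWriter, like FileReader, uses the platform default. The sketch below
is an assumption-laden illustration, not a diagnosis: it uses the JDK's
DOM parser rather than dom4j, a temporary file rather than
C:/temp/resultat.xml, and a hypothetical helper named roundTrip.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class; not part of dom4j.
class DeclaredEncodingRoundTrip {
    // Sample document declaring iso-8859-1, with one non-ASCII character (U+00E9).
    public static final String XML =
        "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><root>\u00e9</root>";

    // Writes xml to a temporary file using the named charset, parses the file
    // back from its bytes, and returns the text content of the root element.
    public static String roundTrip(String xml, String charsetName) throws Exception {
        File file = File.createTempFile("resultat", ".xml");
        try {
            try (Writer writer =
                     new OutputStreamWriter(new FileOutputStream(file), charsetName)) {
                writer.write(xml);
            }
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(file).getDocumentElement().getTextContent();
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws Exception {
        // Written in the declared encoding: the text should survive.
        System.out.println("\u00e9".equals(roundTrip(XML, "ISO-8859-1")));
        // Written in a different encoding than declared: the text should not survive.
        System.out.println("\u00e9".equals(roundTrip(XML, "UTF-8")));
    }
}
```

If the text only survives when the writer's charset matches the
declaration, replacing FileWriter with an OutputStreamWriter that names
the charset explicitly is one possible remedy.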

----------------------------------------------------------------------

Comment By: Maarten Coene (maartenc)
Date: 2004-08-04 03:07

Message:
Logged In: YES 
user_id=178745

This has already been fixed in dom4j 1.5!

thanks for the report
Maarten

----------------------------------------------------------------------

_______________________________________________
dom4j-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dom4j-dev
