Scott Cantor closed XERCESC-1166.

> Xerces cannot open file whose name includes UTF8 characters
> -----------------------------------------------------------
>                 Key: XERCESC-1166
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1166
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 2.4.0
>         Environment: Operating System: Other
> Platform: Macintosh
>            Reporter: Mark Goldstein
>            Assignee: James Berry
>             Fix For: 2.5.0
> I originally wrote about this as attached below.
> James Berry asked me to file the bug report, see his e-mail below as well.
> On Feb 25, 2004, at 5:31 PM, Mark Goldstein wrote:
> Hello,
> Using Xalan/Xerces I tried to transform a file with a name that included an
> "e" with accent. Your mailer might show it:
> féébad.xml
> The command line call (using Mac OS-X copy/paste which converts the characters
> to octal constants) looks like this:
> mark$ ./Xalan -o foo.out fe\314\201e\314\201bad.xml foo.xsl
> And it results in this error:
> Fatal Error at (unknown file , line 0 , column {null} ): An exception occurred!
> Type:RuntimeException,
> Message:The primary document entity could not be opened. Id=féébad.xml
> SAXParseException: An exception occurred! Type:RuntimeException,
> Message:The primary document entity could not be opened. Id=féébad.xml (, line
> 0, column 0)
> Is this a known bug? Is there a work-around?
> This isn't a known bug, but, having done a bit of snooping, I do believe that
> it is a bug.
> Here's what I think is going on:
> Xerces creates a transcoder that converts from the local code page to unicode
> (LCP Transcoder). On Mac OS, it assumes the local code page is whatever the
> default system script encoding is, which is often MacRoman. This LCP
> Transcoder is used whenever an XMLString is created from a char*. That is
> done, for instance, as part of taking a file off the command line and creating
> a parser from it.
> The problem in your case is that the characters coming off the Mac OS X
> command line are actually utf-8, not (MacRoman, or whatever). They're being
> converted to utf-16 as if they were MacRoman. And all hell breaks loose,
> including the unfortunate fact that the file can't be opened.
> This is a bit of a no-win situation. We could simply make the LCP Transcoder
> assume the LCP is always utf-8, but that would require a major re-architecting
> of the transcoder, since we rely on the lower level unicode converter, which
> can't transcode between unicode encodings, only to and from them. It also may
> not be quite the right answer either, since it just fixes the situation for
> the command line and ignores the fact that there are a number of other LCP
> encodings being used, which this decision could affect.
> There are probably a number of workarounds, but they all basically boil down
> to not relying on the LCP transcoder to convert the utf-8 string from the
> command line into unicode in the first place. For instance, you could
> explicitly call the intrinsic utf-8 transcoder through Transervice, or cheat
> and call TranscodeUTF8ToUniChars, which is buried down in MacOSPlatformUtils.
> There are probably better solutions, but it's getting late for me now. Once
> you have the filename in utf-16, pass that directly into the parser.
> There may be other simpler workarounds, which might include simply changing
> the encoding of text in the terminal to MacRoman, or whatever. But it's making
> my head hurt to understand the interactions that would occur in that
> case...your file wouldn't list correctly in that case, I would think.
> Please let me know how it goes, and if you could write a bug report that would
> help as well.
> James.
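
The mismatch James describes can be reproduced outside Xerces. The sketch below is an illustration only, written in Python rather than against the Xerces-C++ API: it takes the exact bytes from the command line in the report and decodes them once as UTF-8 (what the shell sent) and once as MacRoman (what the LCP transcoder is described as assuming). The names RAW_NAME and decode_as are hypothetical helpers introduced for this example.

```python
import unicodedata

# Bytes as typed on the command line in the report: each accented "e" is
# "e" followed by U+0301 (combining acute accent), which UTF-8 encodes as
# the two bytes 0xCC 0x81 (octal \314 \201).
RAW_NAME = b"fe\314\201e\314\201bad.xml"


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw command-line bytes under an assumed code page."""
    return raw.decode(encoding)


# What the shell actually sent.
correct = decode_as(RAW_NAME, "utf-8")
# What results if the local code page is assumed to be MacRoman.
mistaken = decode_as(RAW_NAME, "mac_roman")

if __name__ == "__main__":
    print(repr(correct))
    print(repr(mistaken))
    # Composed (NFC) form of the correctly decoded name, for comparison.
    print(unicodedata.normalize("NFC", correct))
```

Under the MacRoman assumption each 0xCC 0x81 pair becomes two unrelated letters instead of one combining accent, so the resulting name no longer matches any file on disk. This is consistent with the "could not be opened" error above, and with the suggested workaround of decoding the argument as UTF-8 explicitly before handing it to the parser.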
