Re: parsing XML from string using MemBufInputSource

David Bertoni Thu, 30 Apr 2009 11:54:00 -0700

S. Gross wrote:

David Bertoni wrote:
S. Gross wrote:
Hi there,
I tried to parse a const char* via MemBufInputSource in the following way:
CString RecieveString= "...." //Using Unicode and
                  //wchar_t as a built-in type
const char* gXMLInMemBuf =XMLString::transcode(RecieveString.GetString());
You should never transcode a Unicode string to the local code page, because you never know if it will support all of the characters in the string. Also, unless you explicitly set the encoding on the InputSource, the parser will assume either UTF-8 or whatever encoding is specified in the XML declaration.
MemBufInputSource* memBufIS = new MemBufInputSource
(
    (const XMLByte*)gXMLInMemBuf
    , static_cast<const XMLSize_t>(strlen(gXMLInMemBuf))
    , "test"
    , false
);

parser->parse(*memBufIS); //Error on WinXP
This is working fine on my system (Vista Business 32). I am using VC++9 and Xerces 3.0.1.
You should check to see what your local code page is set to. It's probably UTF-8.
But I am running into trouble on Win XP 32. An error is thrown with the message:
error: invalid byte 't' at position 2 of a 4-byte sequence
I'm not sure what the local code page is on this machine, but it is certainly not UTF-8. It seems your document either has an explicit encoding declaration of UTF-8, or it doesn't have one at all, which implies UTF-8.
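
To illustrate how an error like the one above can arise, here is a minimal sketch of a structural UTF-8 check. It is not Xerces's actual decoder, and the function names are hypothetical. It only checks lead bytes and continuation bytes; it does not reject overlong encodings or surrogates. The sample input in the usage note is also an assumption: the byte 0xF4 is 'ô' in Windows-1252, but in UTF-8 it announces a 4-byte sequence, so the plain 't' that follows is rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: expected sequence length for a UTF-8 lead byte,
// or 0 if the byte cannot start a sequence.
static int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;             // plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // continuation byte or invalid lead
}

// Hypothetical helper: returns an empty string if the bytes are structurally
// valid UTF-8, otherwise a description of the first problem found.
std::string describeFirstUtf8Error(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const int len = utf8SequenceLength(lead);
        if (len == 0) {
            return "invalid lead byte at offset " + std::to_string(i);
        }
        for (int k = 1; k < len; ++k) {
            if (i + k >= bytes.size()) {
                return "truncated " + std::to_string(len) + "-byte sequence";
            }
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {      // continuation bytes are 10xxxxxx
                return "invalid byte at position " + std::to_string(k + 1) +
                       " of a " + std::to_string(len) + "-byte sequence";
            }
        }
        i += static_cast<std::size_t>(len);
    }
    return "";
}
```

Calling describeFirstUtf8Error(std::string("c\xF4" "te")) reports an invalid byte at position 2 of a 4-byte sequence, while pure ASCII input such as "<a/>" passes.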
The resulting DOMDocument* from parsing was intended as input for xsd from CodeSynthesis. So I tried it the other way and put "gXMLInMemBuf" into a stringstream as input for xsd.
Different solution, same problem. I suppose this is an encoding problem that I can't figure out (or maybe something different).
Any suggestions would be really helpful, because this is part of my work for my studies and I am trying to get through it as fast as possible.
There's no need to do any transcoding of a UTF-16 string. In fact, the parser operates internally in UTF-16, so it's the most efficient representation:
MemBufInputSource* memBufIS = new MemBufInputSource
(
    RecieveString.GetString()
    , static_cast<const XMLSize_t>(RecieveString.GetLength())
    , "test"
    , false
);
If you have reason to believe the XML document in the CString instance has an encoding declaration that is not UTF-16, you should explicitly set the encoding for the InputSource:
memBufIS->setEncoding(L"UTF-16LE");

BTW, you've mis-spelled "receive" in your variable name.

Dave
Thanks for this quick response!
I tried your suggestion, but that did not work because of a compiler error saying that it is not possible to convert "const wchar_t*" to "const XMLByte*".

Yes, a typo in my code snippet:

    reinterpret_cast<const XMLByte*>(RecieveString.GetString());

I tried it brute force with a cast to "const XMLByte*", but don't I need to multiply the length by 2 because of the UTF-16 encoding?

Sigh...  Yes, it should be:

static_cast<const XMLSize_t>(RecieveString.GetLength() * sizeof(wchar_t))
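
The point of the correction above is that the length passed alongside a byte pointer is a byte count, not a character count. The sketch below shows the same arithmetic with standard types only. It assumes std::u16string as a stand-in for a Unicode CString, and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes occupied by a UTF-16 string,
// i.e. code units multiplied by the size of one code unit.
std::size_t utf16ByteCount(const std::u16string& text) {
    return text.size() * sizeof(char16_t);
}

// Hypothetical helper: view the same buffer as raw bytes, which is what a
// byte-oriented input source expects. The pointer is only valid while
// `text` is alive and unmodified.
const unsigned char* utf16Bytes(const std::u16string& text) {
    return reinterpret_cast<const unsigned char*>(text.data());
}
```

For the four-character document u"<a/>", utf16ByteCount returns 8, so passing the character count would hand over only half of the buffer.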

It worked with memBufIS->setEncoding(L"UTF-8") and my old setup at home.
I have to try that at the lab on Tuesday. I can assure you that my CString contains only proper English characters and nothing else. That should work in Germany too.

I wouldn't bet on it. If you want your application to be robust, you should use the UTF-16 version of the data and force the encoding on the InputSource. It will also avoid the extra CPU cycles and memory allocation to transcode the data.
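
As a rough illustration of why transcoding to a local code page is fragile, the sketch below narrows UTF-16 text to ISO-8859-1, which here merely stands in for "some single-byte local code page". This is an assumption for the example and not what XMLString::transcode() does internally. The function name is hypothetical.

```cpp
#include <string>

// Hypothetical helper: narrows UTF-16 code units to ISO-8859-1 (Latin-1).
// Any code unit above 0xFF has no single-byte equivalent and is replaced
// with '?'; `lossy` is set to true when that happens.
std::string narrowToLatin1(const std::u16string& text, bool& lossy) {
    std::string out;
    out.reserve(text.size());
    lossy = false;
    for (char16_t unit : text) {
        if (unit <= 0xFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
        } else {
            out.push_back('?');   // character is not representable
            lossy = true;
        }
    }
    return out;
}
```

Plain ASCII text survives unchanged, but a single character such as the euro sign (U+20AC) is silently replaced, and the caller only notices if it checks the flag.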

If you decide to keep the call to XMLString::transcode(), don't forget to call XMLString::release(&gXMLInMemBuf) when you're done with the data.


Dave
