Re: Progress of the unsigned int to XMLSize_t migration

David Bertoni Fri, 04 Jul 2008 13:17:03 -0700

Boris Kolpackov wrote:

Hi David,
David Bertoni <[EMAIL PROTECTED]> writes:
I should have been more clear about what I was proposing. Changing thecode to an unsigned type is risky, because the APIs previously used anegative value to indicate "not found" or "error." However, changing toa 64-bit _signed_ type on 64-bit platforms, while not a permanentsolution, adds another 32 bits for returning index values, without allof the problem with changing to an unsigned type.
I prefer to fix the problem properly rather than implementing temporary
solutions. Once you change int to XMLSSize_t that piece of code will
be indistinguishable from other legitimate uses of XMLSSize_t.

The code that returns length/size via a signed int should be redesign
to use XMLSize_t and use some other mechanism to signal an error (e.g.,
the function returns bool and has a by-reference argument of type
XMLSize_t).
As it stands now, on 64-bit platforms, the "indexing" address space inthe XMLString APIS is (usually) limited to INT_MAX. If we change it toa signed 64-bit type, it will be (usually) LONG_MAX. If we leave thingsas they are now, the incoming size parameters will be unsigned 64-bitvalues, while the return types will be signed 32-bit values. That seemsinconsistent, and will make it very difficult to write 32-bit/64-bitcompatible code.
There are three places where int is used in XMLString:

(1) regionMatches/regionIMatches

    These are some bizarre, special-purpose functions that can compare
    string regions using negative offsets. They are only used in regex
    and should be moved there instead of being in XMLString. In this
    case changing int to XMLSSize_t might be the only reasonable
    solution unless we can be certain that they are never called
    with negative offsets by the regex code.


The documentation for these functions says:

* has length charCount. The result is false if and only if at least
* one of the following is true:
*   offset1 is negative.
*   offset2 is negative.
*   offset1+charCount is greater than the length of str1.
*   offset2+charCount is greater than the length of str2.

So negative offsets immediately signify failure.  Since the function is
only called in two places in RegularExpression.cpp, it should be easy to
resolve this.

(2) hash functions
These currently return unsigned int and should be changed toXMLSize_t.

Agreed.


(3) indexOf functions

    Currently these return position in the string and -1 if not found.
    The way to address this is to return XMLSize_t and use ~((XMLSize_t)0)
    to indicate the "not found" condition. This is what std::string does.
    We should also define the npos constant in XMLString.

This is what Xalan-C's XalanDOMString class does, which also makes it
API compatible with the standard library.

Sounds good but please don't just mindlessly change signed int toXMLSSize_t.
I try not to mindlessly change anything. Trust me, I'm very carefulabout changing code without careful risk/benefit analysis.


Thank you. This comment was addressed to everybody who might want to
try and port some portion of Xerces-C++ to use XMLSize_t. I think we
should humble ourselves and acknowledge that the Xerces-C++ code quality
is quite poor. Some of the things you see are really troubling or just
plain stupid. For example, today I fixed the following fragment that
is quite close to what ends up on The Daily WTF:

int included = 0;

// Code that never touches included...

return (included) ? true : false;

So while we all try to be careful and mindful stuff like this somehow
ends up in the code.

Although that's a WTF moment, at least it's harmless.  In fact, I
wouldn't be surprised if some of the better compilers optimized that
out.  But there's a worse moment in XIncludeLocation.cpp where the
function casts away const on an input parameter and modifies it.  I've
got a patch to fix that, and few more crazy spots in the same file.

If you want more WTF moments, turn the warning level in Visual C++ from
3 to 4 (which I'd like to see as a permanent change) and rebuild the
library.  Yuck...

In the case of the collections, the int values are just used as keys, sothey have no semantic meaning to the collections themselves. Also,since these collections are more esoteric, they are not used in many places.
For example, RefHash3KeysIdPool is used by the scanners and schemagrammar to hash element declarations using the element name, namespaceURI and schema scope. The keys are a void* (the name), and unsigned intfor the URI ID, and an unsigned enum for the scope. So why should thekeys be ints when they're always storing unsigned ints?
I have nothing against changing this. Also note that changing the
container classes is just the small part of the job. Once you do
this you will need to change every place where these collections
are used (otherwise you will get tons of warning on, for example,
VC++ in 64-bit mode).

I've been building with Visual Studio at warning level 4, with the HP
aCC compiler, and with GCC at -Wall in an attempt to find the worst
spots in the code.


For example, id type in StringPool should be changed to XMLSize_t.
But the unsigned int is used throughout the code for string ids and
these places will need to be changed as well. One more or less
manageable way to do this might be to change StringPool and then
try to compile Xerces-C++ with VC++ in 64-bit mode. This will
result in the compiler pin-pointing with warnings all the places
that are affected by the change.

I'm happy to start working on many of these changes.  It's been
difficult for me to keep Xalan-C warning-free and 64-bit clean because
of all these inconsistencies in Xerces-C, so I'd love to get this fixed
for the 3.0 release.

Dave


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Progress of the unsigned int to XMLSize_t migration

Reply via email to