Dietmar Kuehl wrote:
> - The 'state' argument for the various facet functions is normally accessed as a plain integer in your code. However, e.g. 'std::mbstate_t' is not an integer at all. In fact, it is implementation-defined and can thus be something entirely different on each platform. I'd suggest providing some form of accessor functions to read or write objects of this type, possibly in the form of some traits class. This could then be used to cope with differences and/or a more complex organization of the state type.

That's a good point. On my platform (Win32), mbstate_t is indeed a typedef for "int" and thus satisfies all the requirements (i.e. it's an integer type capable of containing an unsigned 21-bit value). I was aware that this is not the case on other platforms, but I made the assumption anyway, simply because the user is not actually required to use mbstate_t. One can use a char traits class different from std::char_traits<T> that defines a suitable state type. For example:
#include <string>            // std::char_traits
#include <boost/cstdint.hpp> // boost::uint32_t

template <typename T>
struct MyTraits : public std::char_traits<T>
{
    // an integer type wide enough for an unsigned 21-bit value
    typedef boost::uint32_t state_type;
};
then use basic_fstream<T, MyTraits<T> > instead of basic_fstream<T>.
[I just tried that and found a typo in correctness_test.hpp that prevents the test suite from compiling, but nothing too serious.]
Of course I have to document this ;)
If you believe that going through the char traits class is undesirable, or simply "too much", I see no problem in adding those accessor functions as you suggest.
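For illustration, such an accessor layer might look roughly like this (just a sketch; the name state_access and its get()/set() functions are hypothetical, and the primary template assumes an integer-like state type):

#include <boost/cstdint.hpp>

// Hypothetical traits class along the lines suggested above: the facets
// would read and write the conversion state only through get()/set(), so
// a platform whose state type is not an integer only needs to provide a
// specialization of this template.
template <typename State>
struct state_access
{
    // primary template: assumes State is a plain integer type
    static boost::uint32_t get(State const& s) { return s; }
    static void set(State& s, boost::uint32_t v) { s = v; }
};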
I forgot to say in my previous post that this version of the library only supports platforms where type char is exactly 8 bits. This assumption is required because I have to work at the octet level while reading from/writing to a stream.

> - Falling back to UTF-16 in cases where the Unicode characters may not fit into the words is one possible approach to dealing with the characters. Another approach is to just indicate an error. For example, I know that certain XML files I'm using exclusively store ASCII characters, and there is no need to use a different internal character type than 'char'. If I ever come across a non-ASCII character, I don't really want to have it encoded in something like UTF-8; I want reading the file to fail (and possibly retry reading using a bigger character type). I would appreciate it if such a possibility were incorporated, too (this is, however, nothing I feel too strongly about).
That said, I have deliberately decided not to allow "char" as the internal type. The internal type must be an integral type able to represent an unsigned 16-bit value (for UTF-16) or an unsigned 21-bit value (for UTF-32). The choice between UTF-16 and UTF-32 as the internal encoding is made at compile time based on the sizeof of the internal char type, which must be either 2 or 4.
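To illustrate the mechanism, the compile-time selection could be written roughly like this (only a sketch of the technique, not the library's actual code; the tag names are made up):

#include <cstddef>

struct utf16_tag {}; // hypothetical tag types standing for the
struct utf32_tag {}; // two possible internal encodings

// The primary template is left undefined, so an internal char type
// whose size is neither 2 nor 4 fails to compile.
template <typename InternT, std::size_t Size = sizeof(InternT)>
struct select_internal_encoding;

template <typename InternT>
struct select_internal_encoding<InternT, 2> { typedef utf16_tag type; };

template <typename InternT>
struct select_internal_encoding<InternT, 4> { typedef utf32_tag type; };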
Such a decision is a strong one, I know. Yet one of the main obstacles to the acceptance of Unicode as a standard is that there are too many applications around that implement only a subset of it. For example, one of the first pieces of feedback I got at the beginning of this work was "I don't need to handle surrogates, could you provide an optimized facet for that case?". The answer was "Yes, I could, but I won't".
Yours is a very special case. Frankly, I'd rather not support it. In fact, it would be extremely simple to do: you just take the facet declared in the file detail/degenerate.hpp and change a few lines. However, it would be outside the intent of this library, which is to provide UTF conversion according to the Unicode 3.2 requirements, no more, no less.
> - In the context in which I want to use the facets [at least those I'm implementing] I don't really want to bother about the details of the encoding. That is, I just want to 'imbue()' an appropriate 'std::locale' object into the stream and have the facet figure out what encoding is used, especially when reading from some file. The basic idea is that each file either starts with a byte order mark, from which UTF-16BE or UTF-16LE can be deduced, or it is in UTF-8. The encoding could e.g. be stored in typical 'std::mbstate_t' objects (these are often a struct with a 'wchar_t' and a count, or something like this). To enable something like this, it would be helpful if the actual conversion functions were separated from the facets: the facets would just call the corresponding conversion functions. The actual conversion is, apart from the state argument, entirely stateless and there is no need to bind it to any facet.

There already exists a facility to select the correct facet according to the byte order mark. It's very simple to use:
std::wifstream file("MyFile", std::ios_base::binary);
boost::utf::imbue_detect_from_bom(file);
That's it.
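For the curious, the detection amounts to something like the following (a simplified sketch, not the library's actual implementation; real code would also have to recognize UTF-32 BOMs and cope with unseekable streams):

#include <istream>

enum detected_encoding { enc_utf8, enc_utf16be, enc_utf16le };

// Read the first two octets and map a byte order mark to an encoding;
// with no BOM present, rewind and fall back to UTF-8.
detected_encoding detect_from_bom(std::istream& s)
{
    char b[2] = { 0, 0 };
    s.read(b, 2);
    const unsigned char b0 = static_cast<unsigned char>(b[0]);
    const unsigned char b1 = static_cast<unsigned char>(b[1]);
    if (b0 == 0xFE && b1 == 0xFF) return enc_utf16be;
    if (b0 == 0xFF && b1 == 0xFE) return enc_utf16le;
    s.clear();  // in case fewer than two octets were available
    s.seekg(0); // no BOM: rewind (fine for seekable streams like files)
    return enc_utf8;
}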
I considered separating the facet classes from the conversion implementation code, but I decided to keep the design "one facet == one encoding". There are a couple of minor advantages:
1) the user can have just one or two encodings in the executable without pulling in all of them (of course, if you use imbue_detect_from_bom you will need all the encodings anyway... ;)
2) I avoid one indirection. The cost of this should not be underestimated. The VS.NET implementation is quite brain-dead about codecvt: it will call the do_in function (which is virtual!) up to 4 times *per character*. Three times the function will return "partial"; the fourth time it will succeed. For every character. The less work I do in the failure case, the better.
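To make that pattern concrete, here is a self-contained sketch of such a driver loop (an illustration of the behavior described above, not VS.NET's actual source): the facet sees a window over the external buffer that grows one octet per call, so it keeps answering "partial" until the whole multibyte sequence is visible.

#include <cstddef>
#include <cwchar>
#include <locale>

// For a four-octet external sequence, in() returns 'partial' three
// times and succeeds only on the fourth call.
void drive(const std::codecvt<wchar_t, char, std::mbstate_t>& cvt,
           const char* buf, std::size_t len)
{
    std::codecvt_base::result r = std::codecvt_base::partial;
    for (std::size_t n = 1; r == std::codecvt_base::partial && n <= len; ++n)
    {
        std::mbstate_t state = std::mbstate_t(); // reconvert the prefix from scratch
        const char* from_next;
        wchar_t out[2];
        wchar_t* to_next;
        r = cvt.in(state, buf, buf + n, from_next, out, out + 2, to_next);
    }
}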
> - The conversion functions seem to assume that input sequences consisting of more than one character are passed to them. I don't think this is guaranteed at all. At the very minimum, the facets should defend against being called with too-small buffers (at least in 'do_out()' of 'utf16_utf32' the constant 3 is subtracted from a pointer without a prior check that the input area consists of at least three characters, thereby possibly invoking undefined behavior). I think it is acceptable behavior for a facet user (like e.g. a file stream buffer) to feed individual characters to the 'do_*()' functions, or at least to increase the buffers char by char (I think that at least one of the file stream buffer implementations actually does just that in certain cases!).

This is not correct: no conversion function makes assumptions about the size of the buffers (not intentionally, I mean!).
In the specific case of utf16_utf32 you have:
for (const char* limit = to_limit - 3; from < from_end && to < limit; ++from)
{
    // accesses the buffer using *to
}
If the buffer is shorter than 4 characters, the test "to < limit" fails immediately and the loop is never entered. The buffer is never accessed before or after the loop, so I don't see any undefined behaviour here.
Looking more closely, the only case that can go wrong is if the function is called with to == to_limit == 0, but I don't think a stream implementation could be so insane as to do that. I can add a check for this case, though. Maybe under assert()?
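The guard could be as small as this (a sketch of the suggested check, to be placed at the top of do_out(); not existing library code):

#include <cassert>

// reject a null output range before computing to_limit - 3
assert(to != 0 && to_limit != 0);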
> - My understanding of 'encoding()' is that '-1' indicates that "shift states" are used and that a special termination sequence returning the state to some initial state is possibly needed. That is, although the 'state' variable is used, the UTF-* encodings are considered to be stateless and there is no need to 'unshift()' anything, because the 'do_out()' function will eventually write all that is necessary if it gets a big enough output buffer. That is, the 'encoding()' members should probably return '0' or a positive integer for all UTF-* encodings, since the encodings are essentially stateless. My personal feeling is that UTF-16 is, however, a variable-width encoding, even if the internal encoding is also UTF-16: a perverse user of the facet might legally present the individual words out of order to the facet of a fixed-width encoding, e.g. in a first pass all odd words and in a second all even words. This would rip surrogate sequences apart, causing unexpected errors.

You are right, the UTF-* encodings are in fact stateless. However, for a reason too complex to describe here (it will be written up in the documentation for sure!), the facets that use UTF-16 internally need to be state-dependent in order to support surrogates. The UTF-32 facets don't need any special treatment and are indeed stateless.
If you can't wait for the docs to be written, the root of all evil is C++ Library issue #76:
http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-defects.html#76
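For reference, the member under discussion is do_encoding(); a facet treating UTF-* as stateless and variable-width would override it roughly like this (a sketch of the standard codecvt convention, not this library's code):

// codecvt convention: -1 means a state-dependent encoding, 0 means
// variable width, and N > 0 means a fixed width of N external chars
// per internal character.
virtual int do_encoding() const throw()
{
    return 0; // variable width, no shift states, nothing to unshift()
}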
> The code itself is pretty readable. However, I haven't made a thorough review of it [yet].

Thanks! Then I have to prepare the docs quickly! I put more comments in the code than I usually do, but I feel they are not enough to explain everything ;)
What would be very helpful at this stage is having people compile and run the test suite on their favorite platforms. This would surely stress the library and provide some precious feedback.
Next step for me: prepare a test suite that overcomes the mbstate_t problem ASAP.
Alberto