Dietmar Kuehl wrote:
> - The 'state' argument for the various facet functions is normally accessed as a plain integer in your code. However, e.g. 'std::mbstate_t' is not an integer at all. In fact, it is implementation-defined and can thus be something entirely different on each platform. I'd suggest providing some form of accessor functions to read or write objects of this type, possibly in the form of some traits class. This could then be used to cope with differences and/or a more complex organization of the state type.

That's a good point. On my platform (Win32), mbstate_t is indeed a typedef for "int" and thus satisfies all the requirements (i.e. it's an integer type capable of containing an unsigned 21-bit value). I was aware that this is not the case on other platforms, but I made the assumption anyway, simply because the user is not actually required to use mbstate_t. One can use a char traits class different from std::char_traits<T> that defines a suitable state type. For example:
#include <string>            // std::char_traits
#include <boost/cstdint.hpp> // boost::uint32_t

template <typename T>
struct MyTraits : public std::char_traits<T>
{
    // an integer type wide enough for an unsigned 21-bit value
    typedef boost::uint32_t state_type;
};
then use basic_fstream<T, MyTraits<T> > instead of basic_fstream<T>.
[I just tried that and found a typo in correctness_test.hpp that prevents the test suite from compiling, but nothing too serious.]
Of course I have to document this ;)
If you believe that going through the char traits class is undesirable, or simply "too much", I see no problem in adding those accessor functions as you suggest.
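For illustration, such an accessor layer might look roughly like this (just a sketch; the name state_access and its get()/set() functions are hypothetical, and the primary template assumes an integer-like state type):

#include <boost/cstdint.hpp>

// Hypothetical traits class along the lines suggested above: the facets
// would read and write the conversion state only through get()/set(), so
// a platform whose state type is not an integer only needs to provide a
// specialization of this template.
template <typename State>
struct state_access
{
    // primary template: assumes State is a plain integer type
    static boost::uint32_t get(State const& s) { return s; }
    static void set(State& s, boost::uint32_t v) { s = v; }
};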
I forgot to say in my previous post that this version of the library only supports platforms where type char is exactly 8 bits. This assumption is required because I have to work at the octet level while reading from/writing to a stream.

> - Falling back to UTF-16 in cases where the Unicode characters may not fit into the words is one possible approach to dealing with the characters. Another approach is to just indicate an error. For example, I know that certain XML files I'm using exclusively store ASCII characters, and there is no need to use a different internal character type than 'char'. If I ever come across a non-ASCII character, I don't really want to have it encoded in something like UTF-8; I want reading the file to fail (and possibly retry reading using a bigger character type). I would appreciate it if such a possibility were incorporated, too (this is, however, nothing I feel too strongly about).
That said, I have deliberately decided not to allow "char" as the internal type. The internal type must be an integral type able to represent an unsigned 16-bit value (for UTF-16) or an unsigned 21-bit value (for UTF-32). The choice between UTF-16 and UTF-32 as the internal encoding is made at compile time based on the sizeof of the internal char type, which must be either 2 or 4.
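To illustrate the mechanism, the compile-time selection could be written roughly like this (only a sketch of the technique, not the library's actual code; the tag names are made up):

#include <cstddef>

struct utf16_tag {}; // hypothetical tag types standing for the
struct utf32_tag {}; // two possible internal encodings

// The primary template is left undefined, so an internal char type
// whose size is neither 2 nor 4 fails to compile.
template <typename InternT, std::size_t Size = sizeof(InternT)>
struct select_internal_encoding;

template <typename InternT>
struct select_internal_encoding<InternT, 2> { typedef utf16_tag type; };

template <typename InternT>
struct select_internal_encoding<InternT, 4> { typedef utf32_tag type; };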
Such a decision is a strong one, I know. Yet one of the main obstacles to the acceptance of Unicode as a standard is that there are too many applications around that implement only a subset of it. For example, one of the first pieces of feedback I got at the beginning of this work was "I don't need to handle surrogates, could you provide an optimized facet for that case?". The answer was "Yes, I could, but I won't".
Yours is a very special case. Frankly, I'd rather not support it. In fact, it would be extremely simple to do: you just take the facet declared in the file detail/degenerate.hpp and change a few lines. However, it would be outside the intent of this library, which is to provide UTF conversion according to the Unicode 3.2 requirements, no more, no less.
> - In the context in which I want to use the facets [at least those I'm implementing] I don't really want to bother about the details of the encoding. That is, I just want to 'imbue()' an appropriate 'std::locale' object into the stream and have the facet figure out what encoding is used, especially when reading from some file. The basic idea is that each file either starts with a byte order mark, from which UTF-16BE or UTF-16LE can be deduced, or it is in UTF-8. The encoding could e.g. be stored in typical 'std::mbstate_t' objects (these are often a struct with a 'wchar_t' and a count, or something like this). To enable something like this, it would be helpful if the actual conversion functions were separated from the facets: the facets would just call the corresponding conversion functions. The actual conversion is, apart from the state argument, entirely stateless and there is no need to bind it to any facet.

There already exists a facility to select the correct facet according to the byte order mark. It's very simple to use:
std::wifstream file("MyFile", std::ios_base::binary);
boost::utf::imbue_detect_from_bom(file);
That's it.
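For the curious, the detection amounts to something like the following (a simplified sketch, not the library's actual implementation; real code would also have to recognize UTF-32 BOMs and cope with unseekable streams):

#include <istream>

enum detected_encoding { enc_utf8, enc_utf16be, enc_utf16le };

// Read the first two octets and map a byte order mark to an encoding;
// with no BOM present, rewind and fall back to UTF-8.
detected_encoding detect_from_bom(std::istream& s)
{
    char b[2] = { 0, 0 };
    s.read(b, 2);
    const unsigned char b0 = static_cast<unsigned char>(b[0]);
    const unsigned char b1 = static_cast<unsigned char>(b[1]);
    if (b0 == 0xFE && b1 == 0xFF) return enc_utf16be;
    if (b0 == 0xFF && b1 == 0xFE) return enc_utf16le;
    s.clear();  // in case fewer than two octets were available
    s.seekg(0); // no BOM: rewind (fine for seekable streams like files)
    return enc_utf8;
}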
I considered separating the facet classes from the conversion implementation code, but I decided to keep the design "one facet == one encoding". There are a couple of minor advantages:
1) the user can have just one or two encodings in the executable without pulling in all of them (of course, if you use imbue_detect_from_bom you will need all the encodings anyway... ;)
2) I avoid one indirection. The cost of this should not be underestimated. The VS.NET implementation is quite brain-dead about codecvt: it will call the do_in function (which is virtual!) up to 4 times *per character*. Three times the function will return "partial"; the fourth time it will succeed. For every character. The less work I do in the failure case, the better.
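To make that pattern concrete, here is a self-contained sketch of such a driver loop (an illustration of the behavior described above, not VS.NET's actual source): the facet sees a window over the external buffer that grows one octet per call, so it keeps answering "partial" until the whole multibyte sequence is visible.

#include <cstddef>
#include <cwchar>
#include <locale>

// For a four-octet external sequence, in() returns 'partial' three
// times and succeeds only on the fourth call.
void drive(const std::codecvt<wchar_t, char, std::mbstate_t>& cvt,
           const char* buf, std::size_t len)
{
    std::codecvt_base::result r = std::codecvt_base::partial;
    for (std::size_t n = 1; r == std::codecvt_base::partial && n <= len; ++n)
    {
        std::mbstate_t state = std::mbstate_t(); // reconvert the prefix from scratch
        const char* from_next;
        wchar_t out[2];
        wchar_t* to_next;
        r = cvt.in(state, buf, buf + n, from_next, out, out + 2, to_next);
    }
}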
> - The conversion functions seem to assume that input sequences consisting of more than one character are passed to them. I don't think this is guaranteed at all. At the very minimum, the facets should defend against being called with too-small buffers (at least in 'do_out()' of 'utf16_utf32' the constant 3 is subtracted from a pointer without a prior check that the input area consists of at least three characters, thereby possibly invoking undefined behavior). I think it is acceptable behavior for a facet user (like e.g. a file stream buffer) to feed individual characters to the 'do_*()' functions, or at least to increase the buffers char by char (I think that at least one of the file stream buffer implementations actually does just that in certain cases!).

This is not correct: no conversion function makes assumptions about the size of the buffers (not intentionally, I mean!).
In the specific case of utf16_utf32 you have:
for (const char* limit = to_limit - 3; from < from_end && to < limit; ++from)
{
    // accesses the buffer using *to
}
If the buffer is shorter than 4 characters, the test "to < limit" fails immediately and the loop is never entered. The buffer is never accessed before or after the loop, so I don't see any undefined behaviour here.
Looking more closely, the only case that can go wrong is if the function is called with to == to_limit == 0, but I don't think a stream implementation could be so insane as to do that. I can add a check for this case, though. Maybe under assert()?
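The guard could be as small as this (a sketch of the suggested check, to be placed at the top of do_out(); not existing library code):

#include <cassert>

// reject a null output range before computing to_limit - 3
assert(to != 0 && to_limit != 0);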
> - My understanding of 'encoding()' is that '-1' indicates that "shift states" are used and that a special termination sequence returning the state to some initial state is possibly needed. That is, although the 'state' variable is used, the UTF-* encodings are considered to be stateless and there is no need to 'unshift()' anything, because the 'do_out()' function will eventually write all that is necessary if it gets a big enough output buffer. That is, the 'encoding()' members should probably return '0' or a positive integer for all UTF-* encodings, since the encodings are essentially stateless. My personal feeling is that UTF-16 is, however, a variable-width encoding, even if the internal encoding is also UTF-16: a perverse user of the facet might legally present the individual words out of order to the facet of a fixed-width encoding, e.g. in a first pass all odd words and in a second all even words. This would rip surrogate sequences apart, causing unexpected errors.

You are right, the UTF-* encodings are in fact stateless. However, for a reason too complex to describe here (it will be written up in the documentation for sure!), the facets that use UTF-16 internally need to be state-dependent in order to support surrogates. The UTF-32 facets don't need any special treatment and are indeed stateless.
If you can't wait for the docs to be written, the root of all evil is C++ Library issue #76:
http://anubis.dkuug.dk/jtc1/sc22/wg21/docs/lwg-defects.html#76
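For reference, the member under discussion is do_encoding(); a facet treating UTF-* as stateless and variable-width would override it roughly like this (a sketch of the standard codecvt convention, not this library's code):

// codecvt convention: -1 means a state-dependent encoding, 0 means
// variable width, and N > 0 means a fixed width of N external chars
// per internal character.
virtual int do_encoding() const throw()
{
    return 0; // variable width, no shift states, nothing to unshift()
}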
> The code itself is pretty readable. However, I haven't made a thorough review of it [yet].

Thanks! Then I have to prepare the docs quickly! I put more comments in the code than I usually do, but I feel they are not enough to explain everything ;)
What would be very helpful at this stage is having people compile and run the test suite on their favorite platforms. This would surely stress the library and provide some precious feedback.
Next step for me: prepare a test suite that overcomes the mbstate_t problem ASAP.
Alberto