On Sun, 28 Aug 2011 13:25:58 +0200 Juan Jose Garcia-Ripoll <juanjose.garciarip...@googlemail.com> wrote:
> I think we have two different models in mind. This is how I see it

That's possible, or some minor misunderstanding; I'll read the following carefully.

> * READ/WRITE-BYTE do not have external formats, period. They are binary
> I/O functions and the only customizable things they have are the word
> size and the endianness, but they do not know about characters.

This makes sense to me as well.

> * Binary sequence streams are streams built on arrays of integers. They
> are interpreted as a collection of octets. The way these octets are
> handled is determined by two things:
>   - The array element type influences the byte size of READ/WRITE-BYTE.
>   - The external format determines how to read characters *from the
>     octets*, independently of the byte size.
> This means that if you construct a binary sequence stream you will have
> to pay attention to the endianness of your data!

If I understand the above, READ-CHAR on a binary/octet sequence stream created on top of an array of UTF-8 octets would then work if the EXTERNAL-FORMAT were UTF-8 (and would signal a decoding error, with a restart, on invalid octets). If that's what it means, this is what I need.

I need to leverage ECL's UTF-8 decoder to convert arbitrary binary bytes/octets into Unicode strings (without needing to worry about the internal codepoint representation of ECL strings). I also need to be able to encode ECL Unicode characters (and strings, likewise without worrying about the internal representation) into UTF-8 binary octets (this part already worked fine in your initial sequence-streams implementation). In other words, WRITE-CHAR to a binary octet-vector stream would produce UTF-8 octets under a UTF-8 EXTERNAL-FORMAT. Both directions are sketched below.

In this sense, the external format has to do with the binary sequence format only, while the character and string representation remains internal and transparent.

> * Common Lisp strings have a fixed external format: latin-1 and unicode
> in ECL. This cannot be changed and I do not want to change it in the
> foreseeable future. In consequence with the previous statement, my code
> did not contemplate reinterpretation of strings with different external
> formats. I still feel uneasy about this idea, because this is only a
> sign that you got your data wrong. Nevertheless I have made the
> following changes.

I understand that ECL uses whatever internal representation it wishes (currently UCS-4 for Unicode, or LATIN-1), that string streams will also use it, and that I shouldn't have to worry about it.

I don't think that I need reinterpretation of strings; there must have been some misunderstanding there, possibly relating to a bug in the previous example/test code I sent. Sorry about that if so.

> - If no external format is supplied, a sequence stream based on a string
> works just like a string stream.

Stop reading here then.

> - Otherwise, if the string is a base-char string, it works like a binary
> stream with a byte size of 8 bits. This means it can be used for
> converting to and from different external formats by reinterpreting the
> same characters as octets.
>
> - If the string contains extended characters, this fact is ignored and
> the string is interpreted as if it contained just 8-bit characters for
> external encodings. This means that now you can recode strings and
> ignore whether the string was stored in a Unicode or a base-char string.

I guess that I could use this functionality to "truncate" characters, but this was already possible without too much trouble in user code, and I don't really need it.
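Coming back to the two directions I described above, here is a minimal sketch of the usage I have in mind. I am assuming the constructors end up being EXT:MAKE-SEQUENCE-INPUT-STREAM and EXT:MAKE-SEQUENCE-OUTPUT-STREAM with an :EXTERNAL-FORMAT keyword, as in the sequence-streams work under discussion; I haven't been able to run this against your latest tree yet, so treat the names and exact behaviour as my assumptions:

;; Decoding: UTF-8 octets -> Lisp string, READ-CHAR driven by the
;; :UTF-8 external format (constructor name/signature assumed).
(let* ((octets (make-array 4 :element-type '(unsigned-byte 8)
                             :initial-contents '(#xC3 #xA9 #x61 #x62)))
       (in (ext:make-sequence-input-stream octets :external-format :utf-8)))
  (with-output-to-string (out)
    (loop for ch = (read-char in nil nil)
          while ch
          do (write-char ch out))))
;; expected => a 3-character string: U+00E9, #\a, #\b
;; (#xC3 #xA9 is the UTF-8 encoding of U+00E9)

;; Encoding: Lisp string -> UTF-8 octets, WRITE-CHAR/WRITE-STRING driven
;; by the :UTF-8 external format; this also assumes the output stream
;; extends an adjustable vector with a fill pointer.
(let* ((octets (make-array 0 :element-type '(unsigned-byte 8)
                             :adjustable t :fill-pointer 0))
       (out (ext:make-sequence-output-stream octets :external-format :utf-8)))
  (write-string (string (code-char #xE9)) out)
  (write-string "ab" out)
  octets)
;; expected => #(195 169 97 98)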
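And, for the base-char recoding behaviour just quoted, a similarly hypothetical illustration (again assuming the same constructor and keyword): a base-char string whose 8-bit characters happen to hold raw UTF-8 octets is reinterpreted as UTF-8 text when an external format is supplied.

;; Two 8-bit characters with codes #xC3 #xA9 stored in a base-char
;; string; with :EXTERNAL-FORMAT :UTF-8 the stream should treat those
;; characters as octets and decode them into one extended character.
(let* ((raw (map 'base-string #'code-char '(#xC3 #xA9)))
       (in (ext:make-sequence-input-stream raw :external-format :utf-8)))
  (read-char in))
;; expected => the character U+00E9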
On a tangent, I remember a previous thread where the internal ECL Unicode representation was discussed (that it could perhaps be changed eventually, at least on Windows), and I replied that I already had code relying on the internal representation being host-endian UCS-4. Well, with the new sequence streams I will be able to stop using my own UTF-8 encoder/decoder too, and it won't be necessary to worry about the internal representation anymore, as ECL's native encoders/decoders will no longer be a black box for user code. Previously, the only way to leverage the ECL character encoding/decoding machinery from user code was to use files or sockets; for anything more efficient or more complex I had to use custom encoding/decoding code.

Unfortunately, the mmap changes broke the ECL build and I couldn't immediately test your latest changes:

;;; Emitting code for INSTALL-BYTECODES-COMPILER.
;;; Note:
;;;   Invoking external command:
;;;   gcc -I. -I/home/mmondor/work/ecl-git/ecl/build/ -DECL_API -I/home/mmondor/work/ecl-git/ecl/build/c -I/usr/pkg/include -march=i686 -O2 -g -fPIC -Dnetbsd -I/home/mmondor/work/ecl-git/ecl/src/c -O2 -w -c clos/bytecmp.c -o clos/bytecmp.o
;;; Finished compiling EXT:BYTECMP;BYTECMP.LSP.
;;;
Condition of type: SIMPLE-ERROR
EXT::MMAP failed.
Explanation: Invalid argument.
No restarts available.

Top level in: #<process TOP-LEVEL>.

It's possible that MAP_FILE | MAP_SHARED is needed rather than only MAP_SHARED.

Thanks again,
-- 
Matt