In my work on Ugarit (and, hence, lzma) I've been shifting a lot of arbitrary binary data about. I plumped for SRFI-4 u8vectors for Ugarit, purely because at one point I want to prefix them with a compression-algorithm byte and then strip that byte off again elsewhere, but in my heart of hearts I feel I should have used blobs, for I have no idea how the data I find in random files being backed up is structured.
There's a long tradition of considering arbitrary binary data as a byte array, but that's slightly overspecifying it, in my opinion. Just because any bit of memory *can* be interpreted as a byte array doesn't mean it *is* a byte array, any more than the fact that any file in the filesystem *can* be edited in vi means that it makes sense to do so. For a start, as srfi-4 makes clear, that region of memory can be seen as a u8vector or an s8vector...

I think there's a valid semantic distinction between a blob - which is purely a region of storage, which happens to be a multiple of 8 bits in length - and a byte vector.

Which is why I made the lzma egg operate on blobs; lzma:compress and lzma:decompress are just functions of type blob -> blob. I saw that the z3 egg, which does a similar job, chose to use strings; there's been some history of using strings for arbitrary data in Chicken, which I think is wrong - strings imply character-sequence semantics.

So, I propose that people be mindful of this distinction and try and make more use of the blob type. I don't propose breaking existing code; things that operate on arbitrary data can happily accept blob/string/u8vector/s8vector, but I think blob should be the default in people's minds!

Further to this, I am considering throwing together some useful blob tools, to allow more to be done with blobs without needing to copy them so much, and to deal with bigger blobs. This would comprise:

1) Replacements for the core blob functions, which operate on blobs composed of a c-pointer and a size. (make-blob size) would malloc size bytes and construct a blob with a finalizer that called free. Perhaps for blobs below a certain size it'd just allocate them from the nursery, using the normal approach to blobs and thus avoiding registering a finalizer, and all the other blob functions would have a conditional to detect which blob representation was in use. However, ffi code that returns blobs can then easily wrap a malloced pointer returned by a C function, and have the finalizer call free on it; or use a different finalizer if the memory comes from some other kind of memory pool. The flexibility would be there.

2) A wrapper for the mmap stuff in the posix unit, adding a function that returns a blob wrapping the mmapped region, with a reference count; when the last blob goes away, it's un-mmapped.

3) Blob I/O on file descriptors - file-read in the POSIX unit should return a blob by default, not a string! It's too late to change that, so I'd add a file-read-blob, and make file-write accept new-style blobs.

4) Similarly sidling the new blobs into the lolevel unit functions, so they can be move-memory!ed to/from, the pointer extracted, and all that.

5) A new srfi-4, which uses a blob as the underlying storage for every vector. blob->*vector/shared reuses an existing blob, and the non-/shared versions just duplicate the blob then use that. An actual srfi-4 vector would become a record referencing the underlying blob, a starting offset (subject to alignment, of course), and the length of the vector in elements, so any subregion of a blob can be viewed as an SRFI-4 vector; this would mean that sub*vector/shared functions could be created that just made a new vector-record referencing the same blob, but with reduced offset/length fields.

This would mean that there'd be a lot less copying involved in dealing with blobby data. A foreign function that returns a malloced block could have it returned in Chicken as a blob with zero copying, and Chicken code could then happily interpret different parts of it as any of the srfi-4 types by just dropping lightweight shared vector wrappers onto it.

What do people think of this? Would it be welcome in the chicken core once it's proven itself?
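To make points 1 and 5 a little more concrete, here is a rough sketch in C of the sort of representation I have in mind. It is only an illustration under some assumptions: every name in it (blob_wrap, blob_make, u32view and friends) is made up for this mail and is not an existing Chicken API, a plain reference count stands in for the garbage collector deciding when the finalizer runs, and the alignment check assumes the start of the blob is itself suitably aligned (true of malloc and mmap results).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All names here are hypothetical; this is not an existing Chicken API. */

/* A blob: a region of storage plus the knowledge of how to give it back. */
typedef struct blob {
    unsigned char *data;       /* start of the region */
    size_t size;               /* length in bytes */
    void (*release)(void *);   /* finalizer: free, or a pool-specific one */
    size_t refs;               /* stands in for the garbage collector */
} blob;

/* Wrap memory we already have (e.g. from a C library) without copying it. */
blob *blob_wrap(void *data, size_t size, void (*release)(void *)) {
    blob *b = malloc(sizeof *b);
    if (b == NULL) return NULL;
    b->data = data;
    b->size = size;
    b->release = release;
    b->refs = 1;
    return b;
}

/* The (make-blob size) case: malloc the storage, finalize with free. */
blob *blob_make(size_t size) {
    void *data = calloc(1, size ? size : 1);
    if (data == NULL) return NULL;
    blob *b = blob_wrap(data, size, free);
    if (b == NULL) free(data);
    return b;
}

/* Drop one reference; the finalizer runs when the last one goes away. */
void blob_unref(blob *b) {
    if (--b->refs == 0) {
        if (b->release) b->release(b->data);
        free(b);
    }
}

/* A shared u32vector: the blob, a byte offset, and a length in elements. */
typedef struct {
    blob *storage;
    size_t offset;
    size_t length;
} u32view;

/* Returns 0 on success, -1 if the window is misaligned or out of bounds. */
int u32view_init(u32view *v, blob *b, size_t offset, size_t length) {
    if (offset % sizeof(uint32_t) != 0) return -1;
    if (offset > b->size) return -1;
    if (length > (b->size - offset) / sizeof(uint32_t)) return -1;
    v->storage = b;
    v->offset = offset;
    v->length = length;
    b->refs++;                 /* the view keeps the blob alive */
    return 0;
}

/* The sub*vector/shared idea: a narrower window onto the same blob. */
int u32view_sub(u32view *out, const u32view *v, size_t start, size_t length) {
    if (start > v->length || length > v->length - start) return -1;
    return u32view_init(out, v->storage,
                        v->offset + start * sizeof(uint32_t), length);
}

uint32_t u32view_ref(const u32view *v, size_t i) {
    uint32_t x;
    assert(i < v->length);
    /* memcpy sidesteps aliasing questions on storage of unknown origin */
    memcpy(&x, v->storage->data + v->offset + i * sizeof x, sizeof x);
    return x;
}

void u32view_set(u32view *v, size_t i, uint32_t x) {
    assert(i < v->length);
    memcpy(v->storage->data + v->offset + i * sizeof x, &x, sizeof x);
}

/* Forget a view, releasing its hold on the underlying blob. */
void u32view_drop(u32view *v) {
    blob_unref(v->storage);
    v->storage = NULL;
}
```

The thing to notice is that u32view_sub allocates no storage and copies no data: it only fills in a new offset and length against the same blob, so a write through the narrower view is visible through the wider one. A blob returned by blob_wrap with a different release function would cover the "memory from some other pool" case in the same way.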
ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/?author=4
