simon 01/09/10 03:01:53
Modified: docs strings.pod
Log:
New string documentation.
Revision Changes Path
1.2 +266 -1 parrot/docs/strings.pod
Index: strings.pod
===================================================================
RCS file: /home/perlcvs/parrot/docs/strings.pod,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -w -r1.1 -r1.2
--- strings.pod 2001/09/03 17:26:52 1.1
+++ strings.pod 2001/09/10 10:01:52 1.2
@@ -22,9 +22,84 @@
The most basic way of creating a string is through the function
C<string_make>:
- STRING* string_make(char *buffer, IV buflen, IV encoding, IV flags, IV type)
+ STRING* string_make(void *buffer, IV buflen, IV encoding, IV flags, IV type)
+Here you pass a pointer to a buffer in a given encoding, the number of
+bytes in that buffer to examine, the encoding (see below for the
+C<enum> which defines the different encodings), and the initial values
+of the C<flags> and C<type> fields. These should usually be zero.
+In return, you'll get a brand new Parrot string. This string will
+have its own private copy of the buffer, so you don't need to keep it.
+=over 3
+
+=item *
+
+I<Hint>: Nothing stops you doing
+
+ string_make(NULL, 0, ...
+
+=back
+
+If you already have a string, you can make a copy of it by calling
+
+ STRING* string_copy(STRING* s)
+
+This is itself implemented in terms of C<string_make>.
+
+When a string is done with, it can be destroyed using the destroyer
+
+ void string_destroy(STRING *s)
+
+=head2 String Manipulation Functions
+
+Unless otherwise stated, all lengths, offsets, and so on, are given in
+characters; you are not allowed to care about the byte representation of
+a string, so it doesn't make sense to give the values in bytes.
+
+To find out the length of a string, use
+
+ IV string_length(STRING *s)
+
+You I<may> explicitly use C<< s->strlen >> for this since it is such a
+useful operation.
+
+To concatenate two strings - that is, to add the contents of string
+C<b> to the end of string C<a>, use:
+
+ STRING* string_concat(STRING* a, STRING *b, IV flag)
+
+C<a> is updated, and is also returned as a convenience. If the flag is
+set to a non-zero value, then C<b> will be transcoded to C<a>'s encoding
+before concatenation if the strings are of different encodings. You
+almost certainly don't want to stick, say, a UTF-32 string on the end of
+a Big-5 string.
+
+Chopping C<n> characters off the end of a string is achieved with the
+unlikely-sounding
+
+ STRING* string_chopn(STRING* s, IV n)
+
+B<Not implemented>:
+To retrieve a substring of the string, call
+
+ STRING* string_substr(STRING* src, IV offset, IV length, STRING** dest)
+
+The result will be placed in C<dest>.
+(Passing in C<dest> avoids allocating a new string at runtime. If
+C<*dest> is a null pointer, a new string structure is created with the
+same encoding as C<src>.)
+
+B<Not implemented>:
+To format output into a string, use
+
+ STRING* string_nprintf(STRING* dest, IV len, char* format, ...)
+
+C<dest> may be a null pointer, in which case a new B<native> string will
+be created. If C<len> is zero, the behaviour becomes more C<sprintf>ish
+than C<snprintf>-like.
+
+
=head1 Elements of the C<STRING> structure
Those implementing the C<STRING> API will obviously need to know about
@@ -46,16 +121,41 @@
=head2 C<bufstart>
+This pointer points to the buffer which holds the string, encoded in
+the string's specified encoding. Because of this, you should not make
+any assumptions about what's in the buffer, and hence you shouldn't
+try to access it directly.
+
=head2 C<buflen>
+This is used for memory allocation; it tells you the currently allocated
+size of the buffer in bytes.
+
=head2 C<bufused>
+C<bufused>, on the other hand, contains the number of bytes out of the
+allocated buffer which are actually in use. This, together with
+C<buflen>, is used by the buffer growing algorithm to determine when and
+by how much to grow the allocation buffer.
+
=head2 C<flags>
+This is a general holding area for string flags. The exact flags
+required have not yet been determined.
+
=head2 C<strlen>
+This is the length of the string in characters, as you would expect to
+find from C<length $string> in Perl. Again, because string buffers may
+be in one of a number of encodings, this must be computed by the
+appropriate encoding function. C<string_compute_strlen(STRING)> updates
+this value, calling the C<compute_strlen> function in the STRING's
+vtable.
+
=head2 C<encoding>
+This specifies the encoding of the buffer, from the following C<enum>:
+
enum {
enc_native,
enc_utf8,
@@ -65,7 +165,172 @@
enc_max
};
+The "native" string type is whatever happens when you set C<LANG=C> in
+your shell; on English-speaking machines it's usually ISO-8859-1.
+A character equals a byte equals eight bits. No shifts, no wide
+characters, nothing.
+
+UTF8, UTF16, and UTF32 are what they sound like. UTF16 and UTF32 should
+use the native endianness of the machine.
+
+C<enc_foreign> is there to allow for expansion; foreign strings will
+call functions from a user-defined string vtable instead of the Perl
+built-in ones.
+
+C<enc_max> isn't an encoding. These aren't the droids you're looking for.
+It's just there to help know how big to make arrays.
+
=head2 C<type>
+XXX I don't know what this is for.
+
=head2 C<unused>
+
+This field is, as its name suggests, unused; however, it can be used to
+hold a pointer to the correct vtable for foreign strings.
+
+=head1 String Vtable Functions
+
+The L</String Manipulation Functions> above are implemented in terms of
+string vtables to create encoding abstraction; here's an example of one:
+
+ STRING*
+ string_concat(STRING* a, STRING* b, IV flags) {
+ return (ENC_VTABLE(a).concat)(a, b, flags);
+ }
+
+C<ENC_VTABLE(a)> is shorthand for:
+
+ Parrot_string_vtable[a->encoding]
+
+The C<Parrot_string_vtable> is a static array of virtual tables, defined
+in C<string.c>. Each encoding has its own vtable; to call the
+concatenation function for C<a>, we look up its encoding and retrieve
+the C<concat> entry from that encoding's vtable. This produces a
+function pointer we can throw the arguments at.
+
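The dispatch described above can be sketched in miniature. This is only an illustration, not the code in string.c: the SKETCH_STRING type, the sketch_* names, and the choice of a length function are all assumptions made for the example, and the structure holds only the fields the sketch needs.

```c
#include <stddef.h>

typedef long IV;

enum { enc_native, enc_utf8, enc_max };

/* Hypothetical, cut-down stand-in for STRING. */
typedef struct {
    const unsigned char *bufstart; /* encoded bytes */
    IV bufused;                    /* bytes in use */
    IV encoding;                   /* index into the vtable array */
} SKETCH_STRING;

typedef IV (*sketch_length_t)(const SKETCH_STRING *s);

/* One vtable per encoding; a real vtable would hold many entries. */
typedef struct {
    sketch_length_t length;
} sketch_vtable;

/* Native: one byte per character. */
IV sketch_native_length(const SKETCH_STRING *s) {
    return s->bufused;
}

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
IV sketch_utf8_length(const SKETCH_STRING *s) {
    IV i, chars = 0;
    for (i = 0; i < s->bufused; i++) {
        if ((s->bufstart[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* Static array of vtables, indexed by encoding. */
static const sketch_vtable sketch_string_vtable[enc_max] = {
    { sketch_native_length }, /* enc_native */
    { sketch_utf8_length }    /* enc_utf8 */
};

#define SKETCH_ENC_VTABLE(s) sketch_string_vtable[(s)->encoding]

/* Public wrapper: look up the encoding, then call through the vtable. */
IV sketch_string_length(const SKETCH_STRING *s) {
    return (SKETCH_ENC_VTABLE(s).length)(s);
}
```

The wrapper never inspects the buffer itself, so the same three bytes can yield different lengths depending only on the C<encoding> field.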
+Most of the string vtable functions are self-explanatory as they are
+thin wrappers around the functions given above. Some of them, however,
+are for internal use only, to help implement other functions. You'll
+find them in the next section.
+
+=head2 How to add new vtable functions
+
+The first thing to note is that if what you're doing isn't remotely
+encoding-specific, you don't need to add a vtable function; you can
+just add a function in F<string.c> (don't forget to add the function
+prototype to F<string.h>) and you don't need any more of this section.
+However, most things that people do with strings depend on the encoding
+of the string data, so if you need to add anything slightly complex,
+read on.
+
+Currently, the construction of the vtables is not automated; it's hoped
+that soon someone will automate this and fix this section. However, for
+the time being, this is what you need to do when you implement a new
+vtable function:
+
+=over 3
+
+=item 1
+
+Check to see whether or not the function's type has a typedef in
+F<string.h>: for instance, if you have a function that takes a string
+and an C<IV> and returns a string, use C<string_iv_to_string_t>;
+otherwise, add your own type.
+
+=item 2
+
+Add the unqualified name of the function (C<frobnicate>), together with
+your type, to C<string_vtable> in F<string.h>.
+
+=item 3
+
+Create a function C<string_frobnicate> in C<string.c> which is a wrapper
+around C<frobnicate>. This function B<must> take a C<STRING*> parameter,
+so that the encoding can be extracted and the relevant encoding vtable
+be found and despatched. It should look something like this:
+
+ yadda
+ string_frobnicate(STRING *s, ...) {
+ return (ENC_VTABLE(s).frobnicate)(s, ...);
+ }
+
+=item 4
+
+Create functions C<string_XXX_frobnicate> for all values of C<XXX> in
+the encoding table (or better still, get other people to write them for
+you). C<string_native_frobnicate> should go in F<strnative.c>,
+C<string_utf8_frobnicate> should go in F<strutf8.c>, and so on.
+
+=item 5
+
+Add C<string_XXX_frobnicate> to the end of each vtable returned by
+C<string_XXX_vtable>.
+
+=back
+
+=head1 Non-user-visible String Manipulation Functions
+
+If you've read this far, I hope you're a Parrot implementor. If you're
+not helping construct the Parrot core itself, you probably want to look
+away now.
+
+The first two functions to note are
+
+ IV string_compute_strlen(STRING* s)
+
+and
+
+ IV string_max_bytes(STRING *s, IV iv)
+
+The first updates the contents of C<< s->strlen >> by contemplating the
+buffer C<bufstart> and working out how many characters it contains. The
+second is given a number of characters which we assume are going to be
+added into the string at some point; it returns the maximum number of
+bytes that need to be allocated to admit that number of characters. For
+fixed-width encodings, this is trivial - the "native" encoding, for
+instance, encodes one byte per character, so C<string_native_max_bytes>
+simply returns the C<IV> it is passed; C<string_utf8_max_bytes>, on the
+other hand, returns three times the value that it is passed because a
+UTF8 character may occupy up to three bytes.
+
+To grow a string to a specified size, use
+
+ void string_grow(STRING *s, IV newsize)
+
+The size is given in characters; C<string_max_bytes> is called to turn
+this into a size in bytes, and then the buffer is grown to accommodate
+(at least) that many bytes.
+
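The relationship between the character count, the per-encoding byte bound, and the allocation can be sketched as follows. This is a hedged illustration under assumptions: SKETCH_BUFFER and the sketch_* functions are invented for the example, the byte bound is passed as a function pointer rather than looked up in a vtable, and the three-bytes-per-character figure is simply the one quoted in the text above.

```c
#include <stdlib.h>

typedef long IV;

/* Hypothetical stand-in holding only the allocation fields. */
typedef struct {
    void *bufstart;
    IV buflen;   /* bytes allocated */
    IV bufused;  /* bytes in use */
} SKETCH_BUFFER;

/* Native: one byte per character. */
IV sketch_native_max_bytes(IV nchars) { return nchars; }

/* UTF-8: worst case of three bytes per character, as stated in the text. */
IV sketch_utf8_max_bytes(IV nchars) { return nchars * 3; }

/*
 * Grow the buffer so it can hold newsize characters.
 * Returns 0 on success, -1 if the allocation fails.
 */
int sketch_grow(SKETCH_BUFFER *s, IV newsize, IV (*max_bytes)(IV)) {
    IV needed = max_bytes(newsize);
    void *p;

    if (needed <= s->buflen)
        return 0;               /* already big enough */

    p = realloc(s->bufstart, (size_t)needed);
    if (p == NULL)
        return -1;              /* original buffer is left untouched */

    s->bufstart = p;
    s->buflen = needed;
    return 0;
}
```

Asking for four characters through the UTF-8 bound reserves twelve bytes, while a later request for ten native characters leaves the allocation alone because twelve bytes already suffice.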
+=head1 Transcoding
+
+The fact that Parrot strings are encoding-abstracted really has to
+bottom out at some point, and it's usually when two strings of different
+encodings interact. When we try to append one type of string to another,
+we have the option of turning the latter string into a string that
+matches the first string's encoding. This process, translating a string
+from one encoding into another, is called "transcoding".
+
+In Parrot, transcoding is implemented by the two-dimensional array
+
+ Parrot_transcode_table[enc_from][enc_to]
+
+Each entry in this table is a function pointer which takes two
+parameters:
+
+ string_utf32_to_utf8(STRING* from, STRING* to)
+
+(If C<to> is a null pointer, a new C<STRING*> will be allocated. As
+before, it's all about avoiding memory allocation at runtime.)
+
+A null pointer in the table should signify that no transcoding is
+necessary; C<Parrot_transcode_table[x][x]> should always be C<NULL>.
+
+C<Parrot_transcode_table[enc_native][enc_utf8]> isn't C<NULL>. Don't
+fall for that, because "native" doesn't necessarily mean ISO-8859-1.
+
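A two-encoding version of such a table might look like the sketch below. Everything named sketch_* is hypothetical, the converters only relabel the destination instead of converting bytes, and, unlike the behaviour described above, the caller must always supply C<to>; the point is only the table lookup and the NULL diagonal.

```c
#include <stddef.h>

enum { enc_native, enc_utf8, enc_max };

/* Hypothetical stand-in carrying only the encoding tag. */
typedef struct {
    int encoding;
} SKETCH_STRING;

typedef SKETCH_STRING *(*sketch_transcode_t)(SKETCH_STRING *from,
                                             SKETCH_STRING *to);

/* Placeholder converters: a real one would rewrite the buffer too. */
SKETCH_STRING *sketch_native_to_utf8(SKETCH_STRING *from, SKETCH_STRING *to) {
    (void)from;
    to->encoding = enc_utf8;
    return to;
}

SKETCH_STRING *sketch_utf8_to_native(SKETCH_STRING *from, SKETCH_STRING *to) {
    (void)from;
    to->encoding = enc_native;
    return to;
}

/* table[enc_from][enc_to]; the diagonal is NULL: nothing to do. */
static const sketch_transcode_t sketch_transcode_table[enc_max][enc_max] = {
    /* from enc_native */ { NULL,                  sketch_native_to_utf8 },
    /* from enc_utf8   */ { sketch_utf8_to_native, NULL                  }
};

/* Look up the converter; a NULL entry means no transcoding is needed. */
SKETCH_STRING *sketch_transcode(SKETCH_STRING *from, SKETCH_STRING *to,
                                int enc_to) {
    sketch_transcode_t fn = sketch_transcode_table[from->encoding][enc_to];
    if (fn == NULL)
        return from;
    return fn(from, to);
}
```

Note that the native-to-UTF-8 slot is filled in even in this toy table, which matches the warning above that the two encodings must not be treated as interchangeable.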
+=head2 Foreign Encodings
+
+Fill this in later; if anyone wants to implement new encodings at this
+stage they must be mad.