Re: [DRAFT PPD] External Data Interfaces

Nicholas Clark Sun, 18 Aug 2002 14:59:31 -0700

On Sat, Aug 17, 2002 at 03:58:32PM -0700, Brent Dax wrote:

> =head2 Strings
> 
> Parrot-level C<String>s are to be represented by the type
> C<Parrot_String>.  This type is defined to be a pointer to a C<struct
> parrot_string_t>.
> 
> The functions for creating and manipulating C<Parrot_String>s are listed
> below.


Is it worth adding a reminder in here that, as Parrot is garbage collected,
there is no confusion about who owns pointers to blah?

> =item C<Parrot_String Parrot_string_new(Parrot_Interp, char* bytes,
> Parrot_Int len, Parrot_String enc)>
> 
> Allocates a Parrot_String and sets it to the first C<len> bytes of
> C<bytes>.  C<enc> is the name of the encoding to use (e.g. "ASCII",
> "UTF-8", "Shift-JIS"); if a case-insensitive match of this name doesn't
> result in an encoding name that Parrot knows about, or if NULL is passed
> as the encoding, the platform's default encoding is assumed.[1]  Values
> of NULL and 0 can be passed in for C<bytes> and C<len> if the user
> desires an empty string.

Should that char * be const char *?

> Note that it is rarely a good idea to not specify the encoding if you're
> using C<bytes> and C<len>.

I'm a native English speaker and I'm finding that double negative hard to
work out. Is there a clearer way to phrase it?

> =item C<Parrot_String Parrot_string_copy(Parrot_Interp, Parrot_String
> dest, Parrot_String src)>
> 
> Sets C<lhs> to C<rhs> and returns C<dest>.  If C<dest> is NULL, a new
> Parrot_String is allocated, operated on and returned.  If C<dest> and
> C<src> are the same, this is a noop.  This may or may not be a
> copy-on-write set; the embedder should not care.

"This might be a copy-on-write set" ...

And do we need an RFC-like definition of should/may/must/mustn't?

In which case, surely that should read "the embedder must not care"?

> B<XXX> Is this a good policy?
> 
> =item C<Parrot_String Parrot_string_copy_bytes(Parrot_Interp,
> Parrot_String dest, char* bytes, Parrot_Int len, char* enc)>
> 
> Sets C<dest> to the first C<len> bytes of C<bytes> and returns C<dest>.
> C<enc> is taken to be the encoding of C<bytes>; the Parrot_String will
> retain its original encoding.  (Call C<Parrot_string_transcode> on the
> Parrot_String first if you want to retain C<enc>.)

Again, should that be const char *bytes?

> =item C<void Parrot_string_transcode(Parrot_Interp, Parrot_String str,
> Parrot_String enc)>
> 
> Transcode C<str> to C<enc>.  If C<enc> isn't recognized as a valid
> encoding name by a case-insensitive match, or if it is NULL, the default
> encoding is used.

Encodings are specified in parrot strings (not char *) yet you state
that it's case insensitive. Is case insensitivity well defined on an encoding
basis, or is it actually dependent on the language level?
[eg one might argue that in English þ and Þ aren't the same, but if the
string is in ISO-8859-1 then Parrot isn't going to know whether the name
was specified in English, German or Icelandic. I chose þ because I don't
think there are any foreign words adopted into English spelled with thorn.
Whereas I'd not be surprised if most other accented letters are used in
some or other word]

Independent of that, aren't we opening ourselves up to a big performance
hit by doing case insensitive matching on arbitrary encodings (such as
Unicode)? Which normal form were we going to do it in?
And if the canonical name is defined in (say) ISO 8859-1 but their string is
in Unicode, are we going to convert before deciding whether it is the same?
And if they're in Shift-JIS but we're supplying it in ISO-8859-2 - that's
2 conversions?

It seems faster having names as US-ASCII and being case insensitive, or having
names case sensitive.
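
To make the first option concrete, here is a rough sketch of what matching
could look like if names were restricted to US-ASCII. The function names are
made up for illustration and are not part of the draft API, and it assumes
names arrive as NUL-terminated char * in an ASCII-compatible character set:

```c
#include <assert.h>
#include <stddef.h>

/* Fold one ASCII upper-case letter to lower case; every other byte is
 * returned unchanged, so no locale or Unicode tables are consulted.
 * Assumes an ASCII-compatible execution character set. */
unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: return 1 if two NUL-terminated encoding names are
 * equal when only ASCII case is ignored, 0 otherwise. */
int encoding_name_eq(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (ascii_fold((unsigned char)*a) != ascii_fold((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both names ended at the same position. */
    return *a == *b;
}
```

That is one pass over the bytes with no transcoding and no normal form to
choose, which is the whole attraction.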

> =item C<Parrot_UInt Parrot_string_length(Parrot_Interp, Parrot_String
> str)>
> 
> Returns the length of C<str> in characters.  Note that this is
> "characters", not "bytes"; the string's encoding defines what
> "character" means.

Should you be clear what happens with combining characters?
If so, that's "characters", not "bytes" or "glyphs", isn't it?
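
To illustrate why the distinction matters, here is a small sketch that counts
code points in UTF-8. It is not a claim about how Parrot_string_length will
work; the function name is invented, and it assumes the input is well-formed
UTF-8:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a well-formed UTF-8 buffer.
 * Every byte that is not a continuation byte (bit pattern 10xxxxxx)
 * starts a new code point. */
size_t utf8_codepoints(const char *s, size_t nbytes)
{
    size_t n = 0;
    size_t i;

    assert(s != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With this, "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes 65 CC 81) is
3 bytes and 2 code points but draws as 1 glyph, whereas precomposed U+00E9
(bytes C3 A9) is 2 bytes, 1 code point and 1 glyph. The document ought to say
which of those numbers the caller gets back.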

Is there a cross reference to what a Parrot_UInt is?

> =item C<Parrot_String Parrot_string_from_cstr(Parrot_Interp, char*
> cstr)>
> 
> Creates a Parrot_String from the given C string.  Assumes the native
> encoding.

const char* ?

> =item C<Parrot_PMC Parrot_pmc_new_vtable(Parrot_Interp, Parrot_VTable
> vtable)>
> 
> Creates a new Parrot_PMC using C<vtable>.  This can be used for
> "private" PMC types.
>
> B<XXX> Is this a good idea or not?

Singletons are considered useful in some languages, aren't they?
Without this, would it be hard to efficiently create singletons?

> =item C<void *Parrot_alloc(Parrot_UInt size)>
> 
> Calls the system C<malloc()> with C<size>.

Are you sure you want to set that in stone? How about "Calls the system
malloc or equivalent"?
IIRC on Win32 perl5 supplies a malloc that tracks which (i)thread allocates
memory, and frees all memory on ithread exit. And perl5 comes with its own
malloc, which it often likes to use on *nix.
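
A rough sketch of what "or equivalent" could mean in practice: route the call
through a replaceable function pointer that defaults to the system malloc.
Every name below (example_alloc and friends) is invented for illustration and
is not taken from the draft or from the Parrot source:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t);
typedef void  (*free_fn)(void *);

/* Current allocator pair; defaults to the system malloc/free. */
static alloc_fn current_alloc = malloc;
static free_fn  current_free  = free;

/* Number of allocations made through the counting allocator below. */
size_t example_alloc_calls = 0;

/* Install a different allocator, e.g. one that tracks the owning thread. */
void example_set_allocator(alloc_fn a, free_fn f)
{
    assert(a != NULL && f != NULL);
    current_alloc = a;
    current_free  = f;
}

/* What the embedder calls; it never learns which allocator is behind it. */
void *example_alloc(size_t size) { return current_alloc(size); }
void  example_free(void *p)      { current_free(p); }

/* A trivial replacement allocator that counts calls, then defers to malloc. */
void *example_counting_malloc(size_t size)
{
    example_alloc_calls++;
    return malloc(size);
}
```

If the document only promises "malloc or equivalent", an embedder's code keeps
working whichever allocator ends up installed, provided memory is always
released through the matching free function.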

Nicholas Clark
-- 
Even better than the real thing:        http://nms-cgi.sourceforge.net/
