Re: [bitc-dev] Unicode and bitc

William Leslie Thu, 14 Oct 2010 00:04:19 -0700

On 14 October 2010 16:28, Ben Kloosterman <[email protected]> wrote:
>
>  >If I really want an indexed UTF-8 string, I say that, and if I
>  >want a rope, I also have to tell you that.
>
> I do favour pushing this to the user when it's necessary, but shouldn’t the
> default be reasonably fast and memory efficient? Especially when most of
> those strings are just storage and not being used / worked on.

I'm not really sure how true that is. I can only speak from my own
experiences, which are admittedly pretty limited. Most string data I
see are:

0. Attributes of domain objects for display, or the results of
applying simple functions to them. People's names, for example. I
wouldn't have snarfed these out of the database if I weren't using
them.
1. XML messages (say, in a JMX environment).
2. HTML / XML templates.
3. Symbols.

These things seem either to be processed in a stream-like manner
(templates) or to be manipulated and searched frequently.

The only case where tighter storage space would be useful to me is in
the symbol case. I haven't seen recent GC dumps from our translation
toolchain, but IIRC there were a lot of strings in use as symbols.
Within a parser itself it probably isn't important - you generate
objects from the source and immediately throw the source away - but
small strings float around in various metadata, such as debugging
information, and method descriptors.

> Also I don’t see BitC growing past a single runtime for a LONG time... :-)
> I'm not sure it is meant to be the antithesis, as I kind of like the smart
> runtimes and the "drop the file and it always runs" approach rather than
> spending ages hacking / configuring and trying to compile something.

I mean smart in a semantic way, in that the runtime will try to do a
lot of clever optimisation. BitC's intention seems to be 'do what I
say, not what I mean', so that you can reason about what happens near
the hardware by looking at the source. Selecting different string
implementations depending on the construction of the string is
something I'd classify as clever. I do like it, but I am not convinced
it should be the default for BitC (yet).

> Certainly not on the string itself, but in the standard lib you could use a
> 64-bit value, with the high bit indicating whether it's a byte index or a
> char index; when setting a char index, immediately convert it to a byte
> index. It wouldn’t be that expensive, since it only applies to indexes on
> strings that are actively being worked on. It would just allow most users
> to use char indexes and, when needed, the real byte indexes. I'm not really
> sold on this idea... it is probably easier to just say they are byte
> indexes.

So you are saying you (who may be the implementer of the string
module) could convert to a specific representation where needed in
string-heavy functions? I think we are starting to agree here.
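
To make the tagged-index idea quoted above concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions:
the names CHAR_FLAG, VALUE_MASK, char_index, byte_index and
to_byte_offset are hypothetical, nothing here is BitC or an existing
library API, and resolving a char index by a linear scan is just one
possible strategy.

```python
# Hypothetical layout: bit 63 is the tag, the low 63 bits are the index.
CHAR_FLAG = 1 << 63          # tag set: the value is a character index
VALUE_MASK = CHAR_FLAG - 1   # low 63 bits hold the index itself


def char_index(n: int) -> int:
    """Build a tagged index that counts code points."""
    return CHAR_FLAG | n


def byte_index(n: int) -> int:
    """Build a tagged index that counts bytes (tag bit clear)."""
    return n & VALUE_MASK


def to_byte_offset(utf8: bytes, index: int) -> int:
    """Resolve a tagged index to a byte offset into UTF-8 data."""
    value = index & VALUE_MASK
    if (index & CHAR_FLAG) == 0:
        return value  # already a byte index: no work needed
    # Character index: scan forward, counting bytes that start a code
    # point (anything that is not a 10xxxxxx continuation byte).
    seen = 0
    for offset, byte in enumerate(utf8):
        if (byte & 0xC0) != 0x80:
            if seen == value:
                return offset
            seen += 1
    if seen == value:
        return len(utf8)  # one past the last code point
    raise IndexError("character index out of range")
```

A byte index passes through untouched, while a char index costs a scan
the first time it is resolved, which is the "immediately convert"
step in the quoted proposal.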

>  >On 14 October 2010 13:25, Ben Kloosterman <[email protected]> wrote:
>  >> As 32 bit chars is clearly unacceptable
>  >
>  >For some things. I don't think UCS-4 is a particularly bad choice for
>  >an in-memory encoding, it's the one I'd choose, and I'm not even from
>  >California.
>
> :-)
> I think the memory overhead is too bad in business application and DB
> land... but it is a good candidate for the worked-on fixedchararrays. Note
> I'm not viewing it from the point of view of Unix apps reading the data and
> then finishing, but more from memory-hungry business apps, app servers, web
> servers, SOA servers, DBs etc. where the strings stay in memory for a long
> time while not used.

If it is a problem, when it is a problem, when the app happens to be
written in BitC, it would be nice to change the representation as
needed. I think the jury is still out on whether UTF-8 is a nice
default, though. It certainly makes life more difficult for people
who want to implement VMs in BitC: most of those VMs assume O(1)
indexing of their native string type, and having to always convert
in order to use native string functions would be awkward.
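
The trade-off between the two encodings can be sketched in a few lines
of Python. This is only an illustration: encoded_sizes,
code_point_at_utf32 and code_point_at_utf8 are hypothetical helper
names, and Python's "utf-32-le" codec stands in for a UCS-4 in-memory
representation.

```python
def encoded_sizes(text: str):
    """Return (UTF-8 size, UTF-32 size) in bytes for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))


def code_point_at_utf32(buf: bytes, n: int) -> str:
    """O(1): every code point occupies exactly four bytes."""
    return buf[4 * n : 4 * n + 4].decode("utf-32-le")


def code_point_at_utf8(buf: bytes, n: int) -> str:
    """O(length): code points vary in width, so the data must be
    decoded (or at least scanned) from the start to find the n-th."""
    return buf.decode("utf-8")[n]
```

For pure ASCII text the fixed-width form is four times larger, which
is the memory cost under discussion; in exchange, indexing needs no
scan.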

Do consider that most business applications are already written in
Java or C#, and that HotSpot, for example, already incurs extra
overhead to store extents on the string and room for a lock for
synchronisation. People seem not to care a whole lot about size
increases of a small factor in exchange for extra functionality. It's
far cheaper to buy another server than to write C.

-- 
William Leslie

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
