At 3:36 PM +0100 on 1/21/00, M. Uli Kusterer wrote:
>>Because that's eight bytes of pointer-ish data for a five-byte word.
>>Ouch. And it'd REALLY slow down "the number of words in", which would
>>be spitting out somewhere around 8MB of data for every 5MB in. For
>>certain things it would not be as bad, and for short words it would be
>>even worse.
>
>Anthony,
>
> another approach would be to remember the length of each word and
>space in a list. Every second length would be a space length, and
>calculating the offset would in most cases (i.e. words > 4
>characters) be faster.
Words longer than eight characters would see a small speedup, one that
increases with word length.
This is even worse than the look-up table idea, as you seem to have
found out below. Drop it :)
>Text "This house is just too cool to be pulled down"
>
>would be complemented by:
>4,1,5,1,2,1,4,1,3,1,4,1,2,1,2,1,6,1,4
>
>19 * 4 = 76 bytes at fixed offsets to scan instead of 45 ... ewwwww?!
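(For the record: building that table is the easy part -- something like
the one-pass sketch below does it. What kills the idea is storing and
updating the table, not generating it.)

    /* Build the alternating word/space length table for 'text'.
       Returns the number of entries written to 'lengths'. Sketch only:
       assumes 'lengths' is big enough and spaces are the only breaks. */
    static int BuildLengthTable(const char *text, long textLen, long lengths[])
    {
        int  n = 0;
        long pos = 0, runStart;

        while (pos < textLen) {
            int inSpace = (text[pos] == ' ');
            runStart = pos;
            while (pos < textLen && (text[pos] == ' ') == inSpace)
                pos++;
            lengths[n++] = pos - runStart;   /* one run per entry */
        }
        return n;
    }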
>
>You know, sometimes English is a really inconvenient language ... Of
>course, we could use shorts instead, which would be 36 bytes (I doubt
>anyone has a 64k word in a text, right?
Don't bet on it. What happens with binary data? What happens with data
using space as an itemDelimiter for really large records?
Likely? No. But you're barking up the wrong tree.
>Of course, we'd introduce a
>limit...), which would already be a gain. Or we could just store the
>start offset of each word.
Half the data of your original plan, but still no good.
>
> I guess we'll have to heed Stroustrup's experience here ... create
>applications that test what is the most effective and use that.
That would be a good plan. And I think whatever I do will be better,
whenever I happen to have enough time to do it <veg>.
>
>>I would, of course, keep track of the last item. And 10 or so inbetween
>>-- though it is rather hard to do without knowing the last item; I'd
>>have to guess based on the byte number.
>
> Do you want to globally keep track of the last item (how?) or on a
>per-string/per-variable basis?
When I interpret a 'the number of [whatever] of container', I'd keep
track of, for that container:
- the item number and offset to the last item
- the item number and offset to some of the intervening items
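Concretely, that per-container cache could be as small as this. Just a
sketch -- the names are invented, nothing like this exists yet:

    #define CHUNK_ANCHORS 10

    typedef struct ChunkAnchor {
        long index;    /* which item/word this anchor marks (1-based) */
        long offset;   /* byte offset of that item/word in the text */
    } ChunkAnchor;

    typedef struct ChunkCache {
        ChunkAnchor last;                /* from "the number of ..." */
        ChunkAnchor mid[CHUNK_ANCHORS];  /* a few intervening items */
        int         midCount;            /* how many of mid[] are set */
    } ChunkCache;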
>
>>I'd count from the best known position, which will seldom be the
>>beginning. If I know word 576 begins at x, and I want word 573, I'll
>>start at x and count backwards.
>>
>>I'll also use those 10 or so that I found when counting the words if I
>>have them. Once again: Pick the closest point and count from there.
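To make that concrete: with the ChunkCache sketched above, the lookup
would go roughly like so (untested; single spaces assumed, no bounds
checking, and picking the closest anchor is left for later):

    /* Find the byte offset of word 'target' (1-based), counting from a
       known anchor instead of from the beginning of the text. */
    static long FindWordOffset(const char *text, long textLen,
                               ChunkAnchor from, long target)
    {
        long idx = from.index;
        long pos = from.offset;

        while (idx < target) {          /* count forward over breaks */
            while (pos < textLen && text[pos] != ' ') pos++;
            while (pos < textLen && text[pos] == ' ') pos++;
            idx++;
        }
        while (idx > target) {          /* or backward, e.g. 576 -> 573 */
            while (pos > 0 && text[pos - 1] == ' ') pos--;
            while (pos > 0 && text[pos - 1] != ' ') pos--;
            idx--;
        }
        return pos;                     /* first character of the word */
    }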
>
> Right, why didn't I think of that? Surely it would be an advantage
>that should save scanning a good chunk of bytes.
>
>>3 fields -- 12 bytes -- is not that bad. It's not like a variable is
>>exactly that small.
>
> The trouble is that these fields are only used in a small subset
>of variables, and there might be hundreds of variables in a script.
>That would mean 12K more memory per script. It quickly adds up.
A script with 1024 variables?! What are you up to?! And even with 1K of
variables, 12K is not that bad -- considering how much data you'd have
in them!
>
>>Why not? As soon as I get some time to pull off some more magic, they will.
>
> How do you intend to optimize nested chunks this way? I don't see
>how I'd have Joker cache a chunk's last value ... I could probably do
>it for loops,
Good start -- loops are where it really matters.
>by having the code cache it, but if I want to keep the
>last read chunk's offset with its variable so even several accesses
>in a row (but not in a loop) are optimized ... My instincts tell me
>it's possible, but it'll take some time until my brain realizes how.
I haven't implemented it, either.
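For what it's worth, the shape I have in mind is to hang the cache off
the variable's value and throw it away on every write. Again a sketch
with invented names (NearestAnchor just picks the closest cached
anchor), building on the structs and FindWordOffset above:

    #include <stdlib.h>
    #include <string.h>

    typedef struct Value {
        char       *text;        /* the string data */
        long        length;
        ChunkCache  cache;       /* anchors left over from earlier reads */
        int         cacheValid;  /* cleared by every write to the value */
    } Value;

    static void SetValueText(Value *v, const char *text, long length)
    {
        free(v->text);
        v->text = malloc(length);
        memcpy(v->text, text, length);
        v->length = length;
        v->cacheValid = 0;       /* cached offsets are now meaningless */
    }

    /* Pick the cached anchor closest to the word we're after. */
    static ChunkAnchor NearestAnchor(const ChunkCache *c, long target)
    {
        ChunkAnchor best = { 1, 0 };    /* word 1 at offset 0 */
        long        dist = target - 1, d;
        int         i;

        d = labs(c->last.index - target);
        if (d < dist) { best = c->last; dist = d; }
        for (i = 0; i < c->midCount; i++) {
            d = labs(c->mid[i].index - target);
            if (d < dist) { best = c->mid[i]; dist = d; }
        }
        return best;
    }

    static long GetWordOffset(Value *v, long target)
    {
        ChunkAnchor start = { 1, 0 };   /* fall back to the beginning */
        if (v->cacheValid)
            start = NearestAnchor(&v->cache, target);
        return FindWordOffset(v->text, v->length, start, target);
    }

That way even back-to-back accesses outside a loop hit the cache, since
the anchors live with the variable rather than with the loop.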
>>Might also want to implement some LRU caching for those silly nonlinear
>>accesses :)
>
> LRU ???
Least-Recently Used. It refers to the way in which data is expelled
from the cache (the least-recently used item is exorcised).
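Applied to the anchors above, it's dead simple: stamp each slot with a
use counter and recycle the stalest one when you need room. Sketch:

    static unsigned long gClock = 0;

    typedef struct LRUAnchor {
        ChunkAnchor   anchor;
        unsigned long lastUsed;   /* global counter at last access */
    } LRUAnchor;

    /* Remember a new anchor, expelling the least-recently used slot. */
    static void RememberAnchor(LRUAnchor slots[], int count, ChunkAnchor a)
    {
        int i, victim = 0;
        for (i = 1; i < count; i++)
            if (slots[i].lastUsed < slots[victim].lastUsed)
                victim = i;
        slots[victim].anchor   = a;
        slots[victim].lastUsed = ++gClock;
    }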