On Fri, 18 May 2012 03:52:51 -0400, Mehrdad <wfunct...@hotmail.com> wrote:
> On Thursday, 17 May 2012 at 14:02:09 UTC, Steven Schveighoffer wrote:
>> 2. I realized, buffering input stream of type T is actually an input
>> range of type T[].
> The trouble is, why a slice? Why not an std.array.Array? Why not some
> other data source?
> (Chicken/egg problem....)
Well, because that's what I/O buffers are :) There isn't an OS primitive
that reads a file descriptor into, say, a linked list. Anything other
than a slice would have to go through a translation step.

I don't know what std.array.Array is.
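
That contiguous-buffer shape is visible directly in std.stdio: rawRead fills a caller-supplied buffer via the OS read primitive and returns the slice of it that was actually filled. A minimal sketch (assuming a file named "data.txt" exists):

```d
import std.stdio;

void main()
{
    auto f = File("data.txt");
    ubyte[4096] buffer;                  // one contiguous I/O buffer
    // rawRead fills the buffer from the file descriptor and returns
    // the slice of it that was actually filled.
    ubyte[] chunk = f.rawRead(buffer[]);
    writeln(chunk.length, " bytes read");
}
```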
> Another problem I've noticed is the following:
> Say you're tokenizing some input range, and it happens to just be a
> huge, gigantic string.
> It *should* be possible to turn it into tokens with slices referring to
> the ORIGINAL string, which is VERY efficient because it doesn't require
> *any* heap allocations whatsoever. (You just tokenize with opApply() as
> you go, without ever requiring a heap allocation...)
> However, this is *only* possible if you don't use the concept of an
> input range!
How so? A slice is an input range, and so is a string.
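
For instance, a whitespace tokenizer over a string can hand every token out as a slice of the original, with no heap allocation at all. A minimal sketch (the sink-delegate shape stands in for the opApply style mentioned above):

```d
// Each token passed to `sink` is a slice of `input` -- no copying,
// no heap allocation on the tokenizer's part.
void tokenize(string input, scope void delegate(string token) sink)
{
    size_t start = 0;
    foreach (i, c; input)        // iterate code units with their index
    {
        if (c == ' ')
        {
            if (i > start)
                sink(input[start .. i]);
            start = i + 1;
        }
    }
    if (start < input.length)
        sink(input[start .. $]);
}

unittest
{
    string src = "lex this string";
    string[] seen;
    tokenize(src, (t) { seen ~= t; });
    assert(seen == ["lex", "this", "string"]);
    // The tokens really alias the original string:
    assert(seen[0].ptr == src.ptr);
}
```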
> Since you can't slice an input range, you'd be forced to use the front()
> and popFront() properties. But, as soon as you do that, you're gonna
> have to store the data somewhere... so your next-best option is to
> append it to some new gigantic array (instead of a bunch of small
> arrays, which require a lot of heap allocations), but even then, it's
> not as efficient as possible, because there's O(n) extra memory involved
> -- which defeats the whole purpose of working on small chunks at a time
> with no heap allocations.
> (If you're going to do that, after all, you might as well read the
> entire thing into a giant string at the beginning, and work with an
> array anyway, discarding the whole idea of a range while doing your
> tokenization.)
> Any ideas on how to solve this problem?
I think I get what you are saying here -- if you are processing, say, an
XML file, and you want to split it into tokens, you have to dup each
token out of the stream, because the buffer may be reused.

But doing the same thing for a string would be wasteful.

I think in these cases, we need two types of parsing. One is to process
the stream as it's read into a temporary buffer; if you need data from
the temporary buffer beyond the scope of the processing loop, you dup
it.

The other way is to read the entire file/stream into a buffer, then
process that buffer with the knowledge that it's never going to change.

We could probably have the buffer identify which situation it's in, so
the code can make a runtime decision on whether to dup or not.
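
Something like the following could express that idea (all names here are hypothetical, just a sketch): the buffer carries a flag saying whether its contents are stable, and token extraction consults it.

```d
struct TokenBuffer
{
    const(char)[] data;
    bool transient;  // true: contents get overwritten on the next refill

    // Return a token covering data[lo .. hi]; copy only when the
    // underlying storage is going to be reused.
    const(char)[] token(size_t lo, size_t hi)
    {
        auto slice = data[lo .. hi];
        return transient ? slice.idup : slice;
    }
}
```

A stream-backed buffer would set transient = true and pay for the dup; a whole-file or in-memory string buffer would leave it false and hand out slices for free.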
-Steve