How about a memory-mapped file? It is not lazy at all, but it could be
quick, provided that you have enough memory.

http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/MappedByteBuffer.html
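
To make that concrete, here is a minimal sketch of reading a slice of a
file through a read-only mapping. The class name MappedRead and the method
readUtf8 are made up for illustration; it assumes the caller already knows
a byte offset that starts a character, and a length that ends on a
character boundary (otherwise the decoder substitutes U+FFFD at the edges).

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not from the thread: maps one region of a file
// and decodes it as UTF-8.
class MappedRead {
    /** Decode `length` bytes starting at `byteOffset` via a read-only mapping. */
    static String readUtf8(Path file, long byteOffset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Only the requested region is mapped; the OS pages it in on access.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, byteOffset, length);
            // The mapping stays valid after the channel is closed.
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }
}
```

From Clojure this would be plain interop, e.g. (MappedRead/readUtf8 path 7 6).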

There are times when a database is too slow or too clumsy for quick
searching in a large UTF-8 file, but some kind of indexing seems necessary
if an exact character position is needed quickly. Possible solutions would
be skip lists, or frames that record how many multi-byte ("escape")
characters, and of what length, have occurred since the beginning of the
file, combined with some heuristic search. Maybe there are more efficient
data structures (bzip?) when one needs fast access under RAM constraints
and the data has low entropy (access logs).
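
A rough sketch of the frame idea, under my own assumptions: the whole file
is available as a byte array of valid UTF-8, "character" means a Unicode
code point, and the class name Utf8Index and the stride parameter are
invented here. Every stride-th code point gets its byte offset recorded;
a lookup jumps to the nearest frame and scans forward over at most
stride - 1 characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sparse index from code point position to byte offset.
class Utf8Index {
    private final byte[] data;
    private final int stride;        // code points per frame (assumed > 0)
    private final int[] frameStart;  // frameStart[k] = byte offset of code point k * stride
    private final int codePoints;    // total number of code points seen

    Utf8Index(byte[] utf8, int stride) {
        this.data = utf8;
        this.stride = stride;
        List<Integer> offsets = new ArrayList<>();
        int count = 0;
        for (int i = 0; i < utf8.length; i++) {
            if (!isContinuation(utf8[i])) {          // lead byte: a new code point starts here
                if (count % stride == 0) offsets.add(i);
                count++;
            }
        }
        this.codePoints = count;
        this.frameStart = offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Byte offset of the first byte of the code point at the given index. */
    int byteOffsetOf(int codePointIndex) {
        if (codePointIndex < 0 || codePointIndex >= codePoints) {
            throw new IndexOutOfBoundsException("code point " + codePointIndex);
        }
        int pos = frameStart[codePointIndex / stride];
        // Walk forward one code point at a time from the frame start.
        for (int remaining = codePointIndex % stride; remaining > 0; remaining--) {
            pos++;
            while (pos < data.length && isContinuation(data[pos])) pos++;
        }
        return pos;
    }
}
```

The stride trades memory for scan time: a larger stride means fewer
recorded offsets but a longer forward scan per lookup.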

/Linus

2012/1/5 Steve Miner <stevemi...@gmail.com>

>
> On Jan 5, 2012, at 5:07 PM, Andy Fingerhut wrote:
>
> > I realize that with variable-length multi-byte character encodings like
> > UTF-8, it would be a bad idea to seek to a random byte position and start
> > trying to decode a UTF-8 character starting at that byte position.  I'm
> > thinking of cases where you have an index of byte positions of interest you
> > want to jump to in the future that are known to be the first byte of a
> > character in the appropriate encoding.  I also realize that one must be
> > very cautious in writing to the middle of such a file, since byte lengths
> > of strings are variable.
>
>
> I can't help too much, but the comment about UTF-8 rang a bell.  It's
> actually not that hard to find a valid character by jumping to a random
> position.  You just need to be able to back up a few bytes.
>
> http://en.wikipedia.org/wiki/UTF-8
>
> >       * All continuation bytes (byte nos. 2-6 in the table above) have
> > 10 as their two most-significant bits (bits 7-6); in contrast, the first
> > byte never has 10 as its two most-significant bits. As a result, it is
> > immediately obvious whether any given byte anywhere in a (valid) UTF-8
> > stream represents the first byte of a byte sequence corresponding to a
> > single character, or a continuation byte of such a byte sequence.
>
> >       * As a consequence of no. 3 above, starting with any arbitrary
> > byte anywhere in a (valid) UTF-8 stream, it is necessary to back up by only
> > at most five bytes in order to get to the beginning of the byte sequence
> > corresponding to a single character (three bytes in actual UTF-8 as
> > explained in the next section). If it is not possible to back up, or a byte
> > is missing because of e.g. a communication failure, one single character
> > can be discarded, and the next character be correctly read.
>
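
The backing-up rule quoted above can be sketched in a few lines. This
assumes the bytes are valid UTF-8 held in an array; the class name Utf8Sync
and the method charStart are invented for illustration. With sequences of
at most four bytes, the loop steps back at most three times.

```java
// Hypothetical helper: find the start of the character containing a byte.
class Utf8Sync {
    // Continuation bytes have 10 as their two most-significant bits.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    /** Largest index <= pos whose byte is not a continuation byte. */
    static int charStart(byte[] utf8, int pos) {
        int i = pos;
        while (i > 0 && isContinuation(utf8[i])) {
            i--;  // still inside a multi-byte sequence, keep backing up
        }
        return i;
    }
}
```

The same test works on a MappedByteBuffer by reading buf.get(i) instead of
indexing into an array.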
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
>

