On Wed, Dec 23, 2009 at 12:31 AM, Nathan Kurz <[email protected]> wrote:
>> Whereas, using the filesystem really requires a file-flat data
>> structure?
>
> I guess it depends on your point of view: it would be hard (but not
> impossible) to do true objects in an mmapped file, but it would be
> very easy to do has-a type relationships using file offsets as
> pointers.  I tend to have a data-centric (rather than
> object-centric) point of view, but from here I don't see any data
> structures that would be significantly more difficult.

Interesting -- I guess if you "made" all the pointers relative (ie,
interpreted them so, when reading them), then you could make arbitrary
structures.

> Do you have a link that explains the FST you refer to?  I'm
> searching, and not finding anything that's a definite match.  "Field
> select table"?

Sorry -- FST = finite state transducer.  It adds optional "outputs"
along each edge, over a finite state machine.  When used for the terms
index, I think it'd be a symmetric letter trie -- ie, both prefixes
and suffixes are shared, with the "middle" part of the FST producing
the outputs (= index information for that term that uniquely crosses
that edge).

>> Ie, "going through the filesystem" and "going through shared
>> memory" are two alternatives for enabling efficient process-only
>> concurrency models.  They have interesting tradeoffs (I'll answer
>> more in 2026), but the fact that one of them is backed by a file by
>> the OS seems like a salient difference.
>
> For me, file backing doesn't seem like a big difference.  Fast
> moving changes will never hit disk, and I presume there is some way
> you can convince the system never to actually write out the slow
> changes (maybe mmap on a RamFS?).

What are fast & slow changes here?  Fast = new segments that get
created but then merged away before moving to stable storage?

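To make the "file offsets as pointers" idea above concrete, here is a
minimal sketch in Python.  The record layout, the names (TERM_HEAD,
build_image, read_term, roundtrip) and the choice to store the offset
relative to the start of the record are all assumptions made up for
illustration; none of this is Lucene's or anyone else's actual format.
It only shows that a has-a relationship (a term has a postings list)
can be followed through an mmapped file without real pointers.

```python
import mmap
import struct
import tempfile

# Hypothetical layout (little-endian), invented for this sketch:
#   term record : u32 postings_rel, u16 term_len, term bytes
#   postings    : u32 count, then count * u32 doc ids
# postings_rel is relative to the start of the term record, so the
# record can be read no matter where the file is mapped.
TERM_HEAD = struct.Struct("<IH")
U32 = struct.Struct("<I")


def build_image(term, doc_ids):
    """Serialize one term record followed directly by its postings."""
    term_bytes = term.encode("utf-8")
    postings_rel = TERM_HEAD.size + len(term_bytes)
    record = TERM_HEAD.pack(postings_rel, len(term_bytes)) + term_bytes
    postings = U32.pack(len(doc_ids)) + b"".join(U32.pack(d) for d in doc_ids)
    return record + postings


def read_term(buf, record_offset=0):
    """Follow the stored relative offset from a term to its postings."""
    postings_rel, term_len = TERM_HEAD.unpack_from(buf, record_offset)
    start = record_offset + TERM_HEAD.size
    term = bytes(buf[start:start + term_len]).decode("utf-8")
    postings_at = record_offset + postings_rel  # "pointer" = base + offset
    (count,) = U32.unpack_from(buf, postings_at)
    doc_ids = [
        U32.unpack_from(buf, postings_at + U32.size * (i + 1))[0]
        for i in range(count)
    ]
    return term, doc_ids


def roundtrip(term, doc_ids):
    """Write the image to a temp file, mmap it read-only, and read it back."""
    with tempfile.TemporaryFile() as f:
        f.write(build_image(term, doc_ids))
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            return read_term(buf)
```

Because nothing in the file depends on the address at which it gets
mapped, several processes could map the same file and follow the same
offsets, which is the property the thread is after.
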
> I think the real difference is between sharing between threads and
> sharing between processes --- basically, whether or not you can
> assume that the address space is identical in all the 'sharees'.

Yes, process-only concurrency seems like the big difference.

> I'll mention that, given the New Year, at first I thought 2026 was
> your realistic time estimate rather than a tracking number.

Heh ;)

> I started thinking about how one could do objects with mmap, and
> came up with an approach that doesn't quite answer that question but
> might actually work out well for other problems: you could literally
> compile your index and link it in as a shared library.  Each term
> would be a symbol, and you'd use 'dlsym' to find the associated
> data.
>
> It's possible that you could even use library versioning to handle
> updates, and stuff like RTLD_NEXT to handle multiple
> segments.  Perhaps a really bad idea, but I find it an intriguing
> one.  I wonder how fast using libdl would be compared to writing
> your own lookup tables.  I'd have to guess it's fairly efficient.

That is a wild idea!

I wonder how dlsym represents its information...

Mike
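
For anyone who wants to poke at the dlsym idea, the by-name lookup
itself can be tried from Python on a POSIX system, where ctypes sits
on top of dlopen/dlsym.  This is only a rough sketch: lookup_symbol is
a hypothetical helper, and the C library already loaded in the process
stands in for the "index" (a real experiment would build its own
shared object with one exported data symbol per term, which is not
shown here).

```python
import ctypes


def lookup_symbol(handle, name):
    """Return the address of `name` in `handle`, or None if not exported.

    On POSIX, attribute access on a ctypes.CDLL resolves the name
    through dlsym(); a missing symbol surfaces as AttributeError.
    """
    try:
        sym = getattr(handle, name)
    except AttributeError:
        return None
    return ctypes.cast(sym, ctypes.c_void_p).value


# CDLL(None) corresponds to dlopen(NULL): the running process plus the
# libraries already loaded into it (assumes a POSIX platform).
process = ctypes.CDLL(None)

# "strlen" plays the role of a term; the address is where its data lives.
strlen_addr = lookup_symbol(process, "strlen")
missing = lookup_symbol(process, "no_such_term_in_this_index")
```

Timing lookup_symbol against a plain dict or a sorted table would be
one crude way to start answering the "how fast is libdl" question,
though Python overhead would dominate the measurement.
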
