To me it sounds like you've come up with the main options: BitArray or
Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with different
indexing semantics, it seems you could probably come up with a reasonable way
to update just part of it. But even if you use Array{Bool}, you're "only"
talking about a few hundred megabytes, which is not catastrophically large. Also
consider keeping everything in memory; with 100GB of RAM you could store an
awful lot of selections.
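
To make the size comparison and the "Vector of 64-bit words" point concrete,
here is a rough sketch. It assumes a recent Julia (1.x), where Uint64 is
spelled UInt64. The `chunks` field is an internal detail of BitArray rather
than public API, and `chunk_index` / `bit_offset` are hypothetical helper
names of mine. With, say, 300 million events, a Vector{Bool} would be about
300 MB and a BitVector about 37.5 MB per selection.

```julia
# Sketch only: `chunks` is an internal field of BitArray (an assumption
# that may change between Julia versions), not a documented interface.
n = 1_000                       # small stand-in for a few hundred million events

sel_bits  = falses(n)           # BitVector: 1 bit per event
sel_bools = fill(false, n)      # Vector{Bool}: 1 byte per event

bytes_bits  = sizeof(sel_bits.chunks)   # 8 bytes per 64 events, rounded up
bytes_bools = sizeof(sel_bools)         # n bytes

# Hypothetical helpers: event i lives in 64-bit word (i-1)>>6 + 1,
# at bit offset (i-1)&63 counted from the least significant bit.
chunk_index(i) = ((i - 1) >> 6) + 1
bit_offset(i)  = (i - 1) & 63

sel_bits[130] = true
word = sel_bits.chunks[chunk_index(130)]
bit_is_set = ((word >> bit_offset(130)) & 1) == 1
```

So updating "just part" of a saved selection would amount to rewriting only
the 64-bit words that cover the affected range of events.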
--Tim
On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote:
> Hello,
>
> I have an application where I have a few hundred million events, and I'd
> like to make and work with different selections of sets of those events.
> The events each have various values associated with them, say for
> simplicity color, timestamp, and loudness. Say one selection includes all
> the events within 5 minutes after a blue event. Or I want to select all
> events that aren't above some loudness threshold. I'd like to be able to
> save these selections in a JLD file for later use on some or all events. I
> also need to be able to update the selections as I observe more events.
>
> My baseline plan is to have an integer associated with each event, and each
> bit in the integer i corresponds to a selection. So bit 1 is true for
> events within 5 minutes and bit 2 is true for events above the loudness
> threshold. Then for each event's integer I can do bits(i)[1] or bits(i)[2]
> to figure out if it is included in each selection. That seems like it would
> be inefficient since bits() returns a string, so I would have to call
> bool(bits(i)[1]). I could use a bitwise mask of some sort like 1&i!=0 for
> the first bit and 2&i!=0 for the second bit.
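
A minimal sketch of the mask approach, assuming a recent Julia (1.x). The
constant names, the `in_selection` helper, and the five-event `flags` vector
are all made up for illustration. One UInt8 per event gives room for up to 8
selections, and testing a selection means comparing the masked value against
zero.

```julia
# Hypothetical names; bit 1 and bit 2 follow the numbering in the message.
const WITHIN_5MIN = 0x01   # selection 1: within 5 minutes after a blue event
const LOUD        = 0x02   # selection 2: above the loudness threshold

flags = zeros(UInt8, 5)            # one UInt8 per event (toy data)
flags[2] |= WITHIN_5MIN            # event 2 is in selection 1
flags[4] |= WITHIN_5MIN | LOUD     # event 4 is in both selections
flags[5] |= LOUD                   # event 5 is in selection 2

# A selection bit is set when the masked value is nonzero.
in_selection(f, mask) = (f & mask) != 0

within = findall(f -> in_selection(f, WITHIN_5MIN), flags)
quiet  = findall(f -> !in_selection(f, LOUD), flags)
```

A plain Vector{UInt8} like this should be as easy to store as an
Array{Bool}, and it costs the same one byte per event while holding several
selections at once.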
>
> A BitArray seems like a decent choice, except that you can only read/write
> the entire array from a JLD file, rather than just a part of it. That will
> be inefficient since I'll often want to look at only a small subset of the
> total events. And every time I want to update for new events, I would need
> to read the entire BitArray, extend it in memory, then delete the old JLD
> object and replace it with a new JLD object. It seems plausible I could
> figure out how to read/write part of a BitArray from a JLD as I've already
> done some hacking on HDF5.jl, but that could be a large amount of work.
>
> An Array{Bool} works well with JLD, and seems just as well suited as a
> BitArray. I think it's 8 times bigger than BitArray, and has a similar
> space ratio to an integer (depending on how many selections I actually use)
> because bools are stored as 1 byte? I can probably live with that, although
> again it seems sort of inefficient.
>
> Any advice on how I should go about deciding, or options I hadn't
> considered? Also why does bits() return a string, instead of say
> Vector{Bool} or BitArray?