Hello,
I have an application where I have a few hundred million events, and I'd
like to make and work with different selections (subsets) of those events.
The events each have various values associated with them, say for
simplicity color, timestamp, and loudness. Say one selection includes all
the events within 5 minutes after a blue event. Or I want to select all
events that aren't above some loudness threshold. I'd like to be able to
save these selections in a JLD file for later use on some or all events. I
also need to be able to update the selections as I observe more events.
My baseline plan is to have an integer associated with each event, where
each bit in the integer i corresponds to a selection. So bit 1 is true for
events within 5 minutes after a blue event and bit 2 is true for events
above the loudness threshold. Then for each event's integer I can do
bits(i)[end] or bits(i)[end-1] (the string lists the most significant bit
first) to figure out if it is included in each selection. That seems like
it would be inefficient since bits() returns a string, so I would have to
compare characters, e.g. bits(i)[end]=='1'. I could instead use a bitwise
mask of some sort, like 1&i!=0 for the first bit and 2&i!=0 for the second
bit.
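To make the mask idea concrete, here is a rough sketch of what I have in
mind. It assumes Julia 1.x (findall, and bits() is spelled bitstring()
there), and the names SEL_AFTER_BLUE, SEL_LOUD, isselected, select and
deselect are made up for illustration; they are not from any package.

```julia
# Hypothetical selection masks: one bit per selection.
SEL_AFTER_BLUE = UInt8(1) << 0   # bit 1: within 5 minutes after a blue event
SEL_LOUD       = UInt8(1) << 1   # bit 2: above the loudness threshold

# True if any bit of `mask` is set in `flags`.
isselected(flags::Integer, mask::Integer) = (flags & mask) != 0

# Return a copy of `flags` with the bits of `mask` switched on or off.
select(flags::Integer, mask::Integer)   = flags | mask
deselect(flags::Integer, mask::Integer) = flags & ~mask

# One flag integer per event (5 events here instead of a few hundred million).
flags = zeros(UInt8, 5)
flags[2] = select(flags[2], SEL_AFTER_BLUE)
flags[4] = select(flags[4], SEL_AFTER_BLUE | SEL_LOUD)

# Indices of the events that belong to the "loud" selection.
loud_idx = findall(f -> isselected(f, SEL_LOUD), flags)
```

With UInt8 flags this gives room for 8 selections per event; a wider
unsigned type would be needed for more.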
A BitArray seems like a decent choice, except that you can only read/write
the entire array from a JLD file, rather than just a part of it. That will
be inefficient since I'll often want to look at only a small subset of the
total events. And every time I want to update for new events, I would need
to read the entire BitArray, extend it in memory, then delete the old JLD
object and replace it with a new JLD object. It seems plausible I could
figure out how to read/write part of a BitArray from a JLD as I've already
done some hacking on HDF5.jl, but that could be a large amount of work.
An Array{Bool} works well with JLD, and seems just as well suited as a
BitArray. I think it's 8 times bigger than a BitArray because Bools are
stored as 1 byte each, and it takes about as much space as the integer
approach (depending on how many selections I actually use)? I can probably
live with that, although again it seems sort of inefficient.
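Here is the back-of-the-envelope arithmetic behind that size comparison,
as a small sketch. The event count n and the number of selections k are
made-up values, and the numbers count payload bytes only, ignoring array
headers and any file-format overhead.

```julia
n = 1_000_000   # hypothetical number of events
k = 3           # hypothetical number of selections (must be <= 8 for UInt8 flags)

# k separate Vector{Bool}s: one byte per event per selection.
bool_bytes = k * sizeof(fill(false, n))

# k separate BitArrays: one bit per event, packed into 64-bit chunks.
bit_bytes = k * cld(n, 64) * sizeof(UInt64)

# One UInt8 per event, with each selection taking one bit of it.
flag_bytes = n * sizeof(UInt8)

println("Vector{Bool}: ", bool_bytes, " bytes")
println("BitArray:     ", bit_bytes, " bytes")
println("UInt8 flags:  ", flag_bytes, " bytes")
```

So under these assumptions the integer flags cost the same as a single
Vector{Bool} no matter how many of the 8 bits are used, while the BitArrays
stay smaller until all 8 selections are in use.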
Any advice on how I should go about deciding, or options I hadn't
considered? Also why does bits() return a string, instead of say
Vector{Bool} or BitArray?