It turns out Vector{Bool} does not work well with JLD. So I played around
with BitArray, and figured out that it would be pretty easy to use for my
purposes. It seems to me that BitArray could be made a bit more useful by
exporting a reinterpret method. That would certainly cut down the code in my
use case, but it could also replace the current implementation of bits. I
think it makes more sense for bits to return a BitArray than a String
anyway, since it would be much faster for uses like bits(4)[1]. Would it
be worth making a pull request adding something like this to Base?
(Clearly, redefining bits would change behavior and break things, so I'm not
sure how to approach that.)

I wrote up some simple example code and it works fine; however, it isn't
actually any faster than the current bits implementation, which I found
surprising. Maybe it would be a bit faster if a BitArray could be
constructed from v directly, instead of allocating and then immediately
replacing b.chunks.

function reinterpret2(::Type{BitArray}, v::Vector{Uint64}, dims=(64,-1))
    # default to one 64-bit column per element of v
    dims[2] == -1 && (dims=(64,length(v)))
    # check to make sure the dims are appropriate for length of v
    b = BitArray(dims...)
    b.chunks = v
    b
end

function reinterpret2(::Type{BitArray}, i::Uint64, dims=64)
    assert(dims <= 64)
    b = BitArray(dims)
    b.chunks = [i]
    b
end

bits2(i::Uint64) = reinterpret2(BitArray, i)
bits2(x) = reinterpret2(BitArray,reinterpret(Uint64, x))

testbits(n) = [bits(i)[1] for i=1:n]
testbits2(n) = Bool[bits2(i)[1] for i=1:n]

# call each function once first so compilation is not included in the timing
testbits(1);
@time testbits(100000);
testbits2(1);
@time testbits2(100000);
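
For comparison, a plain shift-and-mask check avoids building either a String
or a BitArray for each integer. This is only a minimal sketch: hasbit and
testmask are hypothetical names (nothing like them is in Base), and bit 1 is
assumed to mean the least significant bit.

```julia
# Hypothetical helper, not part of Base: is bit k of x set?
# Assumes k is 1-based and counts from the least significant bit.
hasbit(x::Integer, k::Integer) = (x >> (k - 1)) & 1 == 1

# Same loop shape as testbits/testbits2, but no temporary object per element.
testmask(n) = Bool[hasbit(i, 1) for i in 1:n]
```

It can be timed the same way, with testmask(1); followed by
@time testmask(100000);.
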
On Tuesday, August 5, 2014 11:26:39 PM UTC-6, Simon Kornblith wrote:
>
> Assuming you have enough memory to write a BitArray to the JLD file
> initially, if you later open the JLD file with mmaparrays=true and read
> it, JLD will mmap the underlying Vector{Uint64} so that pieces are read
> from the disk as they are accessed. (The actual specifics of how this works
> are up to the OS, but generally it works well.) In principle you can also
> modify the BitArray and the changes will be saved to the disk, although I'm
> not sure how well that works since I don't do it in my own code. There is no
> easy way to resize the BitArray if you do this, though.
>
> Simon
>
> On Tuesday, August 5, 2014 5:06:16 PM UTC-4, Tim Holy wrote:
>>
>> To me it sounds like you've come up with the main options: BitArray or
>> Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with
>> different indexing semantics, it seems you could probably come up with a
>> reasonable way to update just part of it. But even if you use Array{Bool},
>> you're "only" talking about a few hundred megabytes, which is not
>> catastrophically large. Also consider keeping everything in memory; with
>> 100GB of RAM you could store an awful lot of selections.
>>
>> --Tim
>>
>> On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote:
>> > Hello,
>> >
>> > I have an application where I have a few hundred million events, and I'd
>> > like to make and work with different selections of sets of those events.
>> > The events each have various values associated with them, say for
>> > simplicity color, timestamp, and loudness. Say one selection includes all
>> > the events within 5 minutes after a blue event. Or I want to select all
>> > events that aren't above some loudness threshold. I'd like to be able to
>> > save these selections in a JLD file for later use on some or all events.
>> > I also need to be able to update the selections as I observe more events.
>> >
>> > My baseline plan is to have an integer associated with each event, and
>> > each bit in the integer i corresponds to a selection. So bit 1 is true for
>> > events within 5 minutes and bit 2 is true for events above the loudness
>> > threshold. Then for each event's integer I can do bits(i)[1] or bits(i)[2]
>> > to figure out if it is included in each selection. That seems like it
>> > would be inefficient since bits() returns a string, so I would have to
>> > call bool(bits(i)[1]). I could use a bitwise mask of some sort like
>> > 1&i==0 for the first bit and 2&i==0 for the second bit.
>> >
>> > A BitArray seems like a decent choice, except that you can only read/write
>> > the entire array from a JLD file, rather than just a part of it. That will
>> > be inefficient since I'll often want to look at only a small subset of the
>> > total events. And every time I want to update for new events, I would need
>> > to read the entire BitArray, extend it in memory, then delete the old JLD
>> > object and replace it with a new JLD object. It seems plausible I could
>> > figure out how to read/write part of a BitArray from a JLD as I've already
>> > done some hacking on HDF5.jl, but that could be a large amount of work.
>> >
>> > An Array{Bool} works well with JLD, and seems just as well suited as a
>> > BitArray. I think it's 8 times bigger than BitArray, and has a similar
>> > space ratio to an integer (depending on how many selections I actually
>> > use) because bools are stored as 1 byte? I can probably live with that,
>> > although again it seems sort of inefficient.
>> >
>> > Any advice on how I should go about deciding, or options I hadn't
>> > considered? Also why does bits() return a string, instead of say
>> > Vector{Bool} or BitArray?
>>
>>