Let me clarify: Vector{Bool} works fine with JLD. What it doesn't work well
with is extendible HDF5Datasets, which is what I actually want to use.
On Wednesday, August 13, 2014 6:04:46 PM UTC-6, ggggg wrote:
>
> It turns out Vector{Bool} does not work well with JLD. So I played around
> with BitArray, and I figured out that it would be pretty easy to use for my
> purposes. It seems to me that BitArray could be made a bit more useful by
> exporting a reinterpret method. That would certainly make my use case need
> less code, but it could also replace the current implementation of bits. I
> think it makes more sense for bits to return a BitArray than a String
> anyway, since it would be much faster for uses like bits(4)[1]. Would it
> be worth making a pull request adding something like this in base?
> (Clearly redefining bits would change behavior and break things, so I'm not
> sure how to approach that.)
>
> I wrote up some simple example code and it works fine; however, it isn't
> actually any faster than the current bits implementation, which I found
> surprising. Maybe it would be a bit faster if BitArray could be
> constructed from v directly, instead of allocating and then immediately
> replacing b.chunks.
>
> function reinterpret2(::Type{BitArray}, v::Vector{Uint64}, dims=(64,-1))
>     # a second dim of -1 means "one column of 64 bits per element of v"
>     dims[2] == -1 && (dims=(64,length(v)))
>     # check to make sure the dims are appropriate for length of v
>     b = BitArray(dims...)
>     b.chunks = v
>     b
> end
>
> function reinterpret2(::Type{BitArray}, i::Uint64, dims=64)
>     assert(dims <= 64)
>     b = BitArray(dims)
>     b.chunks = [i]
>     b
> end
>
> bits2(i::Uint64) = reinterpret2(BitArray, i)
> bits2(x) = reinterpret2(BitArray,reinterpret(Uint64, x))
>
> testbits(n) = [bits(i)[1] for i=1:n]
> testbits2(n) = Bool[bits2(i)[1] for i=1:n]
>
>
> testbits(1);
> @time testbits(100000);
> testbits2(1);
> @time testbits2(100000);
>
>
>
> On Tuesday, August 5, 2014 11:26:39 PM UTC-6, Simon Kornblith wrote:
>>
>> Assuming you have enough memory to write a BitArray to the JLD file
>> initially, if you later open the JLD file with mmaparrays=true and read
>> it, JLD will mmap the underlying Vector{Uint64} so that pieces are read
>> from the disk as they are accessed. (The actual specifics of how this works
>> are up to the OS, but generally it works well.) In principle you can also
>> modify the BitArray and the changes will be saved to the disk, although I'm
>> not sure how well that works since I don't do it in my own code. There is no
>> easy way to resize the BitArray if you do this, though.
>>
>> Simon
>>
>> On Tuesday, August 5, 2014 5:06:16 PM UTC-4, Tim Holy wrote:
>>>
>>> To me it sounds like you've come up with the main options: BitArray or
>>> Array{Bool}. Since a BitArray is, underneath, a Vector{Uint64} with
>>> different indexing semantics, it seems you could probably come up with a
>>> reasonable way to update just part of it. But even if you use Array{Bool},
>>> you're "only" talking a few hundred megabytes, which is not
>>> catastrophically large. Also consider keeping everything in memory; with
>>> 100GB of RAM you could store an awful lot of selections.
>>>
>>> --Tim
>>>
>>> On Tuesday, August 05, 2014 12:01:58 PM ggggg wrote:
>>> > Hello,
>>> >
>>> > I have an application where I have a few hundred million events, and I'd
>>> > like to make and work with different selections of sets of those events.
>>> > The events each have various values associated with them, say for
>>> > simplicity color, timestamp, and loudness. Say one selection includes all
>>> > the events within 5 minutes after a blue event. Or I want to select all
>>> > events that aren't above some loudness threshold. I'd like to be able to
>>> > save these selections in a JLD file for later use on some or all events. I
>>> > also need to be able to update the selections as I observe more events.
>>> >
>>> > My baseline plan is to have an integer associated with each event, where
>>> > each bit in the integer i corresponds to a selection. So bit 1 is true for
>>> > events within 5 minutes and bit 2 is true for events above the loudness
>>> > threshold. Then for each event's integer I can do bits(i)[1] or bits(i)[2]
>>> > to figure out if it is included in each selection. That seems like it would
>>> > be inefficient since bits() returns a string, so I would have to call
>>> > bool(bits(i)[1]). I could use a bitwise mask of some sort like 1&i!=0 for
>>> > the first bit and 2&i!=0 for the second bit.
>>> >
>>> > A BitArray seems like a decent choice, except that you can only read/write
>>> > the entire array from a JLD file, rather than just a part of it. That will
>>> > be inefficient since I'll often want to look at only a small subset of the
>>> > total events. And every time I want to update for new events, I would need
>>> > to read the entire BitArray, extend it in memory, then delete the old JLD
>>> > object and replace it with a new JLD object. It seems plausible I could
>>> > figure out how to read/write part of a BitArray from a JLD as I've already
>>> > done some hacking on HDF5.jl, but that could be a large amount of work.
>>> >
>>> > An Array{Bool} works well with JLD, and seems just as well suited as a
>>> > BitArray. I think it's 8 times bigger than BitArray, and has a similar
>>> > space ratio to an integer (depending on how many selections I actually use)
>>> > because bools are stored as 1 byte? I can probably live with that, although
>>> > again it seems sort of inefficient.
>>> >
>>> > Any advice on how I should go about deciding, or options I hadn't
>>> > considered? Also why does bits() return a string, instead of say
>>> > Vector{Bool} or BitArray?
>>>
>>>
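
For reference, the bitwise-mask idea from the original question could look
roughly like the sketch below. This is only an illustration, not code from
anyone's actual project: the names WITHIN_5MIN, LOUD, isselected, and
addselection are made up here, and it assumes each event carries a small
unsigned integer of flags with one bit per selection.

```julia
# Hypothetical selection masks: one bit per selection.
const WITHIN_5MIN = 0x01   # bit 1: event within 5 minutes after a blue event
const LOUD        = 0x02   # bit 2: event above the loudness threshold

# True if any bit of `mask` is set in `flags`.
isselected(flags, mask) = (flags & mask) != 0

# Return `flags` with the bits of `mask` switched on.
addselection(flags, mask) = flags | mask

# Start with an event that belongs to no selection, then add it to one.
flags = 0x00
flags = addselection(flags, WITHIN_5MIN)

println(isselected(flags, WITHIN_5MIN))
println(isselected(flags, LOUD))
```

Testing membership this way avoids building the string that bits() returns,
at the cost of tracking by hand which bit means which selection.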