I am less interested in the choice of names than the pro and con of when these 
Roaring bitmaps are worth using and when they are not.

It is a bit like discussing whether various compression techniques are worth 
using as the storage or memory costs can be weighed against the CPU or 
transient memory costs of compressing and uncompressing.  The value tends to 
depend on many factors and there may even be times you want to store the data 
in multiple data structures with each optimized to store that kind or amount of 
data.

Roaring bitmaps claim to be an improvement not only over uncompressed 
structures but some other compressed versions but my reading shows it may be 
limited to some uses. Bitsets in general seem to be useful only for a largely 
contiguous set of integers where each sequential bit represents whether the nth 
integer above the lowest is in the set or not. Of course, properly set up, this 
makes Unions and Intersections and some other operations fairly efficient. But 
sets are not the same as dictionaries and often you are storing other data 
types than smaller integers.

Many normal compression techniques can require lots of time to uncompress to 
find anything. My impression is that Roaring Bitmaps is a tad like the pkzip 
software that tries various compression techniques on each file and chooses 
whatever seems to work better on each one. That takes extra time when zipping, 
but when unzipping a file, it goes directly to the method used to compress it 
as the type is in a header and just expands it one way.

My impression is that Roaring bitmaps breaks up the range of integers into 
smaller chunks and depending on what is being stored in that chunk, may leave 
it as an uncompressed bitmap, or a list of the sparse contents, or other 
storage methods and can search each version fairly quickly. 

So, I have no doubt it works great for some applications such as treating 
social security numbers as integers. It likely would be overkill to store 
something like the components of an IP address between 0 and 255 inclusive.

But having said that, there may well be non-integer data that can be mapped 
into and out of integers. As an example, airports or radio stations have names 
like LAX or WPIX. If you limit yourself to ASCII letters then every one of them 
can be stored as a 32-bit integer, perhaps with some padding. Of course for 
such fairly simple data, some might choose to place the data in a balanced tree 
structure and get reasonable search speed.

I am curious about the size of some of these structures but obviously it 
depends. Are they stored on disk in this form too?


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail....@python.org> On 
Behalf Of MRAB
Sent: Thursday, February 16, 2023 11:24 PM
To: python-list@python.org
Subject: Re: Comparing caching strategies

On 2023-02-14 22:20, Rob Cliffe via Python-list wrote:
> On 11/02/2023 00:39, Dino wrote:
>> First off, a big shout out to Peter J. Holzer, who mentioned roaring 
>> bitmaps a few days ago and led me to quite a discovery.
>>
> I was intrigued to hear about roaring bitmaps and discover they really 
> were a thing (not a typo as I suspected at first).
> What next, I wonder?
>       argumentative arrays
>       chattering classes (on second thoughts, we have those already)
>       dancing dictionaries
>       garrulous generators
>       laughing lists
>       piping pipelines
>       singing strings
>       speaking sets
>       stuttering sorts
>       talking tuples
>       whistling walruses?

babbling bytestrings?

> The future awaits [pun not intended] ...

--
https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list

Reply via email to