Re: A space-efficient, user-friendly way to store categorical data

Joe Conway Mon, 12 Feb 2018 08:32:00 -0800

On 02/11/2018 10:06 PM, Thomas Munro wrote:
> On Mon, Feb 12, 2018 at 12:24 PM, Andrew Dunstan
> <andrew.duns...@2ndquadrant.com> wrote:
>> On Mon, Feb 12, 2018 at 9:10 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Andrew Kane <and...@chartkick.com> writes:
>>>> A better option could be a new "dynamic enum" type, which would have
>>>> similar storage requirements as an enum, but instead of labels being
>>>> declared ahead of time, they would be added as data is inserted.
>>>
>>> You realize, of course, that it's possible to add labels to an enum type
>>> today.  (Removing them is another story.)
>>>
>>> You haven't explained exactly what you have in mind that is going to be
>>> able to duplicate the advantages of the current enum implementation
>>> without its disadvantages, so it's hard to evaluate this proposal.
>>>
>>
>>
>> This sounds rather like the idea I have been tossing around in my head
>> for a while, and in sporadic discussions with a few people, for a
>> dictionary object. The idea is to have an append-only list of labels
>> which would not obey transactional semantics, and would thus help us
>> avoid the pitfalls of enums - there wouldn't be any rollback of an
>> addition.  The use case would be for a jsonb representation which
>> would replace object keys with the oid value of the corresponding
>> dictionary entry rather like enums now. We could have a per-table
>> dictionary which in most typical json use cases would be very small,
>> and we know from some experimental data that the compression in space
>> used from such a change would often be substantial.
>>
>> This would have to be modifiable dynamically rather than requiring
>> explicit additions to the dictionary, to be of practical use for the
>> jsonb case, I believe.
>>
>> I hadn't thought about this as a sort of super enum that was usable
>> directly by users, but it makes sense.
>>
>> I have no idea how hard or even possible it would be to implement.
> 
> I have had thoughts over the years about something similar, but going
> the other way and hiding it from the end user.  If you could declare a
> column to have a special compressed property (independently of the
> type) then it could either automatically maintain a dictionary, or at
> least build a new dictionary for you when you next run some kind of
> COMPRESS operation.  There would be no user visible difference except
> footprint.  In ancient DB2 they had a column property along those
> lines called "VALUE COMPRESSION" (they also have a row-level version,
> and now they have much more advanced kinds of adaptive compression
> that I haven't kept up with).  In some ways it'd be a bit like toast
> with shared entries, but I haven't seriously looked into how such a
> thing might be implemented.


For what it is worth, there is a similar concept in R called "factors".
When categorical data is stored in a data.frame (the R equivalent of a
relation), it is transparently and automatically converted to a factor. I
believe this is both for storage compression and to facilitate some of the
analytics. In R you can also explicitly specify to *not* convert strings
to factors as a performance optimization, because that conversion does
have a noticeable impact during ingestion and is not always needed.
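
As a rough illustration of the general idea (a sketch in Python, not R's
actual implementation and not a proposal for PostgreSQL internals), each
distinct label can be stored once in an append-only dictionary and replaced
by a small integer code. The names LabelDictionary and to_factor are
hypothetical, and codes are assigned here in first-seen order, which is an
assumption of this sketch (R sorts its levels by default).

```python
from typing import Dict, Iterable, List, Tuple


class LabelDictionary:
    """Append-only mapping from labels to small integer codes (hypothetical)."""

    def __init__(self) -> None:
        self._codes: Dict[str, int] = {}   # label -> code
        self._labels: List[str] = []       # code -> label

    def encode(self, label: str) -> int:
        """Return the code for a label, adding the label if it is new."""
        code = self._codes.get(label)
        if code is None:
            code = len(self._labels)
            self._codes[label] = code
            self._labels.append(label)
        return code

    def decode(self, code: int) -> str:
        """Return the label stored for a previously issued code."""
        return self._labels[code]

    def __len__(self) -> int:
        return len(self._labels)


def to_factor(values: Iterable[str]) -> Tuple[List[int], LabelDictionary]:
    """Encode a column of strings as integer codes plus its dictionary."""
    dictionary = LabelDictionary()
    codes = [dictionary.encode(v) for v in values]
    return codes, dictionary


codes, levels = to_factor(["red", "green", "red", "blue", "green"])
print(codes)                               # integer codes, one per row
print([levels.decode(c) for c in codes])   # round-trips to the input
```

The per-value dictionary lookup in encode() is where the ingestion cost
mentioned above comes from in this sketch, which is why skipping the
conversion can pay off when the codes are never needed.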

I can also envision this type of mechanism being useful in other
security-related scenarios.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
