I also provided some adverbs
toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]'
fromVint=: 1 : 'm&~: +/;.2 ]'
these let you set the breakup point to what you wish
15 toVint 33 12 17
15 15 3 12 15 2
511 toVint 1233
511 511 211
511 fromVint 511 toVint 1233 444
1233 444
It can provide simple compression. If you have a list of many strings where
almost all of them will fit into a 6bit alphabet, you can still encode a full 7
bit or 8bit alphabet. On average, you'd save space using 6bit alphabet,
depending on your context.
The variable string functions I showed, still let you do a substring search on
the packed data. Part of it is disk and memory compression (don't expand until
you need to, and a string takes much less memory than a list of numbers < 256),
but part of it is not being forced in design to determine what the impossible
cutoff number is.
A big advantage related to J is having binary data records. If you encode the
length of the binary data, you no longer have to box them if a trailing 0 or
space would be difficult to know whether it is a fill or intentional.
Beyond binary data, there is dumb stuff such as storing addresses. Most people
have just 1, Most fit within 255 bytes. Usually no one cares about address
info except when they do or already have the client record. Do you really want
a separate table with 1 to many join with a possible 10000 character fixed size
per address just because you're unsure if it will ever be possible for you to
store an address to an African village as a set of driving instructions and
then a description of the house relative to 3 other houses and landmarks just
in case one of the other houses gets painted.
Another neat feature about the coding method is that if you have a lot of
inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least
compresses well.
A minor thing is that the encoding still lets you do search by < or >. If you
are storing the number of orders per customer. The first byte tells you
whether it is > 254 or less than any other number below 255 (probably enough
info to select customers of interest). Another minor thing is that you don't
have to use the encoding codes until there is an overflow. It may be easier to
just switch functions to decode variable data, then to reshape terrabytes of
data.
________________________________
From: Joe Bogner <[email protected]>
To: [email protected]
Sent: Tuesday, December 30, 2014 10:45 AM
Subject: Re: [Jprogramming] variable integer coding (challenge)
Pascal,
Thanks for sharing. Can you also share a potential use case?
I had to remind myself about the size of integers (4 bytes) on 32-bit.
Will this only save space on numbers less than 1019?
NB. 4 bytes
toVint2 1019
255 255 255 254
NB. 5 bytes
toVint2 1020
255 255 255 255 0
Are you looking for the savings in terms of disk space (saving it to a
file) or working memory?
On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
> A common design decision is how large to make a field. For many numbers, we
> think that a byte will be large enough to hold practical values, but one day
> we will find out that it is not. This is essentially the Y2k problem, or
> year 2032 issue.
>
> A simple variable integer coding (tied to byte ranges)
>
> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each
>
> toVint 566 44
> 255 255 56 44
>
> It now takes 3 bytes to store 566, and one byte to store 44.
>
> the challenge is to write fromVint such that
>
>
> 566 44 -: fromVint 255 255 56 44
>
> as an extra challenge, is there a way to write toVint or fromVint such that
> it is faster than using intermediate boxing?
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm