I also provided some adverbs

toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]' 
fromVint=: 1 : 'm&~: +/;.2 ]'

these let you set the breakup point to what you wish

  15 toVint 33 12 17 
15 15 3 12 15 2 
  511 toVint 1233 
511 511 211 

  511 fromVint 511 toVint 1233 444 
1233 444
It can provide simple compression.  If you have a list of many strings where 
almost all of them will fit into a 6bit alphabet, you can still encode a full 7 
bit or 8bit alphabet.  On average, you'd save space using 6bit alphabet, 
depending on your context.

The variable string functions I showed, still let you do a substring search on 
the packed data.  Part of it is disk and memory compression (don't expand until 
you need to, and a string takes much less memory than a list of numbers < 256), 
but part of it is not being forced in design to determine what the impossible 
cutoff number is.

A big advantage related to J is having binary data records.  If you encode the 
length of the binary data, you no longer have to box them if a trailing 0 or 
space would be difficult to know whether it is a fill or intentional.

Beyond binary data, there is dumb stuff such as storing addresses.  Most people 
have just 1,  Most fit within 255 bytes.  Usually no one cares about address 
info except when they do or already have the client record.  Do you really want 
a separate table with 1 to many join with a possible 10000 character fixed size 
per address just because you're unsure if it will ever be possible for you to 
store an address to an African village as a set of driving instructions and 
then a description of the house relative to 3 other houses and landmarks just 
in case one of the other houses gets painted.

Another neat feature about the coding method is that if you have a lot of 
inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least 
compresses well.

A minor thing is that the encoding still lets you do search by < or >.  If you 
are storing the number of orders per customer.  The first byte tells you 
whether it is > 254 or less than any other number below 255 (probably enough 
info to select customers of interest).  Another minor thing is that you don't 
have to use the encoding codes until there is an overflow.  It may be easier to 
just switch functions to decode variable data, then to reshape terrabytes of 
data.
________________________________
From: Joe Bogner <[email protected]>
To: [email protected] 
Sent: Tuesday, December 30, 2014 10:45 AM
Subject: Re: [Jprogramming] variable integer coding (challenge)


Pascal,

Thanks for sharing. Can you also share a potential use case?

I had to remind myself about the size of integers (4 bytes) on 32-bit.

Will this only save space on numbers less than 1019?

NB. 4 bytes
toVint2 1019
255 255 255 254

NB. 5 bytes
toVint2 1020
255 255 255 255 0

Are you looking for the savings in terms of disk space (saving it to a
file) or working memory?






On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
> A common design decision is how large to make a field.  For many numbers, we 
> think that a byte will be large enough to hold practical values, but one day 
> we will find out that it is not.  This is essentially the Y2k problem, or 
> year 2032 issue.
>
> A simple variable integer coding (tied to byte ranges)
>
> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each
>
>   toVint 566 44
> 255 255 56 44
>
> It now takes 3 bytes to store 566, and one byte to store 44.
>
> the challenge is to write fromVint such that
>
>
> 566 44 -: fromVint 255 255 56 44
>
> as an extra challenge, is there a way to write toVint or fromVint such that 
> it is faster than using intermediate boxing?
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to