I was not thinking of encoding a whole table, into a binary field, but that
could be done as you described too.
My specific use case is storing 2 public cryptographic keys per user, along
with a version number that indicates key size and algorithm involved. That
version number is almost surely going to stay below 256 (maybe under 16 for a
long time), but its not impossible that it grows. The keys are most naturally
large extended numbers, but a byte representation is also natural. This is
part of an inverted table record structure. Because the version numbers are
small, and the extended numbers large, it would make sense that they are each
in their own inverted table box for space reasons
A reasonable alternative would be to have a list of 2 integers for the version
number
3 : '7!:5 < ''y''' 10000 2 $ i.256
262144
3 : '7!:5 < ''y''' 10000 2 $ {&a. i.256
32768
for 10k records, encoding 2 extended numbers 200 to 300 bits long as strings,
save 4.5MB. A 5x ratio stays for larger bit sizes too.
3 : '7!:5 < ''y''' {&a. 256 #.inv 2x ^ 100 + ?. 10000 2 $ 200
5.56723e6
3 : '7!:5 < ''y''' {&a. 256 #.inv 2x ^ 100 + ?. 10000 2 $ 200
1.04858e6
The shape of this is not very table-like though
$ {&a. 256 #.inv 2x ^ 100 + ?. 5 2 $ 200
5 2 38
So, since the versions and the keys have to be accessed together (the keys are
essentially conditional data with conditional size and meaning), it makes sense
to at least explore packing them all together into a binary object into its own
inverted table box. There is still ways to pick out parts of the datastructure
without unpacking all of it as well.
Another way to pack any data structure into a binary object would be
compress_zlib lr data (or
compress_zlib 3!:1 data). It should compress well if the data is all digits,
but you lose the ability to pick out parts of the data without unpacking it all.
----- Original Message -----
From: Joe Bogner <[email protected]>
To: [email protected]
Cc:
Sent: Wednesday, December 31, 2014 9:28 AM
Subject: Re: [Jprogramming] variable integer coding (challenge)
Thanks. I missed the point of the variable breakup point but exploring
different options below illustrated it for me.
Changing the breakup point can encode the numbers into different bit
lengths (other than 8) which can then be written to the file/memory.
For example, 1021 encoded in 511 breakups would take 18 bits
$ #: 511 toVint 1021
2 9
Whereas encoding it as 255 would require 40 bits, which is more than a
32-bit int. This is where I was failing to see the point of encoding
larger numbers
NB. 5 bytes, 40 bits
$ #: 255 toVint 1021
5 8
Without trying the binary representation, I was expecting the breakup
point to take the same space. For example, below I would have thought
that I'd need to write a
NB. 6 bytes
255 toVint 511 toVint 1021
255 255 1 255 255 0
If I understand this correctly, I can see how variable ints could
shave off a few bits if the range of numbers is known in advance,
which is often the case.
Regarding packed strings I've been thinking about it recently as well.
I would like to see how it could be used to pack a table of strings
(rows and columns).
The following string could be a packed representation of a 2x2 table
FrankUSATomGermany
The binary format could be:
[# of rows] [# of columns] [row1-col1 length] [row1-col2 length]....
The header for the string could be
2 2 5 3 3 7
Or maybe if you wanted to support a variable number of columns (unsure
of the usecase of this)
[# of rows] [row-1 # of columns] [row1-col1 length] [row1-col2 length]
2 2 5 3 2 3 7
I don't know well it'd perform relative to boxed tables or inverted
tables. I would imagine the memory would be reduced for storage but
I'd have to test to see how much memory would be used on different
operations. I'd be interested in the performance of operations like
searching (e.g. find all the rows with a country of Germany)
It might also be a safer way of storing user-input text without
worrying about escaping delimiters.
On Tue, Dec 30, 2014 at 1:18 PM, 'Pascal Jasmin' via Programming
<[email protected]> wrote:
> I also provided some adverbs
>
> toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]'
> fromVint=: 1 : 'm&~: +/;.2 ]'
>
> these let you set the breakup point to what you wish
>
> 15 toVint 33 12 17
> 15 15 3 12 15 2
> 511 toVint 1233
> 511 511 211
>
> 511 fromVint 511 toVint 1233 444
> 1233 444
> It can provide simple compression. If you have a list of many strings where
> almost all of them will fit into a 6bit alphabet, you can still encode a full
> 7 bit or 8bit alphabet. On average, you'd save space using 6bit alphabet,
> depending on your context.
>
> The variable string functions I showed, still let you do a substring search
> on the packed data. Part of it is disk and memory compression (don't expand
> until you need to, and a string takes much less memory than a list of numbers
> < 256), but part of it is not being forced in design to determine what the
> impossible cutoff number is.
>
> A big advantage related to J is having binary data records. If you encode
> the length of the binary data, you no longer have to box them if a trailing 0
> or space would be difficult to know whether it is a fill or intentional.
>
> Beyond binary data, there is dumb stuff such as storing addresses. Most
> people have just 1, Most fit within 255 bytes. Usually no one cares about
> address info except when they do or already have the client record. Do you
> really want a separate table with 1 to many join with a possible 10000
> character fixed size per address just because you're unsure if it will ever
> be possible for you to store an address to an African village as a set of
> driving instructions and then a description of the house relative to 3 other
> houses and landmarks just in case one of the other houses gets painted.
>
> Another neat feature about the coding method is that if you have a lot of
> inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least
> compresses well.
>
> A minor thing is that the encoding still lets you do search by < or >. If
> you are storing the number of orders per customer. The first byte tells you
> whether it is > 254 or less than any other number below 255 (probably enough
> info to select customers of interest). Another minor thing is that you don't
> have to use the encoding codes until there is an overflow. It may be easier
> to just switch functions to decode variable data, then to reshape terrabytes
> of data.
> ________________________________
> From: Joe Bogner <[email protected]>
> To: [email protected]
> Sent: Tuesday, December 30, 2014 10:45 AM
> Subject: Re: [Jprogramming] variable integer coding (challenge)
>
>
> Pascal,
>
> Thanks for sharing. Can you also share a potential use case?
>
> I had to remind myself about the size of integers (4 bytes) on 32-bit.
>
> Will this only save space on numbers less than 1019?
>
> NB. 4 bytes
> toVint2 1019
> 255 255 255 254
>
> NB. 5 bytes
> toVint2 1020
> 255 255 255 255 0
>
> Are you looking for the savings in terms of disk space (saving it to a
> file) or working memory?
>
>
>
>
>
>
> On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming
> <[email protected]> wrote:
>> A common design decision is how large to make a field. For many numbers, we
>> think that a byte will be large enough to hold practical values, but one day
>> we will find out that it is not. This is essentially the Y2k problem, or
>> year 2032 issue.
>>
>> A simple variable integer coding (tied to byte ranges)
>>
>> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each
>>
>> toVint 566 44
>> 255 255 56 44
>>
>> It now takes 3 bytes to store 566, and one byte to store 44.
>>
>> the challenge is to write fromVint such that
>>
>>
>> 566 44 -: fromVint 255 255 56 44
>>
>> as an extra challenge, is there a way to write toVint or fromVint such that
>> it is faster than using intermediate boxing?
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm