Also, if you have only 4 19 bit values, you can pack them up tighter _2 (16&#.)\ #. |: ?. 4 19 $ 2 109 59 161 30 136 80 64 129 56 1
Non multiples of 8 would just generally leave some trailing 0s on decode, which could simply be understood as invalid. ----- Original Message ----- From: 'Pascal Jasmin' via Programming <[email protected]> To: "[email protected]" <[email protected]> Cc: Sent: Wednesday, December 31, 2014 11:40 AM Subject: Re: [Jprogramming] variable integer coding (challenge) base64 is essentially taking 3 bytes and storing them as 4. The inverse can take 4 6bit values and pack them into 3 bytes. The J code for base64 is pretty generic to see how to do it for any base. 7 bit values could be packed 8 at a time in 7 bytes. There is a different way to pack 8 x bit values into x bytes. Each byte would store the i'th bit of the 8 values. This may be even faster than base64 type coding. #. |: ?. 8 19 $ 2 104 214 59 183 174 23 31 225 134 149 66 31 85 19 133 1 51 148 28 ----- Original Message ----- From: Joe Bogner <[email protected]> To: [email protected] Cc: Sent: Wednesday, December 31, 2014 10:02 AM Subject: Re: [Jprogramming] variable integer coding (challenge) Can 19 bits be stored in memory or a file on modern computers without taking up 3 bytes (24 bits)? The only idea that came to mind would some bit packing scheme with multiple items. That seems complicated unless there was a need to store billions of these numbers On Dec 31, 2014 9:28 AM, "Joe Bogner" <[email protected]> wrote: > Thanks. I missed the point of the variable breakup point but exploring > different options below illustrated it for me. > > Changing the breakup point can encode the numbers into different bit > lengths (other than 8) which can then be written to the file/memory. > > For example, 1021 encoded in 511 breakups would take 18 bits > > $ #: 511 toVint 1021 > 2 9 > > Whereas encoding it as 255 would require 40 bits, which is more than a > 32-bit int. This is where I was failing to see the point of encoding > larger numbers > > NB. 5 bytes, 40 bits > $ #: 255 toVint 1021 > 5 8 > > Without trying the binary representation, I was expecting the breakup > point to take the same space. For example, below I would have thought > that I'd need to write a > > NB. 6 bytes > 255 toVint 511 toVint 1021 > 255 255 1 255 255 0 > > > If I understand this correctly, I can see how variable ints could > shave off a few bits if the range of numbers is known in advance, > which is often the case. > > > Regarding packed strings I've been thinking about it recently as well. > I would like to see how it could be used to pack a table of strings > (rows and columns). > > The following string could be a packed representation of a 2x2 table > > FrankUSATomGermany > > The binary format could be: > > [# of rows] [# of columns] [row1-col1 length] [row1-col2 length].... > > The header for the string could be > 2 2 5 3 3 7 > > Or maybe if you wanted to support a variable number of columns (unsure > of the usecase of this) > > [# of rows] [row-1 # of columns] [row1-col1 length] [row1-col2 length] > 2 2 5 3 2 3 7 > > I don't know well it'd perform relative to boxed tables or inverted > tables. I would imagine the memory would be reduced for storage but > I'd have to test to see how much memory would be used on different > operations. I'd be interested in the performance of operations like > searching (e.g. find all the rows with a country of Germany) > > It might also be a safer way of storing user-input text without > worrying about escaping delimiters. > > On Tue, Dec 30, 2014 at 1:18 PM, 'Pascal Jasmin' via Programming > <[email protected]> wrote: > > I also provided some adverbs > > > > toVint =: 1 : '[: (({."1 ,. 1:) #&, m ,. {:"1) (0, m) #: ]' > > fromVint=: 1 : 'm&~: +/;.2 ]' > > > > these let you set the breakup point to what you wish > > > > 15 toVint 33 12 17 > > 15 15 3 12 15 2 > > 511 toVint 1233 > > 511 511 211 > > > > 511 fromVint 511 toVint 1233 444 > > 1233 444 > > It can provide simple compression. If you have a list of many strings > where almost all of them will fit into a 6bit alphabet, you can still > encode a full 7 bit or 8bit alphabet. On average, you'd save space using > 6bit alphabet, depending on your context. > > > > The variable string functions I showed, still let you do a substring > search on the packed data. Part of it is disk and memory compression > (don't expand until you need to, and a string takes much less memory than a > list of numbers < 256), but part of it is not being forced in design to > determine what the impossible cutoff number is. > > > > A big advantage related to J is having binary data records. If you > encode the length of the binary data, you no longer have to box them if a > trailing 0 or space would be difficult to know whether it is a fill or > intentional. > > > > Beyond binary data, there is dumb stuff such as storing addresses. Most > people have just 1, Most fit within 255 bytes. Usually no one cares about > address info except when they do or already have the client record. Do you > really want a separate table with 1 to many join with a possible 10000 > character fixed size per address just because you're unsure if it will ever > be possible for you to store an address to an African village as a set of > driving instructions and then a description of the house relative to 3 > other houses and landmarks just in case one of the other houses gets > painted. > > > > Another neat feature about the coding method is that if you have a lot > of inefficient codes, ie. storing 1020 as 255 255 255 255 0, it at least > compresses well. > > > > A minor thing is that the encoding still lets you do search by < or >. > If you are storing the number of orders per customer. The first byte tells > you whether it is > 254 or less than any other number below 255 (probably > enough info to select customers of interest). Another minor thing is that > you don't have to use the encoding codes until there is an overflow. It > may be easier to just switch functions to decode variable data, then to > reshape terrabytes of data. > > ________________________________ > > From: Joe Bogner <[email protected]> > > To: [email protected] > > Sent: Tuesday, December 30, 2014 10:45 AM > > Subject: Re: [Jprogramming] variable integer coding (challenge) > > > > > > Pascal, > > > > Thanks for sharing. Can you also share a potential use case? > > > > I had to remind myself about the size of integers (4 bytes) on 32-bit. > > > > Will this only save space on numbers less than 1019? > > > > NB. 4 bytes > > toVint2 1019 > > 255 255 255 254 > > > > NB. 5 bytes > > toVint2 1020 > > 255 255 255 255 0 > > > > Are you looking for the savings in terms of disk space (saving it to a > > file) or working memory? > > > > > > > > > > > > > > On Mon, Dec 29, 2014 at 12:12 PM, 'Pascal Jasmin' via Programming > > <[email protected]> wrote: > >> A common design decision is how large to make a field. For many > numbers, we think that a byte will be large enough to hold practical > values, but one day we will find out that it is not. This is essentially > the Y2k problem, or year 2032 issue. > >> > >> A simple variable integer coding (tied to byte ranges) > >> > >> toVint =: [: ; (255&| ,~ 255 #~ <.@%&255) each > >> > >> toVint 566 44 > >> 255 255 56 44 > >> > >> It now takes 3 bytes to store 566, and one byte to store 44. > >> > >> the challenge is to write fromVint such that > >> > >> > >> 566 44 -: fromVint 255 255 56 44 > >> > >> as an extra challenge, is there a way to write toVint or fromVint such > that it is faster than using intermediate boxing? > >> ---------------------------------------------------------------------- > >> For information about J forums see http://www.jsoftware.com/forums.htm > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
