In the regex module I use an encoding scheme when pickling pattern objects which is based on the way MIDI encodes variable-length integers, and I think it might have a use in a pickle protocol.

In the basic format, an integer is split up into 7-bit chunks, each chunk put into a byte, and the most-significant bit of the byte used to signal whether the value continues into the following byte.

And integer must be encoded into the minimum number of bytes, so an encoded sequence of bytes would never start with 0x80.

MIDI deviates from the basic idea in order to reduce the number of bytes, so as sequence of bytes in MIDI _can_ start with x080; this is fine for MIDI because it doesn't need to represent negative integers.

The format I'm suggesting uses an initial 0x80 as a way of letting it encode negative integers.

Here are a couple of Python functions that summarise the encoding and decoding (minus obvious optimisations for simplicity):


def encode_varint(value: int) -> List[int]:
    negative = value < 0
    encoded = []

    if negative:
        final = -1
    else:
        final = 0

    while True:
        encoded.append(0x80 | (value & 0x7F))
        value >>= 7

        if value == final:
            break

    if negative:
        encoded.append(0x80)

    encoded.reverse()

    encoded[-1] &= 0x7F

    return encoded


def decode_varint(encoded: Iterable[int]) -> int:
    byte = next(encoded)

    if byte == 0x80:
        value = -1
        byte = next(encoded)
    else:
        value = 0

    value = (value << 7) | (byte & 0x7F)

    while (byte & 0x80) != 0:
        byte = next(encoded)
        value = (value << 7) | (byte & 0x7F)

    return value


The advantage of encoding integers in this way is that there's no limit to their size, so there's no need to add a new protocol to support larger counts.

They can also make pickles smaller.

Example:

    # Pickle (None, )
    0: \x80 PROTO      4
    2: \x95 FRAME      4
   11: N    NONE
   12: \x85 TUPLE1
   13: \x94 MEMOIZE    (as 0)
   14: .    STOP

Here, FRAME takes an argument of 8 bytes. If you replaced FRAME with a version that accepted a variable-length count, you could reduce that argument to 1 byte.

You also wouldn't need to have different fixed-length versions of an opcode, e.g. BINUNICODE and SHORT_BINUNICODE.

Whether you do anything with this is entirely up to the core devs, I just thought someone might find it useful.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to