Sorry to be that guy who lurks on a mailing list and then suddenly drops a book, but I think this is the best place for me to solicit feedback on the following.
Inspired by LANGSEC and Packet in Packet, I have been thinking about methods of encapsulating data within other data such that the encapsulated data cannot be mistaken for other data. My primary goal is to find new classes of error correcting codes that support unambiguous encapsulation. Most methods of channel coding encapsulate data ambiguously. For example, a packet's payload in a wireless communication protocol is encoded and modulated in the same manner as the packet's header. This ambiguity enables a class of attacks exemplified by Packet in Packet. Similarly, data at rest are typically encapsulated in a manner that can be confused with surrounding data or executable code. This ambiguity enables many types of attack in which user supplied data are interpreted as executable code. I hope to identify new ways to encapsulate data in other data using encodings that unambiguously differentiate inner (encapsulated) data from outer data. a simple example: delimited base64 files The csv file format is simpler and safer to parse if each data element in the file is encoded with characters that are never used as (outer) delimiters. A traditional csv file might look as follows: toorcon,october,san diego shmoocon,february,"washington, dc" defcon,july,las vegas A comma is used as a field delimiter, and a newline is used as a record delimiter. Quotation marks are optionally used as text delimiters and in this case serve to prevent the inner comma from being interpreted as an outer comma. If each field is base64 encoded, the same file looks like this: dG9vcmNvbg==,b2N0b2Jlcg==,c2FuIGRpZWdv c2htb29jb24=,ZmVicnVhcnk=,d2FzaGluZ3RvbiwgZGM= ZGVmY29u,anVseQ==,bGFzIHZlZ2Fz Because the comma and newline do not appear in base64, the encapsulation of inner data elements is unambiguous. Outer data consist only of commas and newlines. Inner data are represented only by characters that appear in base64 encoding. This encapsulation avoids the problems associated with escaping and quotation. Any commas or newlines present within a field are unambiguously differentiated from outer commas and newlines because they are base64 encoded. (For a practical standard, I suggest choosing delimiters other than commas and newlines. Because the base64 encoded data are not human readable, there is no reason to invite the application of historical csv parsing variations such as newline encoding. I would like to develop a clear specification for such a file format and am interested in any thoughts you may have regarding delimiter selection and whether or not historical variations in either csv or base64 should be tolerated.) Apart from implementing base64, it wouldn't take much code to parse delimited base64 files. I notice that base64.py on my system is shorter than csv.py despite including base32 and base16 implementations, more lines of comments, main(), and test(). It also has fewer deeply indented lines. There are two occurrences of the word "guess" in function names in csv.py and none in base64.py. I strongly suspect that a more thorough investigation of the relative complexity would agree. Delimited base64 files can encode any 8-bit data while csv files typically may include only plain text. They are nestable because the inner encoding can encode all characters found in both the (outer) delimiters and the (inner) encoded data. Nestability is an interesting property of this scheme that is not found in every scheme for unambiguous encapsulation. Probably the best example of a similar solution is Dan's Interpolique. However, the outer character set (the delimiters) for a delimited base64 file format may be chosen so that it does not overlap at all with the character set of the inner encoding. This makes it easier to prove that malicious inner data can't be interpreted as outer data. the main show: error correcting codes I think that a specification for a base64 delimited file format would be helpful for introducing people to the concept of unambiguous encapsulation, but my main area of interest is in ways that error correcting codes can be used to unambiguously encapsulate data within other data. When data are stored or transmitted over an analog medium, they are typically encoded with an error control code providing some degree of error detection or correction. These codes use codewords of several bits to represent a smaller number of data bits. The additional bits of transmitted information may be used by the receiver of the data to detect or correct bit errors introduced by the medium. It may be beneficial to select codes that include distinct codewords to represent encapsulated data. Consider two codes with 3?bit codewords: The first code encodes an encapsulation flag and one data bit into a 3?bit codeword. The first bit in a codeword is an encapsulation flag. The second bit is the data bit. The third bit is a parity bit (even parity): codeword encapsulation bit data bit parity bit 000 0 (outer) 0 0 011 0 (outer) 1 1 101 1 (inner) 0 1 110 1 (inner) 1 0 Using this code, the data bits: outer 0, outer 1, outer 1, inner 1, inner 1, inner 0, outer 0 would be encoded as the binary sequence: 000011011110110101000. This code provides 1 bit error detection. Any one flipped bit results in an invalid code word at the receiver. The second code also encodes an encapsulation flag and one data bit into a 3?bit codeword. The first bit in a codeword is an encapsulation flag. The second bit is the data bit. The third bit provides isolation: codeword encapsulation bit data bit isolation bit 000 0 (outer) 0 0 010 0 (outer) 1 0 101 1 (inner) 0 1 111 1 (inner) 1 1 This code does not provide error detection. A single flipped bit can result in a valid codeword at the receiver. However, it provides 1 bit isolation. Any one flipped bit is guaranteed to not break encapsulation. In this simple case, the isolation bit is a repetition of the encapsulation bit. Comparing these two codes, it is best to choose the code with error detection. By detecting any one bit error, the receiver is just as able to prevent encapsulation breakage as it would be with the second code. In other words, detection provides isolation, but isolation does not provide detection. I am embarking on a project to identify encoding schemes that are useful for unambiguous encapsulation. In particular, I expect to find longer codes (with codewords of length greater than five bits) that have interesting encapsulation properties. The first class of codes that my project will seek consists of linear block codes that provide more bits of isolation than detection (without squandering an opportunity to provide more detection as above). I also hope to identify additional classes of error correcting codes with encapsulation properties. For each class of codes, I will either prove by construction that they exist, or I will prove by exhaustive search that they do not exist up to a certain codeword length. As far as I know, the property of isolation (guaranteeing encapsulation in the presence of a certain number of bit errors) is a novel way to judge the effectiveness of error control codes. Codes are selected today primarily by the probability of uncorrectable or undetectable bit errors and the efficiency of the code (the number of data bits vs. transmitted symbols). I propose that future selection of error control codes should also consider the probability of bit errors breaking encapsulation. Error control codes are traditionally generated by analytic methods, but they can also be found by brute force search. By implementing brute force methods, my project will produce error control codes with certain properties even if it does not succeed in producing analytic methods for the generation of codes. As a preliminary test, I wrote a Python program to enumerate all possible codes of a given codeword length that have specified properties. For example, it produced the following list of all binary linear block codes including 00000 with five?bit codewords, two data bits (meaning there are four codewords in the code), and a minimum Hamming distance of three (meaning that each codeword differs from each other codeword by at least three bits): 00000, 11001, 10110, 01111 00000, 11010, 10111, 01101 00000, 11011, 10101, 01110 00000, 11011, 10110, 01101 00000, 11100, 10011, 01111 00000, 11100, 10111, 01011 00000, 11100, 11011, 00111 00000, 11101, 10011, 01110 00000, 11101, 10110, 01011 00000, 11101, 11010, 00111 00000, 11110, 10011, 01101 00000, 11110, 10101, 01011 00000, 11110, 11001, 00111 Each of the above lines would be called a (5,2,3) linear block code in the traditional notation of coding theory. One way to think of a code that has the property of isolation is as a pair of complementary codes. Consider the following pair of codes found using the Python program: 00000, 00011, 00101, 01001 01110, 10110, 11010, 11100 Each of these is a (5,2,2) code (five?bit codewords, two data bits, minimum Hamming distance of two), but the codes are complementary in that the minimum Hamming distance from one code to the other code is three. One of the pair could be used to encode inner (encapsulated) data and the other to encode outer data. A single bit error can be detected by the receiver of a message so encoded. For example, if 00001 is received, the receiver would find that the received codeword is not valid (in either code). A pair of bit errors could result in an invalid codeword (e.g. a transmitted 00000 could be received as 11000) but could also result in a valid codeword from the same code (e.g. a transmitted 00000 could be received as 00011). However, it is impossible for a pair of bit errors to result in a codeword that belongs to the other code. A minimum of three bit errors would be required for the receiver to interpret a codeword as belonging to the wrong code, thus breaking encapsulation. In most systems, the likelihood of three out of five bits being flipped is considerably less than the likelihood of one or two flipped bits. Because my preliminary program has already identified such complementary codes with codewords up to five bits long, it is virtually certain that longer codes can be found by a longer running search performed by more efficient software. Longer codes have greater error correction and detection capabilities, and I expect them to have more interesting isolation capabilities as well. tentative conclusion: Unambiguous encapsulation is a useful design pattern that can (should?) be applied in any case where data are encapsulated. I hope that my project will produce codes that are useful in practical applications. At the very least, I hope to promote discussion of unambiguous encapsulation in general. If you have any thoughts about my project, I would love to hear from you!
_______________________________________________ langsec-discuss mailing list firstname.lastname@example.org https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss