wow On Wed, Jul 1, 2009 at 2:20 PM, Laszlo KREKACS<[email protected]> wrote: >> I dont want to compress at all. The 118MB for me is perfect. I only >> want to pack the directory into a file. But not compressing. >> Im thinking about tar or ar. > > Hi! > > I have studied all the available archive and compression options. > Most notably tar[1][2][4][6] and zip file format [3]. > They are the most common archive types. I read also ar (dpkg > and ipkg uses it) and cpio format. So I did my homework, and > made some researches. > > Our requirements: > - no compression (no wasted cpu time) > - random access (no slow waiting time and memory issue) > - readily available module/library for easy of integrating > (best: no additional package is required to install on the phone) > > Tar completely fail at random access, simply it lacks the > table of content, so accessing the last file in the archive > requires reading the whole content before it. > > Zip support accessing each files in the archive, although > it compress the file by default. > > There are dar[5] and xar[7], which meets our random access > criteria. However dar needs to be ported to the device, and > xar is still in development (that means limited python support > for example). > > So I wrote down the most dumb archive fileformat ever;) > When I wrote the specification, I only had one goal: > make it so simple, that everybody can implement it, > so no need to wait for ready-made library. > > It is called KISS fileformat (keep it simple and stupid), > the preferred extension would be filename.kiss > > You can read it here, I also included it (at the end of mail) > for reference: > http://pastebin.com/m608acaeb > > I think it is suitable for our map tile usage. > > What do you think? > > Best regards, > Laszlo > > [1]: http://en.wikipedia.org/wiki/Tar_(file_format) > [2]: http://www.python.org/doc/2.5.2/lib/module-tarfile.html > [3]: http://www.python.org/doc/2.5.2/lib/module-zipfile.html > [4]: http://en.wikipedia.org/wiki/Comparison_of_file_archivers > [5]: http://en.wikipedia.org/wiki/DAR_(Disk_Archiver) > [6]: http://en.wikipedia.org/wiki/Archive_formats > [7]: http://code.google.com/p/xar/ > > KISS archive fileformat specification: > > # KISS archive format (Keep It Simple and Stupid) > > ## General properties > - blocksize: 512 bytes > - only store filename (and directory if any) and content > - first file contains the filenames > - header: start block, end block, position of last block > > ## Overall file structure > [header][filenames][1. file][2. file][3. file] > > ## [header] > [SB][EB][POS] [SB][EB][POS] [SB][EB][POS] etc.. > [ 4][ 4][ 2] [ 4][ 4][ 2] [ 4][ 4][ 2] etc.. > [ header ] [ filenames ] [1. file ] etc.. > > SB (start block): 4 byte > EB (end block): 4 byte > POS (position of last block): 2 byte > > All numbers are stored big-endian. That means most significant bit first. > Example: > 613 dec = 265 hex = \00 \00 \02 \65 (4 bytes) > 130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes) > > Note: > The remaining part of the header block MUST be filled with zero bytes. > You will always have remaining part in the block, simply each file > takes 10 bytes. (512/10 = 51 and 2 bytes left) > > ## [filenames] > UTF-8 text for each filename, delimited with '\n' byte. > The directory structure is preserved too. > [name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc.. > > Some examples: > this is a file.txt > this2.tar.gz > this3.html > images/loller.html > weird_dir/this\/files contains\/several\\ slashes.txt > > Special characters: > '\n': You cant have '\n' character in the filename. It is preserved. > (it is not supported in most filesystems anyway) > '/': directory delimiter. To save directory structure. > '\/': if the filename itself contains an / character > '\\': if the filename itself contains a \ character > > > ## [X. file] > The file content as is. > > > ## FAQ: > Q: Why another archive format? > A: Because it is the most dumb format ever;) > > Q: Why not tar, ar, zip, [name archive type here]? > A: Short answer: widely used archive format are not suited for random access > with no compression. > Long answer: tar: there is no index, reading the last file of the archive > requires reading the whole file before it. > zip: individual files are compressed, which means: > processortime > xar: it would fit the requirements, but it is not widely > supported, and not in every language. > > Q: I use X language does KISS supported there? > A: The fileformat is so simple, it is intented, every programmer > could implement it in "no time". > > Q: Does compression supported? > A: No. But you can compress the whole file, > just like in tar case: filename.kiss.bz2. Use it for file sharing. > > Q: Do advanced features (rights, symlinks, hardlinks, user/group/other) are > preserved? > A: No. It was not the goal of this archive. Although you can implement it, > just > write those informations in the first file. It is not recommended. > > Q: If the original file is not multiple of 512 bytes, how it will look in the > archive, how many bytes will it take? > A: Lets have an example. We have three files: > 768bytes file, 1024 bytes, 2047 bytes > First file (768 bytes) will take two blocks: 2*512 = 1024 bytes > Second file (1024 bytes) will take two blocks too: 2*512 = 1024 bytes > Third file (2047 bytes) will take four blocks: 4*512 = 2048 bytes > Lets name the files: > - "first filename.extension", (24 bytes) > - "second try", (10 bytes) > - "I want a sexy name.txt", (22 bytes) > The [filenames] section: > [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58bytes = 1 block > > The header ([SB][EB][POS]): > [00 00 00 00][00 00 00 00][00 31] (this is the header itself, start at the > 0. block, ends at 0. block, and > the header is 50 bytes long. That means > start at the 0. byte and > ends at the at the 49. byte. > (0..49 = 50 bytes, which is 0x31)) > [00 00 00 01][00 00 00 01][00 39] (58 byte long, that means 0..57 bytes > and 57 dec = 0x39 hexa) > [00 00 00 02][00 00 00 03][00 FF] (768-512 = 256, so 0..255 bytes. > 255dec = FF hexa) > [00 00 00 04][00 00 00 05][01 FF] (1024 bytes = 2 blocks, the > second is full) > [00 00 00 06][00 00 00 09][01 FE] ( 2047-(3*512) = 511. 510dec = 1FE hexa) > > > The overall filesize: > [header][filenames][1. file][2. file][3. file] > [ 1 ][ 1 ][ 2 ][ 2 ][ 4 ] = 10 blocks = 5120B = 5kB > > Q: How is it filled the unused part of the block (if the file is > not multiple of 512 bytes) ? > A: It can be random bytes. But should be zero bytes. Or checksum if there is > enough space left (see section "An insane idea for checksums"). > > ## Implementation advices > > 1. Count the files what you want to archive -> you know how much > space is required by the header. 1-49 files requires one block for the > header > (10 bytes for the header, 10 bytes for the filenames section, x*10 bytes > for the files itself. Maximum x is 49 for one block) > 2. dump 0xFF for the header (look at the "tape archiving" to understand why > FF) > 3. Generate the filenames section and write the filenames section. > 4. Attache each file to the archive, and generate the header > on-the-fly (in memory) > 4. Overwrite the header with valid data. > > ## An insane idea for checksums (integrity checking) > > Here is the idea, write the checksum at the remaining space, if the > file is multiple of 512 byte, write two checksums at the next end of file, > if there is no enough space, write it at the next file, and so on. > If each file was multiple of 512 bytes, or there are not enough space > at the end of each files. There will be no checksum for some of the last files > (but it is always better then having no checksums at all). > Which should be rare, but if you are worried about it, you can always add > a new file with all the necessary informations. > > This section is not mandatory for the fileformat. So if you are brave enough, > implement it! If you dont care, no worries. > > CRC32 is 4 bytes (32 bits) long. > > I think 4bytes should be safe enough;) > A little example code in python how to calculate it: > import binascii > > def crc2hex(crc): > res='' > for i in range(4): > t=crc & 0xFF > crc >>= 8 > res='%02X%s' % (t, res) > return res > > if __name__=='__main__': > test='hello world! and Python too ;)' > crc=binascii.crc32(test) > print 'CRC:', crc > hex_str = crc2hex(crc) > print 'CRC in hex:', hex_str > print 'in byte representation: ', hex_str.decode("hex") > > MD5sum is 16bytes long. > > The CRC32 (4bytes) is recommended. It is enough to detect inconsistencies. > > > ## An another insane idea for tape archiving > > Who the hell use tapes these days?;) > So if the first block is filled with FF hexa, it means the header is at the > end of the archive file, the tailer is more right term;) > So when you archive the tape, you cant reverse and > write the header at the beginning of file. > In that case, the header (at the end of file) is in REVERSE order. > So the last 4 bytes tells where the header begins. So no need to search for > it. > Simply read the last 4 bytes, determine where the header > begins(reverse order!), > and read those blocks. Reverse the byte orders, and thats way you can process > it normally. > > _______________________________________________ > Openmoko community mailing list > [email protected] > http://lists.openmoko.org/mailman/listinfo/community >
_______________________________________________ Openmoko community mailing list [email protected] http://lists.openmoko.org/mailman/listinfo/community

