It seems that git loose objects are zlib-compressed. If you pipe them
through a zlib decompressor (like
/usr/share/doc/libcompress-zlib-perl/examples/filtinf if you have
libcompress-zlib-perl installed), you can see the actual data, which
is formatted pretty simply: it's one of the words "commit", "blob",
"tree" or "tag", followed by a space, then the size of the object as
decimal digits, then a zero byte, e.g., "commit 404\0". The object's
content follows the zero byte.
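
As a rough illustration of that layout, here is a minimal Python sketch
that inflates a loose object and splits off its header. The function
name parse_loose_object is made up for this example, and it assumes the
whole object fits in memory.

```python
import zlib

def parse_loose_object(raw):
    """Inflate a loose object and split '<type> <size>\\0' from the body."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if kind not in (b"commit", b"blob", b"tree", b"tag"):
        raise ValueError("not a git object header: %r" % header)
    if int(size) != len(body):
        raise ValueError("size field does not match body length")
    return kind.decode("ascii"), int(size), body

# Build a fake blob in memory rather than reading one from .git/objects.
sample = zlib.compress(b"blob 6\0hello\n")
print(parse_loose_object(sample))
```

To try it on a real object, read one of the files under .git/objects in
binary mode and pass its contents in.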

The problem is that there's no usable compression header on the files:
a zlib stream starts with only a two-byte header rather than a
gzip-style magic string, so "file -z" doesn't look inside the files,
because they don't match any of its compression magic patterns. Unless
the way that -z works is changed to try decompression on every file, I
don't think it's possible to detect git loose objects, simply because
they don't have any distinctive header data.
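
For what it's worth, "try decompression on every file" could look
something like the Python sketch below. This only illustrates the idea
and is not how file(1) works; the name looks_like_loose_object and the
64-byte probe size are assumptions chosen for the example.

```python
import re
import zlib

# '<type> <decimal size>\0' at the very start of the inflated data.
HEADER = re.compile(rb"(commit|blob|tree|tag) [0-9]+\0")

def looks_like_loose_object(raw, probe=64):
    """Inflate at most `probe` output bytes and test for a git object header."""
    try:
        start = zlib.decompressobj().decompress(raw, probe)
    except zlib.error:
        # Not a zlib stream at all.
        return False
    return HEADER.match(start) is not None
```

Only the first few inflated bytes are needed, so the cost per file is
small, but it would still be paid for every file that matches nothing
else.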

http://book.git-scm.com/7_browsing_git_objects.html
http://book.git-scm.com/7_the_packfile.html
http://repo.or.cz/w/git.git?a=blob;f=Documentation/technical/pack-format.txt;h=1803e64e465fa4f8f0fe520fc0fd95d0c9def5bd;hb=HEAD

This is true for the packfile format up until version 1.6 as well; the
newer version of the .idx file has a magic number in the header, but
the older one does not. The .pack file begins with the magic 'PACK',
followed by a four-byte version number and a four-byte count of
contained objects, but this partly conflicts with id Software's .PAK
file format, which begins with "PACK" as well.

http://www.gamers.org/dEngine/quake/spec/quake-spec33/qkspec_3.htm

On the other hand, the Quake packs use a little-endian offset
immediately following the magic, the first byte of which is never
zero, while the git packs use a big-endian version number, the first
byte of which is always zero. We can then pretend that the magic for
git packs is 'PACK\0', and this seems to disambiguate it reliably from
id's format.
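
The same test, written out as a small Python sketch. The function name
classify_pack and the returned strings are made up for this example,
and it relies on the heuristic described above (that the low byte of
the Quake directory offset is nonzero) rather than on anything more
rigorous.

```python
import struct

def classify_pack(header):
    """Guess the format from the first 12 bytes of a file starting with 'PACK'."""
    if len(header) < 12 or header[:4] != b"PACK":
        return "unknown"
    if header[4] == 0:
        # Git: big-endian version, then big-endian object count.
        version, count = struct.unpack(">II", header[4:12])
        return "git pack, version %d, %d objects" % (version, count)
    # id Software: little-endian directory offset follows the magic.
    # Heuristic: assumes its first (low) byte is nonzero, as stated above.
    (offset,) = struct.unpack("<I", header[4:8])
    return "id PAK, directory at offset %d" % offset
```

For example, classify_pack(open(path, "rb").read(12)) would classify
the file at path, where path is whatever file you want to check.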

I believe that this is all the magic that can be applied to git
objects; the formats aren't particularly amenable to it. The only
other thing I can think of is commenting out the VAX COFF sections; I
don't know how common those files are, but they seem to produce a lot
of false positives. (Also, the README states that "Match of <= 16 bits
are not accepted", and the VAX COFF magics are only 16 bits long.)

The attached patch applies against current Debian git, and implements
what's described above. It doesn't fix the problem where
zlib-compressed objects are detected as VAX COFF, but it does do the
rest.

Adam Buchbinder
From 4785506b7f3371e49e3f93187597d0e496a52a3a Mon Sep 17 00:00:00 2001
From: Adam Buchbinder <[email protected]>
Date: Tue, 3 Feb 2009 13:27:37 -0500
Subject: [PATCH] Add detection magic for git packs and indexes, making sure it
 doesn't conflict with id Software .PAK files.

---
 debian/patches/00list                   |    1 +
 debian/patches/342-magic-add-git.dpatch |   44 +++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 0 deletions(-)
 create mode 100755 debian/patches/342-magic-add-git.dpatch

diff --git a/debian/patches/00list b/debian/patches/00list
index 82c74bf..6d2dba7 100644
--- a/debian/patches/00list
+++ b/debian/patches/00list
@@ -37,6 +37,7 @@
 339-magic-add-scribus.dpatch
 340-magic-add-selinux.dpatch
 341-magic-add-bzr.dpatch
+342-magic-add-git.dpatch
 901-file-mgc.dpatch
 903-file-localmagic.dpatch
 904-file-make.dpatch
diff --git a/debian/patches/342-magic-add-git.dpatch b/debian/patches/342-magic-add-git.dpatch
new file mode 100755
index 0000000..d517002
--- /dev/null
+++ b/debian/patches/342-magic-add-git.dpatch
@@ -0,0 +1,44 @@
+#! /bin/sh /usr/share/dpatch/dpatch-run
+## 342-magic-add-git.dpatch by Adam Buchbinder <[email protected]>
+##
+## All lines beginning with `## DP:' are a description of the patch.
+## DP: Add detection for git packs and indexes, making sure it doesn't
+## DP: clash with id Software PACK files. (Closes: #509942)
+
+@DPATCH@
+diff -urNad file~/magic/Magdir/games file/magic/Magdir/games
+--- file~/magic/Magdir/games	2009-01-29 16:01:53.000000000 -0500
++++ file/magic/Magdir/games	2009-02-03 13:20:29.000000000 -0500
+@@ -33,6 +33,7 @@
+ # Quake
+ 
+ 0       string  PACK    Quake I or II world or extension
++>8	lelong	>0	\b, %d entries
+ 
+ #0       string  -1\x0a  Quake I demo
+ #>30     string  x        version %.4s
+diff -urNad file~/magic/Magdir/revision file/magic/Magdir/revision
+--- file~/magic/Magdir/revision	2009-01-29 16:01:53.000000000 -0500
++++ file/magic/Magdir/revision	2009-02-03 13:20:29.000000000 -0500
+@@ -12,6 +12,21 @@
+ # From: Josh Triplett <[email protected]>
+ 0	string	#\ v2\ git\ bundle\n	Git bundle
+ 
++# Type: Git pack
++# From: Adam Buchbinder <[email protected]>
++# The actual magic is 'PACK', but that clashes with Doom/Quake packs. However,
++# those have a little-endian offset immediately following the magic 'PACK',
++# the first byte of which is never 0, while the first byte of the Git pack
++# version, since it's a tiny number stored in big-endian format, is always 0.
++0	string	PACK\0		Git pack
++>4	belong	>0		\b, version %d
++>>8	belong	>0		\b, %d objects
++
++# Type: Git pack index
++# From: Adam Buchbinder <[email protected]>
++0	string	\377tOc		Git pack index
++>4	belong	=2		\b, version 2
++
+ # Type:	Mercurial bundles
+ # From:	Seo Sanghyeon <[email protected]>
+ 0	string	HG10		Mercurial bundle,
-- 
1.5.6.3
