I thought I would write a small summary of an interesting paper that appeared at ACM CCS. http://www.ecsl.cs.sunysb.edu/tr/TR246.pdf is a link to the paper ‘Large-Scale Malware Indexing Using Function-Call Graphs’, which originates from Symantec.
Essentially, the system identifies variants of malware, much like http://www.vxclass.com. It looks at the call graph of a particular program and measures its similarity to the known call graphs of malware. If the similarity is high, then a malware variant has been identified. Using call graph similarity to identify malware was introduced in http://www.f-secure.com/weblog/archives/carrera_erdelyi_VB2004.pdf.
Call graph similarity is found by identifying common nodes in the two graphs. Common nodes can be identified by having identical control flow graphs, by having the same crc32 checksum over the bytes of the function each node represents, or in any other way you can think of.
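Here's a quick sketch (mine, not the paper's) of the crc32 approach: checksum the raw bytes of every function in two binaries and pair up the functions whose checksums collide. The function names and byte strings are invented for illustration.

```python
import zlib

def match_functions(funcs_a, funcs_b):
    """Pair functions from two binaries whose raw bytes share a crc32
    checksum. funcs_a and funcs_b map function names to raw bytes.
    Illustrative only; real systems also compare control flow graphs."""
    by_crc = {zlib.crc32(code): name for name, code in funcs_b.items()}
    return {name: by_crc[zlib.crc32(code)]
            for name, code in funcs_a.items()
            if zlib.crc32(code) in by_crc}

a = {"f1": b"\x55\x8b\xec\xc3", "f2": b"\x90\xc3"}
b = {"g1": b"\x55\x8b\xec\xc3", "g2": b"\xcc"}
print(match_functions(a, b))  # {'f1': 'g1'}
```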
The Symantec paper treats similarity as a function of the graph edit distance. The edit distance is the minimum number of operations that must be performed to convert one object into another. Incidentally, there are related edit distances, such as tree edit distances and string edit distances, with many variations depending on what types of operations are allowed. In fact, my paper to be published in January uses one of the edit distances I just mentioned. It’s a useful concept.
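To make the idea concrete, here is a minimal sketch of string edit distance (Levenshtein distance), the simplest member of the family. This is my own illustration of the general concept, not code from the paper; graph edit distance follows the same principle, but over node and edge insertions, deletions and relabellings.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the edit distance between the processed prefix of a
    # and b[:j]; we only ever need the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```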
Vxclass builds graph similarity using a greedy node matching algorithm. It actually tries to find unique node matches first, so the greedy heuristic isn’t a problem, but presumably it falls back to the greedy solution once unique matches are exhausted. The Symantec paper instead tries to find a minimum-cost matching between the nodes of two graphs, using a bipartite graph matching algorithm based on the Hungarian algorithm. This is novel.
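Here's a sketch of how such a minimum-cost matching might be computed. The cost matrix is invented for illustration (say, pairwise distances between the functions of two binaries); scipy's linear_sum_assignment solves the assignment problem with a Hungarian-style algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: cost[i][j] is the cost of matching
# function i of binary A to function j of binary B (for instance,
# an edit distance between their control flow graphs).
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

rows, cols = linear_sum_assignment(cost)  # Hungarian-style solver
print(list(zip(rows, cols)))   # the optimal pairing of functions
print(cost[rows, cols].sum())  # total cost of the matching, i.e. 5
```

The total cost of the optimal matching is then what stands in for the distance between the two graphs.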
The other novel feature of the Symantec paper is the use of metric trees to speed up searching the call graph database. A metric space (http://en.wikipedia.org/wiki/Metric_space) is defined by a distance function between two particular points or objects. The distance function must satisfy certain conditions, such as d(x, y) >= 0, d(x, y) == d(y, x) and d(x, x) == 0. The triangle inequality (http://en.wikipedia.org/wiki/Triangle_inequality) must also hold, which is d(x, z) <= d(x, y) + d(y, z). If these conditions are true for all objects or points, then that’s great, because spatial searches on the data can take advantage of these properties and run much faster. A metric tree exploits the properties of a metric space, and can answer spatial queries for the similar or nearest neighbours of a point in that space faster than comparing against each point or element in the space.
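The payoff is that the triangle inequality lets you bound a distance without computing it. A tiny sketch of the pruning rule metric trees exploit (my own illustration): if we already know d(q, v) and d(v, x), then d(q, x) >= |d(q, v) - d(v, x)|, so x can be discarded whenever that lower bound exceeds the search radius.

```python
def can_prune(d_qv: float, d_vx: float, tau: float) -> bool:
    """Return True if x cannot be within distance tau of the query q.

    By the triangle inequality, d(q, x) >= |d(q, v) - d(v, x)|, so if
    that lower bound already exceeds tau we skip computing d(q, x),
    which for graphs means skipping an expensive matching."""
    return abs(d_qv - d_vx) > tau

# The query is 10.0 from the vantage point v, x is 2.0 from v, and we
# only want neighbours within tau = 3.0: d(q, x) >= 8 > 3, so prune.
print(can_prune(10.0, 2.0, 3.0))  # True
```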
The cost of the graph matching described earlier is approximately the graph edit distance, and the graph edit distance forms a metric space. The Symantec paper uses vantage point trees (http://en.wikipedia.org/wiki/VP-tree), but there are other metric trees, such as BK trees, which also perform well. I seem to recall reading that vp trees are best suited to distance functions that return real numbers, whereas BK trees only work on integer distances. I don’t know why Symantec chose vp trees over bk trees; maybe someone else can answer? Perhaps there is no significant difference.
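For the curious, here is a minimal vantage point tree over an arbitrary metric. This is the generic textbook construction, not the paper's implementation; dist can be any function satisfying the metric axioms above, such as the matching-based graph distance.

```python
import random
from statistics import median

class VPTree:
    """Minimal vantage point tree; dist must satisfy the metric axioms."""

    def __init__(self, points, dist):
        self.dist = dist
        i = random.randrange(len(points))
        self.point = points[i]                      # the vantage point
        rest = points[:i] + points[i + 1:]
        self.radius, self.inside, self.outside = 0.0, None, None
        if not rest:
            return
        dists = [dist(self.point, p) for p in rest]
        self.radius = median(dists)                 # split at the median distance
        inner = [p for p, d in zip(rest, dists) if d <= self.radius]
        outer = [p for p, d in zip(rest, dists) if d > self.radius]
        self.inside = VPTree(inner, dist) if inner else None
        self.outside = VPTree(outer, dist) if outer else None

    def neighbours(self, q, tau):
        """Return every stored point within distance tau of q."""
        found = []
        d = self.dist(q, self.point)
        if d <= tau:
            found.append(self.point)
        # The triangle inequality rules out partitions that cannot
        # contain any point within tau of q.
        if self.inside and d - tau <= self.radius:
            found += self.inside.neighbours(q, tau)
        if self.outside and d + tau >= self.radius:
            found += self.outside.neighbours(q, tau)
        return found

# Example over a one-dimensional metric, |x - y|:
tree = VPTree([1, 5, 8, 9, 14, 20], dist=lambda x, y: abs(x - y))
print(sorted(tree.neighbours(7, tau=2)))  # [5, 8, 9]
```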
The novelty here, then, is using metric trees to speed up the similarity search.
The final novel contribution of the Symantec paper is an ordering of the malware by specific features, such as the number of instructions each sample has. The database is arranged as a b+tree, with vp tree structures kept at the nodes. Then, when indexing a malware sample, they can cull all malware that is not reasonably close to the features of the input binary before searching each vp tree in the b+tree buckets. This is a pretty simple optimisation, which I think can be improved beyond what was demonstrated in the paper, but Symantec still did a good job with what they did.
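Here's a sketch of that culling idea under some assumptions of mine: bucket keys are instruction counts kept sorted (a plain list standing in for the b+tree), and only buckets whose key falls near the query's instruction count would have their vp trees searched. The sample names and the tolerance are hypothetical.

```python
import bisect

# Hypothetical index: sorted bucket keys (instruction counts), each
# mapping to the samples whose binaries have roughly that many
# instructions. In the paper each bucket would hold a vp tree; plain
# lists stand in here to keep the sketch self-contained.
bucket_keys = [1200, 3400, 5600, 9100, 15000]
buckets = {k: [f"sample_{k}_{i}" for i in range(3)] for k in bucket_keys}

def candidate_buckets(query_insns: int, tolerance: int):
    """Cull buckets whose instruction count is too far from the
    query's, via binary search over the sorted keys (a stand-in for
    the b+tree range scan)."""
    lo = bisect.bisect_left(bucket_keys, query_insns - tolerance)
    hi = bisect.bisect_right(bucket_keys, query_insns + tolerance)
    return [buckets[k] for k in bucket_keys[lo:hi]]

# A query binary with ~5500 instructions only needs the surviving
# buckets searched, not the whole database.
print(candidate_buckets(5500, tolerance=1000))  # only the 5600 bucket
```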
This is a nice paper overall that improves the state of the art in malware indexing. It’s not a revolutionary paper, really, but each area has contributions that improve on what has previously been investigated. The time it takes to classify a sample is interesting: about 100 seconds against a database of 100k malware. They want to scale it up to a million malware samples.