If you're not talking about the output of a compiler when you're talking
about analyzing binaries, I'm not sure what you're referring to... Even a
human being writing raw assembly is a bad compiler :P

Also, only our C++ analysis uses debug symbols for assistance. (Considerable
work has been done on not requiring them, but getting funding and time to
build a generalized decompiler for C++ doesn't seem to have the return on
investment that one might hope for.)

All the other languages seem to come along just fine without them. We
require the upload of debug symbols from our customers to ensure that they
actually do have the code to the program they're analyzing. At the same
time, not everything we analyze (C++ included!) has complete debug symbols,
so we have to get along without them for a good portion of the program's
analysis. (Often, at least half of the enterprise programs we look at
include some commercial libraries and are submitted without debug symbols
for those parts.)

Anyway, it's safe to assume that a compiler will make certain choices about
layout. The trick lies in figuring out which assumptions are safe to make
and which are not. Knowing what the compiler must have known at a certain
point in the process lets you get at that. Things like 'calling conventions'
are valid in certain states (external function calls, for example, but not
internal calls that may have been optimized), and compilers are/should-be
deterministic in their output :) Things like 'how vtables are laid out' and
'how structures are laid out' don't really vary, even when code is
optimized. Things like 'base pointer usage' cannot be relied upon, but
'stack pointer usage' is baked into the notion of a 'call stack'. Getting
things like 'varargs detection' right is a giant pain in the rump, but it's
doable. (What does it look like when you call a varargs function, passing in
as an argument the return value from another varargs function? boggle)
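
To make the 'how structures are laid out' point concrete, here is a minimal
sketch in Python. It assumes that ctypes follows the platform C compiler's
default layout rules, and the Packet structure and natural_layout helper are
hypothetical names invented for this illustration (this is not how our
system does it). It predicts field offsets from size and alignment alone and
compares them with what ctypes reports:

```python
import ctypes


class Packet(ctypes.Structure):
    # Hypothetical structure, used only for illustration.
    _fields_ = [
        ("tag", ctypes.c_char),
        ("length", ctypes.c_uint32),
        ("flags", ctypes.c_uint16),
    ]


def natural_layout(fields):
    """Predict offsets with the usual C rule: each field is placed at the
    next multiple of its own alignment, and the total size is padded up to
    the largest alignment seen."""
    offset = 0
    max_align = 1
    offsets = {}
    for name, ctype in fields:
        align = ctypes.alignment(ctype)
        offset = (offset + align - 1) // align * align  # round up
        offsets[name] = offset
        offset += ctypes.sizeof(ctype)
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


predicted_offsets, predicted_size = natural_layout(Packet._fields_)
actual_offsets = {name: getattr(Packet, name).offset
                  for name, _ in Packet._fields_}
actual_size = ctypes.sizeof(Packet)
```

If the prediction matches, the layout is recoverable from field types alone.
That is the kind of assumption that stays safe regardless of optimization
level, unlike, say, base pointer usage.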

Regardless, the high-level model, once recovered, needs to be complete
enough to be searchable in a meaningful way, and that's really where your
original question gets started. So the problem should be broken down
into "recovering a high level model" from the binary, and "searching that
model for something meaningful". Crossing the streams there is the path to
madness.
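
As a rough sketch of that split (a toy, not our implementation): below, the
"recovered model" is simply hand-written as a list of straight-line
instructions, and the "search" walks backwards from calls to a network sink
to the constants that feed its argument. The Instr type, the op names, the
"connect" sink, and the host string are all hypothetical and chosen only for
illustration. A real model needs control flow, aliasing, and much more.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[str]      # name defined by this instruction, if any
    op: str                  # "const", "copy", "concat", or "call"
    args: Tuple[str, ...]    # operands; for "call", args[0] is the callee


# Phase 1 stand-in: pretend this model was recovered from a binary.
MODEL = [
    Instr("t0", "const", ("c2.example",)),
    Instr("t1", "const", (":443",)),
    Instr("t2", "concat", ("t0", "t1")),
    Instr("t3", "copy", ("t2",)),
    Instr(None, "call", ("connect", "t3")),
]


def resolve(model, index, name):
    """Walk backwards from instruction `index` to the value of `name`.
    Returns None when the value cannot be determined."""
    for i in range(index - 1, -1, -1):
        ins = model[i]
        if ins.dest != name:
            continue
        if ins.op == "const":
            return ins.args[0]
        if ins.op == "copy":
            return resolve(model, i, ins.args[0])
        if ins.op == "concat":
            parts = [resolve(model, i, a) for a in ins.args]
            return None if None in parts else "".join(parts)
        return None  # unknown operation: give up rather than guess
    return None  # no definition found (e.g. a program input)


def network_destinations(model, sinks=("connect",)):
    """Phase 2: search the model for sink calls and resolve their argument."""
    results = []
    for i, ins in enumerate(model):
        if ins.op == "call" and ins.args[0] in sinks:
            results.append((ins.args[0], resolve(model, i, ins.args[1])))
    return results


destinations = network_destinations(MODEL)
```

Note that the search code never touches bytes or registers. It only sees the
model, so the two halves can be built and tested separately.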

Anyway, those two problems have very different profiles. Which are you more
interested in tackling? :)

--chris


On Sun, Mar 8, 2015 at 10:47 AM, Andrew <mu...@mimisbrunnr.net> wrote:

> Your system is more "compiler output analysis" though, because you make
> assumptions about code/data layout as generated by a compiler, and rely
> on the presence of debug information, right? I seem to recall this was
> the case a few years ago but have not been keeping up to date with what
> your system has been doing since.
>
> On 03/08/2015 01:43 PM, Christien Rioux wrote:
> >
> > Veracode does this all day.
> >
> > We've been doing static binary analysis, based on a combined data flow
> > and control flow model, and solving the problem you're discussing using
> > a combination of demand pointer analysis and taint tracking.
> >
> > I've got some slides on the Veracode binary modeling system here:
> >
> > https://dl.dropboxusercontent.com/u/458169/Lessons%20Of%20Binary%20Analysis-SEATTLE.pptx
> >
> > This doesn't cover the actual 'DPA' or taint tracking, but rather how to
> > effectively model the binary in the first place. These days, a number of
> > other binary modeling systems exist, but they do not necessarily cover
> > all the things you've stated as problems.
> >
> > Feel free to message privately if you want to know more, or if you think
> > the questions are langsec-related specifically, reply here :)
> >
> > Cheers,
> >
> > --chris
> >
> > On Sun, Mar 8, 2015 at 9:50 AM, Andrew <mu...@mimisbrunnr.net
> > <mailto:mu...@mimisbrunnr.net>> wrote:
> >
> >     This problem is probably considered hard because many have tried and
> >     failed (myself included, yay), or at least have not achieved a close
> >     enough approximation to success as to be satisfying.
> >
> >     1.
> >       a. The work I like the most is this system for doing this
> >     analysis on Android bytecode for malware analysis:
> >     http://matt.might.net/papers/liang2013anadroid2.pdf . It might seem
> >     "theory heavy" but they have empirical evaluations that are quite
> >     good.
> >
> >       b. The Jakstab platform also kind of tackles this, read their
> >     paper set here and check out their code:
> >     http://www.jakstab.org/documentation
> >
> >       c. The closest we got to publishing our work on this system was to
> >     patent it, it isn't the best reading but it is here:
> >     http://www.google.com/patents/US8533836
> >
> >     2.
> >       If I were going to start this again, I'd build on top of mcsema
> >     (https://github.com/trailofbits/mcsema) or bap
> >     (https://github.com/BinaryAnalysisPlatform/bap) and use backwards
> >     symbolic execution from call sites to network functions. This will
> >     fail for a number of reasons:
> >       * Initially, control flow analysis. Neither bap nor mcsema will be
> >     able to deal with malware and self modifying code. You'd probably
> >     wind up building a control flow translation system on the front end
> >     of the pipeline to do unpacking, either via concrete or symbolic
> >     execution, and storing a new control flow graph as a state-aware
> >     control flow graph (addresses are now tuples of address and time to
> >     represent the mutability of the program). Then things would proceed
> >     as normal
> >       * You'll discover that there will be a lot of the program involved
> >     in backwards SE that does not contribute to the answer to the
> >     original question. You might try and solve this by computing data
> >     and control dependence and performing slicing, but then you will
> >     find that...
> >       * ... the ability to do this analysis hinges on the precision of
> >     your alias analysis. You'll try and implement VSA and cry a lot.
> >     Even once correctly implemented, VSA might diverge during analysis.
> >       * The problem will also nest like a Matryoshka doll, programs will
> >     compute the location of the get address / network connect functions
> >     by, on a good day, using facilities like dlopen/GetProcAddress and
> >     on a bad day by walking data structures in memory. Even identifying
> >     the locations where the calls occur will become its own problem
> >     equal in complexity to identifying the set of possible arguments.
> >     It's a cycle of violence and poverty with no end.
> >       * You'll discover that existing symbolic models of operating
> >     systems are totally inadequate. There will be calls to functions
> >     that change the way the program interacts with the outside world in
> >     fundamental ways and you'll need to figure out how to represent them
> >     (what is the symbolic effect of LoadLibrary? MapViewOfSection?
> >     ReadProcessMemory? ugh)
> >       * Maybe we could use dynamic execution? Ugh that's cheating though!
> >     Also what appliances like FireEye and open sandboxes like cuckoo
> >     already do. Oh well, maybe that's good enough? I'll go buy a FireEye.
> >
> >     On 03/08/2015 06:01 AM, john.wil...@pinfosec.com
> >     <mailto:john.wil...@pinfosec.com> wrote:
> >     > Greetings,
> >     >
> >     > I have lurked here long enough. I love this list and the
> >     > perspective it brings.
> >     >
> >     > I have a project idea of performing application security
> >     > assessments on binaries of unknown or questionable origin using
> >     > one specific objective:
> >     >
> >     > Determining where in the code network calls are performed, then
> >     > tracing back through the code to identify the destination address
> >     > (hostname, IP, or other).
> >     >
> >     > To me it seems that this is of most value, as any malware intent
> >     > upon stealing data or being part of a botnet must communicate via
> >     > the network at some point. Surely there are other innovative
> >     > methods of communicating, but I am focused on the network
> >     > connection.
> >     >
> >     > Some of my security colleagues say that what I want to do is "too
> >     > hard". To me, this translates to:
> >     >
> >     >   * It is an important problem to solve
> >     >   * Hard problems are best solved by first properly characterizing
> >     >     the problem
> >     >   * Once a hard problem is properly characterized, then solving it
> >     >     becomes much easier
> >     >
> >     > While not directly related to language parsing, this list would
> >     > seem to best understand my perspective on the problem.
> >     >
> >     > Assume that the binary is capable of being reversed.
> >     >
> >     > This brings me to my questions for the list:
> >     > 1. Are you aware of anyone else that has tried to do this? If so,
> >     > where can I find details?
> >     > 2. Do you have any suggestions on where to start or how to go about
> >     > properly modeling this problem?
> >     > 3. Does anyone have the expertise and interest in pursuing such a
> >     > project?
> >     >
> >     > I would start with various tools that facilitate analysis using
> >     > intermediate representation and control flow graph data...
> >     >
> >     > Thanks,
> >     > John
> >     > LinkedIn.com <http://LinkedIn.com>\in\johnmwillis
> >     >
> >     >
> >     >
> >     >
> >     >
> >     > _______________________________________________
> >     > langsec-discuss mailing list
> >     > langsec-discuss@mail.langsec.org
> >     <mailto:langsec-discuss@mail.langsec.org>
> >     > https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
> >     >
> >
> >
