Veracode does this all day.

We've been doing static binary analysis based on a combined data flow and
control flow model, and we solve the problem you're describing with a
combination of demand pointer analysis (DPA) and taint tracking.
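
To give a flavor of the idea (a toy sketch only, not how any production
engine works; the mini-IR, the opcode names, and the helper functions below
are all invented for illustration), tracing the first argument of a
connect() call back to the values it may derive from can look like this:

```python
from collections import namedtuple

# Hypothetical three-address IR: dst = op(srcs...). Not any real tool's format.
# For "const", srcs holds the literal value; for other ops, variable names.
Instr = namedtuple("Instr", "dst op srcs")

PROGRAM = [
    Instr("a", "const", ("evil.example.com",)),
    Instr("b", "const", ("10.0.0.1",)),
    Instr("c", "input", ()),               # value read from outside the program
    Instr("host", "phi", ("a", "b")),      # host is a or b, depending on the path
    Instr("port", "const", (443,)),
    Instr("buf", "concat", ("c", "a")),
    Instr(None, "call:connect", ("host", "port")),
]


def trace_sources(program, var):
    """Walk definitions backwards from `var` and return the constants and
    external inputs it may derive from (flow-insensitive, no aliasing)."""
    defs = {}
    for ins in program:
        if ins.dst is not None:
            defs.setdefault(ins.dst, []).append(ins)

    found, seen, work = set(), set(), [var]
    while work:
        v = work.pop()
        if v in seen:
            continue
        seen.add(v)
        for ins in defs.get(v, []):
            if ins.op == "const":
                found.add(("const", ins.srcs[0]))
            elif ins.op == "input":
                found.add(("input", ins.dst))
            else:
                work.extend(ins.srcs)  # keep following the operands
    return found


def connect_destinations(program):
    """For every connect() call site, collect the possible sources of its
    first argument (the destination)."""
    return [
        trace_sources(program, ins.srcs[0])
        for ins in program
        if ins.op == "call:connect"
    ]
```

Here connect_destinations(PROGRAM) reports both constant hosts for the one
call site. Everything that makes the real problem hard (memory, aliasing,
indirect calls) is deliberately left out of the toy.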

I've got some slides on the Veracode binary modeling system here:
https://dl.dropboxusercontent.com/u/458169/Lessons%20Of%20Binary%20Analysis-SEATTLE.pptx

This doesn't cover the actual 'DPA' or taint tracking, but rather how to
effectively model the binary in the first place. These days, a number of
other binary modeling systems exist, but they do not necessarily cover all
the things you've stated as problems.

Feel free to message me privately if you want to know more, or, if you
think the questions are specifically langsec-related, reply here :)

Cheers,

--chris

On Sun, Mar 8, 2015 at 9:50 AM, Andrew <mu...@mimisbrunnr.net> wrote:

> This problem is probably considered hard because many have tried and
> failed (myself included, yay), or at least have not achieved a close
> enough approximation to success as to be satisfying.
>
> 1.
>   a. The work I like the most is this system for doing this analysis on
> Android bytecode for malware analysis:
> http://matt.might.net/papers/liang2013anadroid2.pdf . It might seem
> "theory heavy" but they have empirical evaluations that are quite good.
>
>   b. The Jakstab platform also kind of tackles this. Read their papers
> and check out their code here: http://www.jakstab.org/documentation
>
>   c. The closest we got to publishing our work on this system was to
> patent it. It isn't the best reading, but it is here:
> http://www.google.com/patents/US8533836
>
> 2.
>   If I were going to start this again, I'd build on top of mcsema
> (https://github.com/trailofbits/mcsema) or bap
> (https://github.com/BinaryAnalysisPlatform/bap) and use backwards
> symbolic execution from call sites to network functions. This will fail
> for a number of reasons:
>   * First, control flow analysis. Neither bap nor mcsema will be
> able to deal with malware and self-modifying code. You'd probably wind
> up building a control flow translation system on the front end of the
> pipeline to do unpacking, either via concrete or symbolic execution, and
> storing the new control flow graph as a state-aware control flow graph
> (addresses are now tuples of address and time, to represent the
> mutability of the program). Then things would proceed as normal.
>   * You'll discover that there will be a lot of the program involved in
> backwards SE that does not contribute to the answer to the original
> question. You might try and solve this by computing data and control
> dependence and performing slicing, but then you will find that...
>   * ... the ability to do this analysis hinges on the precision of your
> alias analysis. You'll try to implement VSA (value-set analysis) and cry
> a lot. Even once correctly implemented, VSA might diverge during analysis.
>   * The problem will also nest like a Matryoshka doll: programs will
> compute the location of the get address / network connect functions, on
> a good day by using facilities like dlopen/GetProcAddress, and on a bad
> day by walking data structures in memory. Even identifying the locations
> where the calls occur will become its own problem, equal in complexity to
> identifying the set of possible arguments. It's a cycle of violence and
> poverty with no end.
>   * You'll discover that existing symbolic models of operating systems
> are totally inadequate. There will be calls to functions that change the
> way the program interacts with the outside world in fundamental ways, and
> you'll need to figure out how to represent them (what is the symbolic
> effect of LoadLibrary? MapViewOfSection? ReadProcessMemory? ugh)
>   * Maybe we could use dynamic execution? Ugh, that's cheating though!
> It's also what appliances like FireEye and open sandboxes like Cuckoo
> already do. Oh well, maybe that's good enough? I'll go buy a FireEye.
>
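
One way to picture the "tuples of address and time" idea from the first
bullet above, as a toy sketch only: the class, the trace format, and the
rule of bumping a single global epoch on every write into code are
assumptions made up for this example, not how bap, mcsema, or any real
unpacker represents things.

```python
from collections import defaultdict


class StateAwareCFG:
    """CFG whose nodes are (address, epoch) pairs instead of bare addresses.

    The epoch is a crude stand-in for "time": it increases whenever the
    program overwrites its own code, so the same address can appear as
    several distinct nodes with different instructions.
    """

    def __init__(self):
        self.epoch = 0
        self.edges = defaultdict(set)  # (addr, epoch) -> set of successor nodes
        self.code = {}                 # (addr, epoch) -> instruction text

    def record_exec(self, addr, text, prev=None):
        node = (addr, self.epoch)
        self.code[node] = text
        if prev is not None:
            self.edges[prev].add(node)
        return node

    def record_code_write(self):
        # Code changed: later executions belong to a new epoch.
        self.epoch += 1


def build_from_trace(trace):
    """Build the graph from a hypothetical execution trace made of
    ("exec", addr, text) and ("write", addr) events."""
    cfg = StateAwareCFG()
    prev = None
    for event in trace:
        if event[0] == "write":
            cfg.record_code_write()
        else:
            _, addr, text = event
            prev = cfg.record_exec(addr, text, prev)
    return cfg


# Made-up two-stage unpacker: 0x2000 is rewritten twice and executed twice.
TRACE = [
    ("exec", 0x1000, "decrypt 0x2000"),
    ("write", 0x2000),
    ("exec", 0x1004, "jmp 0x2000"),
    ("exec", 0x2000, "patch 0x2000"),
    ("write", 0x2000),
    ("exec", 0x2004, "jmp 0x2000"),
    ("exec", 0x2000, "call connect"),
]
```

With build_from_trace(TRACE), address 0x2000 shows up as two different
nodes, (0x2000, 1) and (0x2000, 2), so the "call connect" that only exists
after the second rewrite is not confused with the earlier stub. A real
system would track which bytes changed rather than using one global
counter.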
> On 03/08/2015 06:01 AM, john.wil...@pinfosec.com wrote:
> > Greetings,
> >
> > I have lurked here long enough. I love this list and the perspective it
> > brings.
> >
> > I have a project idea of performing application security assessments on
> > binaries of unknown or questionable origin using one specific objective:
> >
> > Determining where in the code network calls are performed, then tracing
> > back through the code to identify the destination address (hostname, IP,
> > or other).
> >
> > To me it seems that this is of most value, as any malware intent upon
> > stealing data or being part of a botnet must communicate via the network
> > at some point. Surely there are other innovative methods of
> > communicating, but I am focused on the network connection.
> >
> > Some of my security colleagues say that what I want to do is "too hard".
> > To me, this translates to:
> >
> >   * It is an important problem to solve
> >   * Hard problems are best solved by first properly characterizing the
> >     problem
> >   * Once a hard problem is properly characterized, then solving it
> >     becomes much easier
> >
> > While not directly related to language parsing, this list would seem to
> > best understand my perspective on the problem.
> >
> > Assume that the binary is capable of being reversed.
> >
> > This brings me to my questions for the list:
> > 1. Are you aware of anyone else that has tried to do this? If so, where
> > can I find details?
> > 2. Do you have any suggestions on where to start or how to go about
> > properly modeling this problem?
> > 3. Does anyone have the expertise and interest in pursuing such a
> > project?
> >
> > I would start with various tools that facilitate analysis using
> > intermediate representation and control flow graph data...
> >
> > Thanks,
> > John
> > LinkedIn.com <http://LinkedIn.com>\in\johnmwillis
> >
> >
> >
> >
> >
> > _______________________________________________
> > langsec-discuss mailing list
> > langsec-discuss@mail.langsec.org
> > https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
> >
>
