Ah, but to your #2, this is the entire point of the system that the OP
proposes. I'm going to assume that if a program is analyzed by this
system we already know it is malware, and we want to know more (well
usually it is the customer that wants to know more, a quicker path to
success might be to drop the customer...)

There are almost certainly advanced things you can do in the face of
complicated hostname generation procedures - using counterexample driven
abstraction refinement to automatically synthesize a DGA from a binary
would be fun (but I wouldn't be surprised if someone has done or tried
this before). Perhaps these are inherently academic in nature, but
perhaps not? I'm not sure. I've known very practically minded people to
value information like that before, and techniques to assist in the
acquisition of that information could also be of help.

One thing we tried to do with mcsema was to leave the door open to
interactive approaches as much as possible. We have some integration
with IDA, in that IDAPython scripts can extract information about the
program from IDA and pass it to mcsema which will then translate that
program into LLVM. I believe similar types of cooperation are possible
now with bap and IDA. Also, Axel's group at Munich open-source released
their bindead binary analysis platform:
https://bitbucket.org/mihaila/bindead/wiki/Home with their GUI frontend
you could also bridge some between a human analyst and mechanical static
analysis.

On 03/08/2015 03:23 PM, Christien Rioux wrote:
> 
> 0. Neat example with the data-driven pointer/value abstraction. Another
> example of something similar is simply determining if the instruction
> you're looking at is encoded as ARM or THUMB, which is decoded based on
> the setting of a bit in the CPU, that some other instruction can change
> at any point. 
> 
> The question of whether or not the compiler in question (or the
> interpreting system) would allow something that is a 'pointer' or an
> 'integer' at the same instruction (or code path), is required knowledge
> to interpret this one. If it's totally data driven, then yea, you'd have
> to consider a type-voting mechanism to determine the likelihood that
> something at a particular reference is a 'value, a pointer, or either,
> or unknown'. My guess is semantics around the propagative path to the
> point of reference in many cases could determine if something is a value
> or pointer or both without an exhaustive program search many times.
> Avoiding 'exhautive searches' at all costs and simply propagating
> 'unknown' at the point of resource exhaustion is practical here...
> 
> 1. Our taint tracking is bidirectional and (given a useful slice)
> propagated all over the place. Taint 'sources' are propagated forward,
> and and taint 'sinks' are propagated backward. It does get tricky doing
> this meaningfully over a wide berth of interprocedural activity, and you
> do have to consider SMT-style path conditional requirements (what things
> are known true to get this value over this path, reachability given
> dataflow contraints) but generally it does help with the 'does untrusted
> input reach something that requires trusted input' question.
> 
> 2. If something is not calculable in this process, due to some kind of
> funky 'obfuscation', typically we find that it's malware. The very fact
> that a 'connect' call is being made to something that isn't obvious, or
> comes from an obvious source, but rather some lengthy calculation, is
> usually a malware signature in its own right, at which point, it's an
> academic question of whether or not you care what it is 'actually doing'
> rather than simply the fact that you really don't know what it is doing. 
> 
> Cheers,
> 
> --chris
> 
> On Sun, Mar 8, 2015 at 11:45 AM, Andrew <mu...@mimisbrunnr.net
> <mailto:mu...@mimisbrunnr.net>> wrote:
> 
>     In malware, sometimes assumptions about these conventions hold, and
>     sometimes they don't :( It can make recovery of a high level model
>     challenging. You're definitely right about breaking the problem down
>     into a high level model and then analysis of that high level model. I
>     still think that the problem you have to solve to analyze the high level
>     model requires the same technique as to generate the high level model.
>     So my sense is that in general, this type of analysis is repeated
>     iteration of an analysis routine towards convergence on a fixed point,
>     but there is no upper bound on the number of iterations and this is
>     where sadness enters. That might be too pessimistic.
> 
>     One example of challenges in recovery is analyzing native haskell or
>     ocaml code. Once you know the model it gets "real world easy" since
>     there are a limited number of ocaml native compilers, but until you
>     incorporate that things make no sense. The runtime that the code
>     generator creates encodes pointers and integers by tagging the low bit.
>     If the low bit is set, the value is a pointer. So the nil pointer is the
>     constant 1, not the constant 0, which can make certain type of inference
>     operations more interesting, because iterative list traversals now test
>     cur = 1 instead of cur = 0. Once you know the trick this is less
>     challenging, but with "binary" analysis, especially adversarial binary
>     analysis, there can be a bottomless abyss of these tricks for you to
>     discover one by one.
> 
>     Other languages use CPS-style code generation that is nothing like code
>     generated by C compilers (at least compiled ocaml is generally quite
>     imperative). For example, a program where everything you would think of
>     as call or return is implemented via "jmp [eax]".
> 
>     So producing this high level model for arbitrary programs is, I think,
>     quite challenging and is something that I've been working on, with
>     varying degrees of success. There are other problems too, like
>     adversarial compilers using undefined flag values for control decisions
>     and basing program logic off of undefined or undocumented CPU behavior.
>     These are things that "compilers" don't do, but "binaries" can do.
> 
>     I'm also not sure that taint tracking is the tool to solve the problem
>     from the OP, at least in a forward direction (which is usually how I've
>     seen taint tracking expressed? I guess you could have backwards taint
>     tracking). You want to produce a set of possible values for a given call
>     to connect() (and probably also link those to calls to gethostbyname()
>     and produce a set of possible values for calls to that function as
>     well). So what values do you taint?
> 
>     The string content for the calls to gethostbyname() will almost
>     certainly, in the adversarial case, be obfuscated, often in ways that
>     are not as simple to analyze as
> 
>     char *thec2host = deobfuscate(obfuscated_blob);
> 
>     where deobfuscate : char * -> char * and is pure. So some degree of
>     symbolic analysis and simplification will be required to provide answers
>     to the user about what the actual value of the string being passed is.
> 
>     I think that in my original assessment I was too pessimistic, possibly
>     from remembering the years of my life that have gone into tackling this
>     problem so far. I think that progress could definitely be made now on
>     crossing a 50% threshold using static analysis techniques, mostly
>     backwards SE and slicing. Going further than that will suck though, and
>     theoretical results tell us that once the adversary becomes wise to this
>     technique they have access to an infinite amount of cleverness to attack
>     the system with, and the system won't have a generic way to respond.
>     This is also a path to {sad|mad}ness.
> 
>     On 03/08/2015 02:17 PM, Christien Rioux wrote:
>     > If you're not talking about the output of a compiler when you're
>     talking
>     > about analyzing binaries, I'm not sure what you're referring to...
>     Even
>     > a human being writing raw assembly is a bad compiler :P
>     >
>     > Also, only our C++ analysis uses debug symbols for assistance
>     (although
>     > considerable work has been done on not requiring them, getting funding
>     > and time to build a generalized decompiler for C++ isn't something
>     that
>     > seems to have the return on investment that one might hope.)
>     >
>     > All the other languages seem to come along just fine without them. We
>     > require the upload of debug symbols from our customers to ensure that
>     > they actually do have the code to the program they're analyzing.
>     At the
>     > same time, not everything we analyze (C++ included!) has complete
>     debug
>     > symbols, so we have to get along without them for a good portion
>     of the
>     > program's analysis. (Many times, at least half of the enterprise
>     > programs we look at have some commercial libraries and are submitted
>     > without debug symbols for these parts).
>     >
>     > Anyway, it's safe to assume that a compiler will make certain choices
>     > about layout. The trick lies in figuring out what assumptions are safe
>     > to make and which ones are not. Knowing what the compiler must have
>     > known at a certain point in the process lets you get at that. Things
>     > like 'calling conventions' are valid in certain states (external
>     > function calls for example, but not internal calls that are possibly
>     > optimized), and compilers are/should-be deterministic in their
>     output :)
>     > Things like 'how vtables are laid out' and 'how structures are
>     laid out'
>     > doesn't really vary, even when code is optimized. Things like 'base
>     > pointer usage' can not be relied upon, but 'stack pointer usage' is
>     > baked into the notion of a 'call stack'. Getting things like 'varargs
>     > detection' right is a giant pain in the rump but it's doable,  (what
>     > does it look like you call a varargs function, passing in as an
>     argument
>     > the return value from another varargs function? boggle)
>     >
>     > Regardless, the high-level model, once recovered, needs to be complete
>     > enough to be searchable in a meaningful way, and that's really where
>     > your original question gets started. So the problem should to be
>     broken
>     > down into "recovering a high level model" from the binary, and
>     > "searching that model for something meaningful". Crossing the streams
>     > there is the path to madness.
>     >
>     > Anyway, those two problems have very different profiles, which are you
>     > more interested in tackling ? :)
>     >
>     > --chris
>     >
>     >
>     > On Sun, Mar 8, 2015 at 10:47 AM, Andrew <mu...@mimisbrunnr.net
>     <mailto:mu...@mimisbrunnr.net>
>     > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>> wrote:
>     >
>     >     Your system is more "compiler output analysis" though, because you 
> make
>     >     assumptions about code/data layout as generated by a compiler, and 
> rely
>     >     on the presence of debug information, right? I seem to recall this 
> was
>     >     the case a few years ago but have not been keeping up to date with 
> what
>     >     your system has been doing since.
>     >
>     >     On 03/08/2015 01:43 PM, Christien Rioux wrote:
>     >     >
>     >     > Veracode does this all day.
>     >     >
>     >     > We've been doing static binary analysis, based on a combined data 
> flow
>     >     > and control flow model, and solving the problem you're discussing 
> using
>     >     > a combination of demand pointer analysis and taint tracking.
>     >     >
>     >     > I've got some slides on the Veracode binary modeling system here:
>     >     > 
> https://dl.dropboxusercontent.com/u/458169/Lessons%20Of%20Binary%20Analysis-SEATTLE.pptx
>     >     >
>     >     > This doesn't cover the actual 'DPA' or taint tracking, but rather 
> how to
>     >     > effectively model the binary in the first place. These days, a 
> number of
>     >     > other binary modeling systems exist, but they do not necessarily 
> cover
>     >     > all the things you've stated as problems.
>     >     >
>     >     > Feel free to message privately if you want to know more, or if 
> you think
>     >     > the questions are langsec-related specifically, reply here :)
>     >     >
>     >     > Cheers,
>     >     >
>     >     > --chris
>     >     >
>     >     > On Sun, Mar 8, 2015 at 9:50 AM, Andrew <mu...@mimisbrunnr.net 
> <mailto:mu...@mimisbrunnr.net>
>     <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>
>     >     > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>
>     <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>>> wrote:
>     >     >
>     >     >     This problem is probably considered hard because many have
>     >     tried and
>     >     >     failed (myself included, yay), or at least have not
>     achieved a
>     >     close
>     >     >     enough approximation to success as to be satisfying.
>     >     >
>     >     >     1.
>     >     >       a. The work I like the most is this system for doing this
>     >     analysis on
>     >     >     Android bytecode for malware analysis:
>     >     >     http://matt.might.net/papers/liang2013anadroid2.pdf . It
>     might
>     >     seem
>     >     >     "theory heavy" but they have empirical evaluations that are
>     >     quite good.
>     >     >
>     >     >       b. The Jakstab platform also kind of tackles this, read
>     >     their paper
>     >     >     set here and check out their code:
>     >     http://www.jakstab.org/documentation
>     >     >
>     >     >       c. The closest we got to publishing our work on this
>     system
>     >     was to
>     >     >     patent it, it isn't the best reading but it is here:
>     >     >     http://www.google.com/patents/US8533836
>     >     >
>     >     >     2.
>     >     >       If I were going to start this again, I'd build on top
>     of mcsema
>     >     >     (https://github.com/trailofbits/mcsema) or bap
>     >     >     (https://github.com/BinaryAnalysisPlatform/bap) and use
>     backwards
>     >     >     symbolic execution from call sites to network functions.
>     This
>     >     will fail
>     >     >     for a number of reasons:
>     >     >       * Initially, control flow analysis. Neither bap nor mcsema
>     >     will be
>     >     >     able to deal with malware and self modifying code. You'd
>     >     probably wind
>     >     >     up building a control flow translation system on the
>     front end
>     >     of the
>     >     >     pipeline to do unpacking, either via concrete or symbolic
>     >     execution, and
>     >     >     storing a new control flow graph as a state-aware
>     control flow
>     >     graph
>     >     >     (addresses are now tuples of address and time to
>     represent the
>     >     >     mutability of the program). Then things would proceed as
>     normal
>     >     >       * You'll discover that there will be a lot of the program
>     >     involved in
>     >     >     backwards SE that does not contribute to the answer to the
>     >     original
>     >     >     question. You might try and solve this by computing data and
>     >     control
>     >     >     dependence and performing slicing, but then you will
>     find that...
>     >     >       * ... the ability to do this analysis hinges on the
>     >     precision of your
>     >     >     alias analysis. You'll try and implement VSA and cry a lot.
>     >     Even once
>     >     >     correctly implemented, VSA might diverge during analysis.
>     >     >       * The problem will also nest like a Matryoshka doll,
>     >     programs will
>     >     >     compute the location of the get address / network connect
>     >     functions by,
>     >     >     on a good day, using facilities like
>     dlopen/GetProcAddress and
>     >     on a bad
>     >     >     day by walking data structures in memory. Even
>     identifying the
>     >     locations
>     >     >     where the calls occur will become its own problem equal in
>     >     complexity to
>     >     >     identifying the set of possible arguments. It's a cycle of
>     >     violence and
>     >     >     poverty with no end.
>     >     >      * You'll discover that existing symbolic models of
>     operating
>     >     systems
>     >     >     are totally inadequate. There will be calls to functions
>     that
>     >     change the
>     >     >     way the program interacts with the outside world in
>     >     fundamental ways and
>     >     >     you'll need to figure out how to represent them (what is the
>     >     symbolic
>     >     >     effect of LoadLibrary? MapViewOfSection?
>     ReadProcessMemory? ugh)
>     >     >      * Maybe we could use dynamic execution? Ugh that's cheating
>     >     though!
>     >     >     Also what appliances like FireEye and open sandboxes like
>     >     cuckoo already
>     >     >     do. Oh well, maybe that's good enough? I'll go buy a
>     FireEye.
>     >     >
>     >     >     On 03/08/2015 06:01 AM, john.wil...@pinfosec.com
>     <mailto:john.wil...@pinfosec.com>
>     >     <mailto:john.wil...@pinfosec.com
>     <mailto:john.wil...@pinfosec.com>>
>     >     >     <mailto:john.wil...@pinfosec.com
>     <mailto:john.wil...@pinfosec.com>
>     >     <mailto:john.wil...@pinfosec.com
>     <mailto:john.wil...@pinfosec.com>>> wrote:
>     >     >     > Greetings,
>     >     >     >
>     >     >     > I have lurked here long enough. I love this list and the
>     >     perspective it
>     >     >     > brings.
>     >     >     >
>     >     >     > I have a project idea of performing application security
>     >     assessments on
>     >     >     > binaries of unknown or questionable origin using one
>     >     specific objective:
>     >     >     >
>     >     >     > Determining where in the code network calls are performed,
>     >     then tracing
>     >     >     > back through the code to identify the destination address
>     >     (hostname, IP,
>     >     >     > or other).
>     >     >     >
>     >     >     > To me it seems that this is of most value, as any malware
>     >     intent upon
>     >     >     > stealing data or being part of a botnet must
>     communicate via
>     >     the network
>     >     >     > at some point. Surely there are other innovative
>     methods of
>     >     >     > communicating, but I am focused on the network connection.
>     >     >     >
>     >     >     > Some of my security colleagues say that what I want to
>     do is
>     >     "too hard".
>     >     >     > To me, this translates to:
>     >     >     >
>     >     >     >   * It is an important problem to solve
>     >     >     >   * Hard problems are best solved by first properly
>     >     characterizing the
>     >     >     >     problem
>     >     >     >   * Once a hard problem is properly characterized, then
>     >     solving it
>     >     >     >     becomes much easier
>     >     >     >
>     >     >     > While not directly related to language parsing, this list
>     >     would seem to
>     >     >     > best understand my perspective on the problem.
>     >     >     >
>     >     >     > Assume that the binary is capable of being reversed.
>     >     >     >
>     >     >     > This brings me to my questions for the list:
>     >     >     > 1. Are you aware of anyone else that has tried to do this?
>     >     If so, where
>     >     >     > can I find details?
>     >     >     > 2. Do you have any suggestions on where to start or how to
>     >     go about
>     >     >     > properly modeling this problem?
>     >     >     > 3. Does anyone have the expertise and interest in pursuing
>     >     such a project?
>     >     >     >
>     >     >     > I would start with various tools that facilitate
>     analysis using
>     >     >     > intermediate representation and control flow graph data...
>     >     >     >
>     >     >     > Thanks,
>     >     >     > John
>     >     >     > LinkedIn.com <http://LinkedIn.com>\in\johnmwillis
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     >
>     >     >     > _______________________________________________
>     >     >     > langsec-discuss mailing list
>     >     >     > langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>
>     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>>
>     >     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>
>     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>>>
>     >     >     > 
> https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
>     >     >     >
>     >     >     _______________________________________________
>     >     >     langsec-discuss mailing list
>     >     >     langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>
>     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>>
>     >     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>
>     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>>>
>     >     >   
>      https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
>     >     >
>     >     >
>     >     _______________________________________________
>     >     langsec-discuss mailing list
>     >     langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>
>     >     <mailto:langsec-discuss@mail.langsec.org
>     <mailto:langsec-discuss@mail.langsec.org>>
>     >     https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
>     >
>     >
> 
> 
_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss

Reply via email to