It's not that incendiary in my opinion, there's a known (almost Heisenberg-like) tradeoff between scalability and precision of any type of program analysis. One useful part (perhaps the only useful part) about analyzing malware is that you generally commit to analyzing it until you have enough information or you stop caring, so a tactic that could scale would be to give hours or days of CPU time to understanding a sample, because the alternative is to give hours or days of human analyst time.
There are also really good reasons (I think) to want to do large scale property checking, say you want to prove that event handlers in your operating system terminate (possible, been done, cool work) or using type checking to prove absence of XSS-type errors in a web application. Some interesting academic work has been done to walk the tightrope between scale and precision in the SATURN project (http://saturn.stanford.edu/pages/papersindex.html) and they had some good impact/results in finding bugs in the Linux kernel. Of course, since these programs are malicious you could probably create an analysis to answer "can valuable stuff go out of the network hole" just by writing "def analysis(Program): return True". It could be interesting if you could answer questions like: - What is the domain generation algorithm? - How are communications to C2 encrypted? Are they encrypted? Is it symmetric? Can the key be made available? - Does this program refer to other programs or components on the system or network that could implicate other programs or systems in malicious activity? and other kinds of questions that teams of (expensive) incident responders iterate on for hours/days/weeks, in less time at larger scale. However as far as I can tell this is kind of a white whale that many people have spent time chasing. I don't think that you have much hope for good intent characterization, though. And generally we don't seem to try for it. I could get an analysis result for an application that says "this will read off your pictures and store them in a remote server". If it's the Dropbox client, that's fine. The difference I think comes when the capabilities/actions differ from expectations, and thanks to OEM bundling (superfish) it's difficult to build a set of expectations... On 03/08/2015 04:28 PM, Christien Rioux wrote: > Okay, yea , I see your point. > > Once you assume 'yea its malware' and want to know 'so what is it REALLY > doing', you have to toe the line between 'what is it doing' and 'what > does the program 'mean''? At some point, you have to decide when > 'intent' is implied by semantics. That's getting pretty close to 'AI' > rather than 'program analysis'. > > The level of specificity of 'program capability' is another factor. You > could generally say 'yea it accesses the disk in this here part' > without, say, building a model of how it modifies the disk. I'm not yet > at the point in this discussion where I know what level of detail the OP > cares about. Are we modeling down to intent? It sounds like he simply > wanted to know "can valuable stuff go out the network hole". Which might > be answerable with a much simpler test than 'model exactly what this > program is capable of'. > > (Frankly, I think whole program understanding is an anti-pattern of this > kind of analysis, but that's just me being incendiary). > > --chris > > On Sun, Mar 8, 2015 at 1:16 PM, Andrew <mu...@mimisbrunnr.net > <mailto:mu...@mimisbrunnr.net>> wrote: > > Ah, but to your #2, this is the entire point of the system that the OP > proposes. I'm going to assume that if a program is analyzed by this > system we already know it is malware, and we want to know more (well > usually it is the customer that wants to know more, a quicker path to > success might be to drop the customer...) > > There are almost certainly advanced things you can do in the face of > complicated hostname generation procedures - using counterexample driven > abstraction refinement to automatically synthesize a DGA from a binary > would be fun (but I wouldn't be surprised if someone has done or tried > this before). Perhaps these are inherently academic in nature, but > perhaps not? I'm not sure. I've known very practically minded people to > value information like that before, and techniques to assist in the > acquisition of that information could also be of help. > > One thing we tried to do with mcsema was to leave the door open to > interactive approaches as much as possible. We have some integration > with IDA, in that IDAPython scripts can extract information about the > program from IDA and pass it to mcsema which will then translate that > program into LLVM. I believe similar types of cooperation are possible > now with bap and IDA. Also, Axel's group at Munich open-source released > their bindead binary analysis platform: > https://bitbucket.org/mihaila/bindead/wiki/Home with their GUI frontend > you could also bridge some between a human analyst and mechanical static > analysis. > > On 03/08/2015 03:23 PM, Christien Rioux wrote: > > > > 0. Neat example with the data-driven pointer/value abstraction. > Another > > example of something similar is simply determining if the instruction > > you're looking at is encoded as ARM or THUMB, which is decoded > based on > > the setting of a bit in the CPU, that some other instruction can > change > > at any point. > > > > The question of whether or not the compiler in question (or the > > interpreting system) would allow something that is a 'pointer' or an > > 'integer' at the same instruction (or code path), is required > knowledge > > to interpret this one. If it's totally data driven, then yea, > you'd have > > to consider a type-voting mechanism to determine the likelihood that > > something at a particular reference is a 'value, a pointer, or either, > > or unknown'. My guess is semantics around the propagative path to the > > point of reference in many cases could determine if something is a > value > > or pointer or both without an exhaustive program search many times. > > Avoiding 'exhautive searches' at all costs and simply propagating > > 'unknown' at the point of resource exhaustion is practical here... > > > > 1. Our taint tracking is bidirectional and (given a useful slice) > > propagated all over the place. Taint 'sources' are propagated forward, > > and and taint 'sinks' are propagated backward. It does get tricky > doing > > this meaningfully over a wide berth of interprocedural activity, > and you > > do have to consider SMT-style path conditional requirements (what > things > > are known true to get this value over this path, reachability given > > dataflow contraints) but generally it does help with the 'does > untrusted > > input reach something that requires trusted input' question. > > > > 2. If something is not calculable in this process, due to some kind of > > funky 'obfuscation', typically we find that it's malware. The very > fact > > that a 'connect' call is being made to something that isn't > obvious, or > > comes from an obvious source, but rather some lengthy calculation, is > > usually a malware signature in its own right, at which point, it's an > > academic question of whether or not you care what it is 'actually > doing' > > rather than simply the fact that you really don't know what it is > doing. > > > > Cheers, > > > > --chris > > > > On Sun, Mar 8, 2015 at 11:45 AM, Andrew <mu...@mimisbrunnr.net > <mailto:mu...@mimisbrunnr.net> > > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>> wrote: > > > > In malware, sometimes assumptions about these conventions > hold, and > > sometimes they don't :( It can make recovery of a high level model > > challenging. You're definitely right about breaking the > problem down > > into a high level model and then analysis of that high level > model. I > > still think that the problem you have to solve to analyze the > high level > > model requires the same technique as to generate the high > level model. > > So my sense is that in general, this type of analysis is repeated > > iteration of an analysis routine towards convergence on a > fixed point, > > but there is no upper bound on the number of iterations and > this is > > where sadness enters. That might be too pessimistic. > > > > One example of challenges in recovery is analyzing native > haskell or > > ocaml code. Once you know the model it gets "real world easy" > since > > there are a limited number of ocaml native compilers, but > until you > > incorporate that things make no sense. The runtime that the code > > generator creates encodes pointers and integers by tagging the > low bit. > > If the low bit is set, the value is a pointer. So the nil > pointer is the > > constant 1, not the constant 0, which can make certain type of > inference > > operations more interesting, because iterative list traversals > now test > > cur = 1 instead of cur = 0. Once you know the trick this is less > > challenging, but with "binary" analysis, especially > adversarial binary > > analysis, there can be a bottomless abyss of these tricks for > you to > > discover one by one. > > > > Other languages use CPS-style code generation that is nothing > like code > > generated by C compilers (at least compiled ocaml is generally > quite > > imperative). For example, a program where everything you would > think of > > as call or return is implemented via "jmp [eax]". > > > > So producing this high level model for arbitrary programs is, > I think, > > quite challenging and is something that I've been working on, with > > varying degrees of success. There are other problems too, like > > adversarial compilers using undefined flag values for control > decisions > > and basing program logic off of undefined or undocumented CPU > behavior. > > These are things that "compilers" don't do, but "binaries" can do. > > > > I'm also not sure that taint tracking is the tool to solve the > problem > > from the OP, at least in a forward direction (which is usually > how I've > > seen taint tracking expressed? I guess you could have > backwards taint > > tracking). You want to produce a set of possible values for a > given call > > to connect() (and probably also link those to calls to > gethostbyname() > > and produce a set of possible values for calls to that function as > > well). So what values do you taint? > > > > The string content for the calls to gethostbyname() will almost > > certainly, in the adversarial case, be obfuscated, often in > ways that > > are not as simple to analyze as > > > > char *thec2host = deobfuscate(obfuscated_blob); > > > > where deobfuscate : char * -> char * and is pure. So some > degree of > > symbolic analysis and simplification will be required to > provide answers > > to the user about what the actual value of the string being > passed is. > > > > I think that in my original assessment I was too pessimistic, > possibly > > from remembering the years of my life that have gone into > tackling this > > problem so far. I think that progress could definitely be made > now on > > crossing a 50% threshold using static analysis techniques, mostly > > backwards SE and slicing. Going further than that will suck > though, and > > theoretical results tell us that once the adversary becomes > wise to this > > technique they have access to an infinite amount of cleverness > to attack > > the system with, and the system won't have a generic way to > respond. > > This is also a path to {sad|mad}ness. > > > > On 03/08/2015 02:17 PM, Christien Rioux wrote: > > > If you're not talking about the output of a compiler when you're > > talking > > > about analyzing binaries, I'm not sure what you're referring > to... > > Even > > > a human being writing raw assembly is a bad compiler :P > > > > > > Also, only our C++ analysis uses debug symbols for assistance > > (although > > > considerable work has been done on not requiring them, > getting funding > > > and time to build a generalized decompiler for C++ isn't > something > > that > > > seems to have the return on investment that one might hope.) > > > > > > All the other languages seem to come along just fine without > them. We > > > require the upload of debug symbols from our customers to > ensure that > > > they actually do have the code to the program they're analyzing. > > At the > > > same time, not everything we analyze (C++ included!) has > complete > > debug > > > symbols, so we have to get along without them for a good portion > > of the > > > program's analysis. (Many times, at least half of the enterprise > > > programs we look at have some commercial libraries and are > submitted > > > without debug symbols for these parts). > > > > > > Anyway, it's safe to assume that a compiler will make > certain choices > > > about layout. The trick lies in figuring out what > assumptions are safe > > > to make and which ones are not. Knowing what the compiler > must have > > > known at a certain point in the process lets you get at > that. Things > > > like 'calling conventions' are valid in certain states (external > > > function calls for example, but not internal calls that are > possibly > > > optimized), and compilers are/should-be deterministic in their > > output :) > > > Things like 'how vtables are laid out' and 'how structures are > > laid out' > > > doesn't really vary, even when code is optimized. Things > like 'base > > > pointer usage' can not be relied upon, but 'stack pointer > usage' is > > > baked into the notion of a 'call stack'. Getting things like > 'varargs > > > detection' right is a giant pain in the rump but it's > doable, (what > > > does it look like you call a varargs function, passing in as an > > argument > > > the return value from another varargs function? boggle) > > > > > > Regardless, the high-level model, once recovered, needs to > be complete > > > enough to be searchable in a meaningful way, and that's > really where > > > your original question gets started. So the problem should to be > > broken > > > down into "recovering a high level model" from the binary, and > > > "searching that model for something meaningful". Crossing > the streams > > > there is the path to madness. > > > > > > Anyway, those two problems have very different profiles, > which are you > > > more interested in tackling ? :) > > > > > > --chris > > > > > > > > > On Sun, Mar 8, 2015 at 10:47 AM, Andrew > <mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net> > > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>> > > > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net> > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>>> wrote: > > > > > > Your system is more "compiler output analysis" though, > because you make > > > assumptions about code/data layout as generated by a > compiler, and rely > > > on the presence of debug information, right? I seem to > recall this was > > > the case a few years ago but have not been keeping up to > date with what > > > your system has been doing since. > > > > > > On 03/08/2015 01:43 PM, Christien Rioux wrote: > > > > > > > > Veracode does this all day. > > > > > > > > We've been doing static binary analysis, based on a > combined data flow > > > > and control flow model, and solving the problem you're > discussing using > > > > a combination of demand pointer analysis and taint > tracking. > > > > > > > > I've got some slides on the Veracode binary modeling > system here: > > > > > > https://dl.dropboxusercontent.com/u/458169/Lessons%20Of%20Binary%20Analysis-SEATTLE.pptx > > > > > > > > This doesn't cover the actual 'DPA' or taint tracking, > but rather how to > > > > effectively model the binary in the first place. These > days, a number of > > > > other binary modeling systems exist, but they do not > necessarily cover > > > > all the things you've stated as problems. > > > > > > > > Feel free to message privately if you want to know > more, or if you think > > > > the questions are langsec-related specifically, reply > here :) > > > > > > > > Cheers, > > > > > > > > --chris > > > > > > > > On Sun, Mar 8, 2015 at 9:50 AM, Andrew > <mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net> > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>> > > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net> > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>> > > > > <mailto:mu...@mimisbrunnr.net > <mailto:mu...@mimisbrunnr.net> <mailto:mu...@mimisbrunnr.net > <mailto:mu...@mimisbrunnr.net>> > > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net> > <mailto:mu...@mimisbrunnr.net <mailto:mu...@mimisbrunnr.net>>>>> wrote: > > > > > > > > This problem is probably considered hard because > many have > > > tried and > > > > failed (myself included, yay), or at least have not > > achieved a > > > close > > > > enough approximation to success as to be satisfying. > > > > > > > > 1. > > > > a. The work I like the most is this system for > doing this > > > analysis on > > > > Android bytecode for malware analysis: > > > > > http://matt.might.net/papers/liang2013anadroid2.pdf . It > > might > > > seem > > > > "theory heavy" but they have empirical evaluations > that are > > > quite good. > > > > > > > > b. The Jakstab platform also kind of tackles > this, read > > > their paper > > > > set here and check out their code: > > > http://www.jakstab.org/documentation > > > > > > > > c. The closest we got to publishing our work on this > > system > > > was to > > > > patent it, it isn't the best reading but it is here: > > > > http://www.google.com/patents/US8533836 > > > > > > > > 2. > > > > If I were going to start this again, I'd build > on top > > of mcsema > > > > (https://github.com/trailofbits/mcsema) or bap > > > > (https://github.com/BinaryAnalysisPlatform/bap) > and use > > backwards > > > > symbolic execution from call sites to network > functions. > > This > > > will fail > > > > for a number of reasons: > > > > * Initially, control flow analysis. Neither bap > nor mcsema > > > will be > > > > able to deal with malware and self modifying code. > You'd > > > probably wind > > > > up building a control flow translation system on the > > front end > > > of the > > > > pipeline to do unpacking, either via concrete or > symbolic > > > execution, and > > > > storing a new control flow graph as a state-aware > > control flow > > > graph > > > > (addresses are now tuples of address and time to > > represent the > > > > mutability of the program). Then things would > proceed as > > normal > > > > * You'll discover that there will be a lot of > the program > > > involved in > > > > backwards SE that does not contribute to the > answer to the > > > original > > > > question. You might try and solve this by > computing data and > > > control > > > > dependence and performing slicing, but then you will > > find that... > > > > * ... the ability to do this analysis hinges on the > > > precision of your > > > > alias analysis. You'll try and implement VSA and > cry a lot. > > > Even once > > > > correctly implemented, VSA might diverge during > analysis. > > > > * The problem will also nest like a Matryoshka doll, > > > programs will > > > > compute the location of the get address / network > connect > > > functions by, > > > > on a good day, using facilities like > > dlopen/GetProcAddress and > > > on a bad > > > > day by walking data structures in memory. Even > > identifying the > > > locations > > > > where the calls occur will become its own problem > equal in > > > complexity to > > > > identifying the set of possible arguments. It's a > cycle of > > > violence and > > > > poverty with no end. > > > > * You'll discover that existing symbolic models of > > operating > > > systems > > > > are totally inadequate. There will be calls to > functions > > that > > > change the > > > > way the program interacts with the outside world in > > > fundamental ways and > > > > you'll need to figure out how to represent them > (what is the > > > symbolic > > > > effect of LoadLibrary? MapViewOfSection? > > ReadProcessMemory? ugh) > > > > * Maybe we could use dynamic execution? Ugh > that's cheating > > > though! > > > > Also what appliances like FireEye and open > sandboxes like > > > cuckoo already > > > > do. Oh well, maybe that's good enough? I'll go buy a > > FireEye. > > > > > > > > On 03/08/2015 06:01 AM, john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com> > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com>> > > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com> > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com>>> > > > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com> > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com>> > > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com> > > <mailto:john.wil...@pinfosec.com > <mailto:john.wil...@pinfosec.com>>>> wrote: > > > > > Greetings, > > > > > > > > > > I have lurked here long enough. I love this list > and the > > > perspective it > > > > > brings. > > > > > > > > > > I have a project idea of performing application > security > > > assessments on > > > > > binaries of unknown or questionable origin using one > > > specific objective: > > > > > > > > > > Determining where in the code network calls are > performed, > > > then tracing > > > > > back through the code to identify the > destination address > > > (hostname, IP, > > > > > or other). > > > > > > > > > > To me it seems that this is of most value, as > any malware > > > intent upon > > > > > stealing data or being part of a botnet must > > communicate via > > > the network > > > > > at some point. Surely there are other innovative > > methods of > > > > > communicating, but I am focused on the network > connection. > > > > > > > > > > Some of my security colleagues say that what I > want to > > do is > > > "too hard". > > > > > To me, this translates to: > > > > > > > > > > * It is an important problem to solve > > > > > * Hard problems are best solved by first properly > > > characterizing the > > > > > problem > > > > > * Once a hard problem is properly > characterized, then > > > solving it > > > > > becomes much easier > > > > > > > > > > While not directly related to language parsing, > this list > > > would seem to > > > > > best understand my perspective on the problem. > > > > > > > > > > Assume that the binary is capable of being reversed. > > > > > > > > > > This brings me to my questions for the list: > > > > > 1. Are you aware of anyone else that has tried > to do this? > > > If so, where > > > > > can I find details? > > > > > 2. Do you have any suggestions on where to start > or how to > > > go about > > > > > properly modeling this problem? > > > > > 3. Does anyone have the expertise and interest > in pursuing > > > such a project? > > > > > > > > > > I would start with various tools that facilitate > > analysis using > > > > > intermediate representation and control flow > graph data... > > > > > > > > > > Thanks, > > > > > John > > > > > LinkedIn.com <http://LinkedIn.com>\in\johnmwillis > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > langsec-discuss mailing list > > > > > langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>> > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>>> > > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>> > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>>>> > > > > > > https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss > > > > > > > > > _______________________________________________ > > > > langsec-discuss mailing list > > > > langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>> > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>>> > > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>> > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>>>> > > > > > > https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss > > > > > > > > > > > _______________________________________________ > > > langsec-discuss mailing list > > > langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>> > > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org> > > <mailto:langsec-discuss@mail.langsec.org > <mailto:langsec-discuss@mail.langsec.org>>> > > > > https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss > > > > > > > > > > > > _______________________________________________ langsec-discuss mailing list langsec-discuss@mail.langsec.org https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss