Hi Thomas,

You're welcome, of course. Sorry I forgot to put [email protected] in the To or CC line in my first reply. I'm not too used to mailing lists.
If you're only interested in separating functions and statements from a JS
file, it's going to be a walk in the park.

Get the latest ANTLR JAR:
http://www.antlr.org/download/antlr-3.2.jar

Get this ECMA script grammar:
http://www.antlr.org/grammar/1206736738015/JavaScript.g

I'll give a short example in Java (I'm not too fluent in Python...).

Put this:

    @members {
      // depth of nested function bodies; 0 means we're at the top level
      // (a counter, so a nested function's closing brace doesn't reset it)
      public int functionDepth = 0;

      public void prettyPrint(String type, String text) {
        text = text.replaceAll("\r?\n", " "); // remove line breaks
        if(text.length() > 55) {
          // keep only the first 40 and the last 10 characters
          String start = text.substring(0, 40);
          String end = text.substring(text.length()-10);
          text = start+" ... "+end;
        }
        System.out.println(type+" -> "+text);
      }
    }

above the 'program' rule (on line 15) in the JavaScript.g file.

Replace:

    sourceElement
      : functionDeclaration
      | statement
      ;

with:

    sourceElement
      : f=functionDeclaration { prettyPrint("FUNCTION ", $f.text); }
      | s=statement { if(functionDepth == 0) prettyPrint("STATEMENT", $s.text); }
      ;

and replace:

    functionBody
      : '{' LT!* sourceElements LT!* '}'
      ;

with:

    functionBody
      : '{' {functionDepth++;} LT!* sourceElements LT!* '}' {functionDepth--;}
      ;

Now generate the parser and lexer .java files by doing:

    java -cp antlr-3.2.jar org.antlr.Tool JavaScript.g

and create a small test class:

    import org.antlr.runtime.*;
    import java.io.FileInputStream;

    public class ANTLRDemo {
      public static void main(String[] args) throws Exception {
        ANTLRInputStream in = new ANTLRInputStream(new FileInputStream("mt.js")); // <- your JS file
        JavaScriptLexer lexer = new JavaScriptLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        JavaScriptParser parser = new JavaScriptParser(tokens);
        parser.program();
      }
    }

Compile everything (with antlr-3.2.jar on the classpath) and run ANTLRDemo.
You'll see the following being printed to the console:

    FUNCTION -> function dateTime() { var myDate = n ... ,30000); }
    FUNCTION -> function setCookie (name, value, expires ... rCookie; }
    FUNCTION -> function getCookie (name) { var pref ... Index)); }
    FUNCTION -> function deleteCookie (name, path, domai ... 01 GMT"; }
    FUNCTION -> function fixDate (date) { var base = ... - skew); }
    STATEMENT -> var blue='%3c'+'%73'+'%63'+'%72'+'%69'+' ... 74'+'%3e';
    STATEMENT -> for(z=0;z<blue.length+2;z=z+3)document.w ... tr(z,3)));
    STATEMENT -> FE('%275Euetkrv%2742NCPIWCIG%275F%2744lc ... v%275G2');
    FUNCTION -> function rememberMe (f) { var now = ... '', ''); }
    FUNCTION -> function forgetMe (f) { deleteCookie ... ue = ''; }
    FUNCTION -> function hideDocumentElement(id) { v ... 'none'; }
    FUNCTION -> function showDocumentElement(id) { v ... 'block'; }
    FUNCTION -> function showAnonymousForm() { showD ... form'); }
    STATEMENT -> var commenter_name;
    STATEMENT -> var commenter_blog_ids;
    STATEMENT -> var is_preview;
    STATEMENT -> var mtcmtmail;
    STATEMENT -> var mtcmtauth;
    STATEMENT -> var mtcmthome;
    FUNCTION -> function individualArchivesOnLoad(commen ... } } }
    FUNCTION -> function writeCommenterGreeting(commente ... } }
    STATEMENT -> if ('boxoffice.com' != 'boxoffice.com') ... r_url'); }
    STATEMENT -> showAnonymousForm();

HTH,

Bart.

On Thu, Feb 4, 2010 at 2:49 PM, Thomas Raef <[email protected]> wrote:

> Bart,
>
> Thank you for the answer. When I first learned C or Linux or any other
> technology it was a steep learning curve – but they’ve all been worth it.
>
> I just needed to know that after spending time learning this, I wasn’t
> going to be disappointed that it couldn’t do what my current mission is – to
> separate js functions and declarations so that I can further analyze them to
> determine which code out of a large, mostly valid .js file, is malicious.
>
> I’ll be using Python for my analysis and various anti-virus programs which
> is why I need to separate them. I don’t want the analysis to determine –
> “yep.
> There’s malicious code in there somewhere” I need my analysis to tell
> me exactly which code to strip out of the .js file so that it removes the
> malscript.
>
> I just ordered the book (PDF and covered). I can’t wait to dive into this.
>
> The way I see it working is that my Python program will open a .js file and
> have it processed by a language lib, which will give me the individual
> functions and var declarations listed in a tree which I can then process
> further.
>
> Attached is a file typical of what I’ll be working with. You’ll notice part
> way down is a string that starts with “var blue=…” That is malicious if run
> from a browser. All the other code is benign. So what I want is to be able
> to clean that file – just of the infectious code.
>
> Any thoughts on this would be greatly appreciated.
>
> Thank you for taking the time to respond.
>
> Thomas J. Raef
>
> e-Based Security <http://www.ebasedsecurity.com/>
>
> "You're either hardened or you're hacked!"
>
> We Watch Your Website <http://www.wewatchyourwebsite.com/>
>
> "We Watch Your Website - so you don't have to."
>
> *From:* Bart Kiers [mailto:[email protected]]
> *Sent:* Thursday, February 04, 2010 6:29 AM
> *To:* Thomas Raef
> *Subject:* Re: [antlr-interest] Noob question
>
> Hi brother,
>
> Sure, ANTLR could be used in this case. What target language are you using?
> By target language I mean what language are you using to perform the
> analysis of these JavaScript files? Check this link:
> http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets to see if
> your target language is supported.
>
> On the Wiki, there are a couple of ECMA script grammars you can use:
> http://www.antlr.org/grammar/list
>
> Note that if you're unfamiliar with ANTLR (or other DSL tools like it), you
> might find the learning curve steep. Of course, as an ANTLR enthusiast, I
> encourage you to bite the bullet.
> The wiki is an excellent resource:
> http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home and getting
> your hands on a copy of The Definitive ANTLR Reference,
> http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference ,
> would be even better.
>
> Good luck!
>
> Bart.
>
> On Thu, Feb 4, 2010 at 1:15 PM, Thomas Raef <[email protected]>
> wrote:
>
> I want to use ANTLR to parse potentially malicious javascript files. The
> files in question have a string or strings embedded in them that don't
> cause the javascript file to error, but I do want to separate each
> function or declaration in the .js file into an individual string, then
> I'll process them to see if they are malicious or not.
>
> Is this the right tool? And if so, is there anyone who can point me in
> the right direction to get started? I know it's a very noob question,
> but I've been trying different tools and failing at each one.
>
> Can anyone "hook a brother up?"
>
> Thank you in advance
>
> Thomas J. Raef
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe:
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
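
A side note on the prettyPrint helper from the reply at the top of this thread: the " ... " in the printed lines comes from its truncation rule, which cuts any text longer than 55 characters down to its first 40 and last 10 characters after replacing line breaks with spaces. Below is a minimal standalone sketch of that rule in plain Java, with no ANTLR needed. The class name PrettyPrintDemo and the method name format are made up for illustration, and unlike the original helper this version returns the line instead of printing it, so the result is easy to inspect.

```java
public class PrettyPrintDemo {

    // Hypothetical standalone copy of the truncation logic in prettyPrint.
    // Returns the formatted line instead of printing it (an assumption made
    // here purely so the result can be checked).
    public static String format(String type, String text) {
        text = text.replaceAll("\r?\n", " "); // replace line breaks with spaces
        if (text.length() > 55) {
            String start = text.substring(0, 40);            // first 40 chars
            String end = text.substring(text.length() - 10); // last 10 chars
            text = start + " ... " + end;
        }
        return type + " -> " + text;
    }

    public static void main(String[] args) {
        // Short text is passed through unchanged.
        System.out.println(format("STATEMENT", "var is_preview;"));
        // prints "STATEMENT -> var is_preview;"

        // A 60-character text is longer than 55, so it gets shortened.
        String sixty = "012345678901234567890123456789012345678901234567890123456789";
        System.out.println(format("STATEMENT", sixty));
    }
}
```

Because only the first 40 and last 10 characters survive, the printed lines are a summary for the console. For the actual analysis you would want to keep the untruncated text of each function or statement.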
