On Mar 8, 2012, at 11:17 AM, Michal Zalewski wrote:

>> There are many SQLI patterns that are hard for automated tools to
>> find. This is an obvious point, so I'm sorry to be pedantic, but I
>> think a survey based on automated scanning is a misleading starting
>> point for the discussion.
>
> Well, the definition of a web application is a surprisingly
> challenging problem, too. This is particularly true for any surveys
> that randomly sample Internet destinations.
>
> Should all the default "it works!" webpages produced by webservers be
> counted as "web applications"? In naive counts, they are, but
> analyzing them for web app vulnerabilities is meaningless. In
> general, at what level of complexity does a "web application" begin,
> and how do you measure that when doing an automated scan?
>
> Further, if there are 100 IPs that serve the same www.youtube.com
> front-end to different regions, are they separate web applications?
> In many studies, they are. On the flip side, is a single physical
> server with 10,000 parked domains a single web application? Some
> studies see it as 10,000 apps.
[more about various subdomain configurations deleted]

This is actually a researched topic, but in the area of massive web
crawlers. The reason is that a crawler needs to balance:

* issuing parallel queries to different domains for performance, without
  overloading a single server that hosts many of them
* making forward progress across different subdomains, without being
  vulnerable to a spider-trap DNS that returns $(PRNG).example.com

The best paper on this so far is the IRLBot one:

H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to
6 Billion Pages and Beyond"
http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf

See sections 6 and 7 for their scheme to balance these priorities. It's
quite clever how they combine this with a disk-based queue to avoid
running into RAM limits. The result is a web crawler that saturates the
network link and has no weak points where it sits idle waiting for a
robots.txt response or the like.

On your topic, perhaps you can apply some of their algorithms + some
heuristics (exclude "it works" pages, find .php extensions, etc.) to get
a fair estimate of the number of web apps at the subdomain level. This
would leave out multiple web apps on a single subdomain, but at least
it's a start; rough sketches of both ideas are below.

-Nate
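
P.S. To make the budgeting idea concrete, here is a toy Python sketch of
per-domain budgets plus a per-host politeness delay. This is my own
simplification, not the paper's actual algorithms; the class name, the
budget numbers, and the crude PLD extraction are all made up for
illustration.

import time
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


def pay_level_domain(host):
    """Crude PLD extraction (assumes two-label PLDs; real code would
    consult a public-suffix list)."""
    parts = host.lower().split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


class BudgetedFrontier:
    def __init__(self, budget_per_pld=100, per_host_delay=2.0):
        self.budget_per_pld = budget_per_pld   # URLs per PLD per round
        self.per_host_delay = per_host_delay   # seconds between hits to one host
        self.spent = defaultdict(int)          # budget used this round, keyed by PLD
        self.next_ok = defaultdict(float)      # earliest next fetch time, keyed by host
        self.heap = []                         # (ready_time, seq, url)
        self.seq = 0

    def add(self, url):
        host = urlsplit(url).hostname or ""
        pld = pay_level_domain(host)
        if self.spent[pld] >= self.budget_per_pld:
            return False                       # over budget: defer to a later round
        self.spent[pld] += 1
        ready = max(time.time(), self.next_ok[host])
        self.next_ok[host] = ready + self.per_host_delay
        heapq.heappush(self.heap, (ready, self.seq, url))
        self.seq += 1
        return True

    def pop_ready(self):
        """Return the next URL whose politeness delay has expired, or None."""
        if self.heap and self.heap[0][0] <= time.time():
            return heapq.heappop(self.heap)[2]
        return None


if __name__ == "__main__":
    f = BudgetedFrontier(budget_per_pld=2, per_host_delay=0.0)
    for u in ["http://a.example.com/", "http://b.example.com/",
              "http://c.example.com/",        # deferred: example.com over budget
              "http://other.org/"]:
        print(u, "queued" if f.add(u) else "deferred")
    while (u := f.pop_ready()):
        print("fetch", u)

The point of the per-PLD budget is that a domain minting endless random
subdomains burns through its own allowance instead of starving the rest
of the frontier.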

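P.P.S. And a rough sketch of the heuristic side: guess whether a
subdomain hosts a real web app and collapse mirrors by hashing the front
page. The marker strings, the regex, and the host list are guesses on my
part, not anything taken from the paper.

import hashlib
import re
import urllib.request

DEFAULT_PAGE_MARKERS = [
    "it works!",                       # Apache default page
    "welcome to nginx",
    "iis windows server",
    "this domain is parked",           # one common parking phrase
]

DYNAMIC_HINTS = re.compile(
    r'\.(php|asp|aspx|jsp|cgi)\b|<form\b|set-cookie', re.IGNORECASE)


def fetch_front_page(host, timeout=10):
    req = urllib.request.Request("http://%s/" % host,
                                 headers={"User-Agent": "survey-bot/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        headers = str(resp.headers)
        body = resp.read(65536).decode("utf-8", "replace")
    return headers, body


def classify(hosts):
    """Return the set of content hashes that look like distinct web apps."""
    distinct_apps = set()
    for host in hosts:
        try:
            headers, body = fetch_front_page(host)
        except Exception:
            continue                              # unreachable host: skip
        if any(marker in body.lower() for marker in DEFAULT_PAGE_MARKERS):
            continue                              # default/parked page, not an app
        if not DYNAMIC_HINTS.search(headers + body):
            continue                              # no sign of dynamic content
        digest = hashlib.sha1(body.encode()).hexdigest()
        distinct_apps.add(digest)                 # mirrors collapse to one entry
    return distinct_apps


if __name__ == "__main__":
    # Hypothetical host list; in practice this would come from the crawl.
    print(len(classify(["www.example.com", "app.example.org"])), "apparent web apps")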