http://gwt-code-reviews.appspot.com/70802/diff/1/7 File eclipse/tools/simple-crawler/.project (right):
http://gwt-code-reviews.appspot.com/70802/diff/1/7#newcode1 Line 1: <?xml version="1.0" encoding="UTF-8"?> For some reason I can't import this into eclipse successfully. You might want to see if it is importable for you into a clean eclipse instance. http://gwt-code-reviews.appspot.com/70802/diff/1/2 File tools/simple-crawler/src/com/google/gwt/crawler/Settings.java (right): http://gwt-code-reviews.appspot.com/70802/diff/1/2#newcode92 Line 92: + " requires an argument"); No need for line break http://gwt-code-reviews.appspot.com/70802/diff/1/2#newcode119 Line 119: next_argument : while (!remainingArguments.isEmpty()) { Wow, slick java syntax http://gwt-code-reviews.appspot.com/70802/diff/1/2#newcode127 Line 127: if ((settings.sitemap.get() == null) && (settings.initUrl.get() == null)) { Add a blank line before and a comment describing the restrictions being enforced http://gwt-code-reviews.appspot.com/70802/diff/1/2#newcode160 Line 160: private List<Setting<?>> allSettings; Initialize to new ArrayList<Setting<?>> here http://gwt-code-reviews.appspot.com/70802/diff/1/2#newcode165 Line 165: } Remove this in favor of initializer above http://gwt-code-reviews.appspot.com/70802/diff/1/3 File tools/simple-crawler/src/com/google/gwt/crawler/SimpleCrawler.java (right): http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode16 Line 16: package com.google.gwt.crawler; I would suggest something like com.google.gwt.tools.simplecrawler so it doesn't sound overly official http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode37 Line 37: * PLEASE READ, THIS IS NOT THE USUAL YADI YADI. More commonly spelled "yada yada" or "yadda yadda" http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode43 Line 43: * is initialilzed either by one web site or by a sitemap, and it follows initialized "one web site" -> "a uri" ? http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode51 Line 51: * queue of pages yet to be crawled Style: capitalize and end with a period (same for comments below) Why is this public (same comment below) -- use the most private scope that gets the job done -- if access is needed from outside this class, consider accessor methods instead http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode57 Line 57: public static Set<String> alreadySeenSet = new TreeSet<String>(); TreeSet is typically slower than HashSet. Do you rely on a particular ordering? http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode61 Line 61: public static PrintWriter out; !public This could be initialized to new PrintWriter(System.out, true) right here. http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode69 Line 69: public static String getUrlContent(String urlString) { !public http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode74 Line 74: urlInputStream = url.openStream(); Collapse two line above into one line http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode80 Line 80: content = contentScanner.useDelimiter("\\A").next(); kewl http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode87 Line 87: return null; Print something and make the exception/stack trace available rather than dying silently http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode92 Line 92: * entry point Caps & period http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode95 Line 95: SAXException, IOException { This never actually throws IOException http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode118 Line 118: System.err.println("Could not open sitemap file: " + e.getMessage()); Wrap this try/catch more tightly around the code that can trigger it, i.e., the call to parseSitemap. Otherwise some future revision could create the opportunity for another IOException in the "big" try/catch block, and it would be incorrect to generate such a a specific error message here http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode124 Line 124: * creates a full Url from a hash fragment Caps & period http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode130 Line 130: String toReturn; Remove -- this is not C! http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode131 Line 131: String baseUrl = nextUrl.replaceAll("\\?.*", ""); Comment what is happening here (removing trailing ? stuff) http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode136 Line 136: toReturn = baseUrl + hashFragment; return baseUrl + hashFragment; http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode141 Line 141: * Map to crawler Url, which in short means #@: will be mapped to Stray colon? http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode147 Line 147: public static String mapToCrawlerUrl(String url) { Comment how this works, with examples http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode165 Line 165: public static String mapToOriginalUrl(String url) { Comment how this works, with examples http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode174 Line 174: * gets the content for all the Urls on the queue, extracts links from it, and Caps & period http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode198 Line 198: out.println(nextUrlContent); This could get pretty verbose http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode199 Line 199: String hrefPattern = "href=\"(.*)\""; What if the href appears in a comment, or in non-tag text? Also, does this match the shortest or longest pattern? If not guaranteed to be the shortest, you need to ensure the .* part excludes quotation marks, e.g., [^"]* http://gwt-code-reviews.appspot.com/70802/diff/1/3#newcode200 Line 200: Matcher matcher = Pattern.compile(hrefPattern).matcher(nextUrlContent); Pattern.compile should be called in a static initializer since the pattern never changes. It's a slow operation that should definitely not be in the inner loop of your app. http://gwt-code-reviews.appspot.com/70802/diff/1/4 File tools/simple-crawler/src/com/google/gwt/crawler/SimpleSitemapParser.java (right): http://gwt-code-reviews.appspot.com/70802/diff/1/4#newcode37 Line 37: public class SimpleSitemapParser { Document the file format, with an example. http://gwt-code-reviews.appspot.com/70802/diff/1/4#newcode63 Line 63: private static DefaultHandler parseSiteMap(final Queue<String> urlQueue) { Should rename, since this method doesn't actually parse, but instead returns an object that will be used for parsing later http://gwt-code-reviews.appspot.com/70802/diff/1/4#newcode65 Line 65: Delete empty line http://gwt-code-reviews.appspot.com/70802/diff/1/4#newcode86 Line 86: Delete empty line http://gwt-code-reviews.appspot.com/70802 --~--~---------~--~----~------------~-------~--~----~ http://groups.google.com/group/Google-Web-Toolkit-Contributors -~----------~----~----~----~------~----~------~--~---
