Your comment in JIRA reminded me of another issue I have encountered while working with Droids. When a web server serves a default index file, e.g. http://www.apache.org/index.html and http://www.apache.org/
these two links refer to the same document. Ideally, the crawler framework should have some way to manage this, though I don't have a clear suggestion in mind. If I recall correctly, one of the other crawlers actually made fake requests to test certain server behaviour, e.g. whether a 404 error comes back as a normal 200 page with an error message. Maybe there could be a single configuration option (disabled by default): if enabled, when a "/index.*" URI is encountered, the crawler makes a request to "/" to test whether the content is more or less the same. If a host is detected as serving a default index, URIs with the matched index file are treated as identical to "/". This is one approach we could consider; right now we have to code the handling logic in our own handler.

regards,
mingfai

On Thu, Apr 2, 2009 at 8:48 PM, Mingfai <[email protected]> wrote:

> thx, I looked at DROIDS-8 and DROIDS-11, and it seems it is better to
> create a new issue to keep track of resolve-failure cases:
> https://issues.apache.org/jira/browse/DROIDS-45
>
> Tika/NekoHTML is responsible for parsing and provides the SAX-event-based
> parser. The actual URI-resolving logic is implemented in Droids as of now.
> In fact, in the second-to-last comment on DROIDS-8, you said the
> LinkExtractor doesn't depend on Tika and should be moved to core.
>
> I wonder if you agree that, ideally, the default Droids LinkExtractor
> should behave the same as a Mozilla browser for basic HTML links? (For
> sure it can't handle JavaScript links.)
>
> re. the JDK bug, I'm using JDK 1.6.0_12 and got the problem. Maybe no one
> has ever reported the issue, or they just don't care. My point is, it
> seems we have to work around the bug rather than hoping Sun Microsystems
> will fix it and everybody will upgrade to the latest revision.
> :-)
>
> Regards,
> mingfai
>
>
> On Thu, Apr 2, 2009 at 8:19 PM, Thorsten Scherler <
> [email protected]> wrote:
>
>> On Thu, 2009-04-02 at 19:38 +0800, Mingfai wrote:
>> > let's just look at the specific case first. Maybe I jumped too soon to
>> > the conclusion that the link extraction feature is too simple.
>> >
>> > At line 139 of LinkExtractor.java, it uses URI.resolve(String) to
>> > resolve a URI:
>> >
>> >     if (!target.toLowerCase().startsWith("javascript")
>> >         && !target.contains(":/")) {
>> >       return base.getURI().resolve(target.split("#")[0]);  // line 139
>> >     } else if (!target.toLowerCase().startsWith("javascript")) {
>> >       return new URI(target.split("#")[0]);
>> >     }
>> >
>> > When I test the URI API with:
>> >
>> >     new URI("http://www.google.com").resolve("index.php")
>> >
>> > it resolves the URL to "http://www.google.comindex.php".
>> >
>> > If you didn't mean it is a bug in my JDK, then we need to specially
>> > prepend a "/".
>>
>> Hmm,
>>
>> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#resolve(java.net.URI)
>> "...
>> 3. Otherwise the new URI's authority component is copied from this URI,
>>    and its path is computed as follows:
>>
>>    A. If the given URI's path is absolute then the new URI's path is
>>       taken from the given URI.
>>
>>    B. Otherwise the given URI's path is relative, and so the new URI's
>>       path is computed by resolving the path of the given URI against
>>       the path of this URI. This is done by concatenating all but the
>>       last segment of this URI's path, if any, with the given URI's
>>       path and then normalizing the result as if by invoking the
>>       normalize method.
>> ..."
>>
>> That sounds like new URI("http://www.apache.org").resolve("index.html")
>> should return http://www.apache.org/index.html.
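[Editor's note: a quick demo of the resolve() behaviour discussed in the quoted exchange, together with the "/"-prefix workaround Mingfai suggests. This is only a sketch on a JDK exhibiting the quirk; the class and method names are mine, not Droids code.]

```java
import java.net.URI;

public class ResolveDemo {

    /** Resolve ref against base, first ensuring the base has at least a "/" path. */
    public static URI resolveSafely(URI base, String ref) {
        String path = base.getPath();
        if (path == null || path.isEmpty()) {
            // "http://host" -> "http://host/", so relative refs merge correctly
            base = base.resolve("/");
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.google.com");
        // A bare resolve() on a path-less base concatenates host and path:
        System.out.println(base.resolve("index.php"));        // http://www.google.comindex.php
        System.out.println(resolveSafely(base, "index.php")); // http://www.google.com/index.php
    }
}
```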
>> Since it reads "the result as if by invoking the normalize method":
>>
>> http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html#normalize()
>>
>> >
>> > And previously I found another scenario that doesn't work: when there
>> > is a link <a href="?test=true">test</a> under www.google.com/index.php,
>> > it resolves to www.google.com/?test=true rather than
>> > www.google.com/index.php?test=true as in a web browser.
>> >
>> > This makes me feel there are many special scenarios that a crawler
>> > needs to cater for. What do you think? Is it really so simple? My
>> > suggestion to add a page is for listing those special scenarios, which
>> > sometimes may just be caused by non-standard usage.
>>
>> Actually, that should normally be handled by the methods linked above.
>> Please comment on DROIDS-8/DROIDS-11 if you find that the link
>> extraction is not working as expected.
>>
>> salu2
>>
>> >
>> > regards,
>> > mingfai
>> >
>> >
>> > On Thu, Apr 2, 2009 at 7:24 PM, Thorsten Scherler <
>> > [email protected]> wrote:
>> >
>> > > On Thu, 2009-04-02 at 18:53 +0800, Mingfai wrote:
>> > > > hi,
>> > > >
>> > > > The default LinkExtractor seems to be quite simple (too simple). It
>> > > > mainly uses URI.resolve and only caters for the # and javascript
>> > > > scenarios (see LinkExtractor.java, getURI). Even simple link
>> > > > resolving of <a href="test.html"> against
>> > > > new URI("http://www.google.com") will be wrong, as it will return
>> > > > "http://www.google.comtest.html".
>> > >
>> > > Well, the link extraction always worked well.
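[Editor's note: the "?test=true" scenario quoted above comes from java.net.URI following RFC 2396, while browsers follow RFC 3986, under which a reference consisting only of a query string keeps the base path. A small helper covering just that case — the class and method names are mine, a sketch, not the Droids implementation:]

```java
import java.net.URI;
import java.net.URISyntaxException;

public class BrowserLikeResolver {

    /**
     * Resolve like a browser (RFC 3986): a reference that is only a query
     * string, e.g. "?test=true", keeps the base path instead of replacing it.
     */
    public static URI resolve(URI base, String ref) throws URISyntaxException {
        if (ref.startsWith("?")) {
            // Keep scheme, authority, and path from the base; take the query from the ref.
            return new URI(base.getScheme(), base.getAuthority(),
                           base.getPath(), ref.substring(1), null);
        }
        return base.resolve(ref);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI base = new URI("http://www.google.com/index.php");
        System.out.println(base.resolve("?test=true"));   // http://www.google.com/?test=true
        System.out.println(resolve(base, "?test=true"));  // http://www.google.com/index.php?test=true
    }
}
```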
The case you just pointed
>> > > out looks like a bug, BUT if you mean
>> > > new URL("http://testServer.com", "test.html") then have a look at
>> > > http://java.sun.com/j2se/1.4.2/docs/api/java/net/URL.html#URL(java.net.URL,
>> > > java.lang.String)
>> > >
>> > > > And there are many cases that URI.resolve doesn't cater for. It
>> > > > seems to me we need to do some work in this area to make Droids
>> > > > more usable. Does anyone have any experience in outlink extraction?
>> > >
>> > > Enhancements are always welcome, however the link extraction should
>> > > work fine. At least when I last looked at it, it was fine. The
>> > > limitation ATM is the extraction of jscript-generated links.
>> > >
>> > > > I'm trying to see how other frameworks handle outlink extraction
>> > > > and looked at:
>> > > > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java?view=log
>> > >
>> > > Funny enough, that has been the base of Droids' outlink extraction in
>> > > the first version I hacked.
>> > >
>> > > > https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix2/engine/src/main/java/org/archive/extractor/RegexpHTMLLinkExtractor.java
>> > > > (Heritrix's JavaDoc shows they have given some good thought to
>> > > > handling different tags and attributes)
>> > > >
>> > > > What do you think if I add a wiki page that lists some scenarios of
>> > > > outlink handling (i.e. the requirements)? Or does anyone know if
>> > > > any of the many Java crawler projects has documentation in this
>> > > > area?
>> > >
>> > > If you do not look into jscript/ajax link extraction then there is no
>> > > secret to it. Either go with XPath expressions or, e.g.
for plain text,
>> > > with regexps. Please feel free to open a wiki page around the issue.
>> > >
>> > > salu2
>> > >
>> > > >
>> > > > regards,
>> > > > mingfai
>> > >
>> > > --
>> > > Thorsten Scherler <thorsten.at.apache.org>
>> > > Open Source Java <consulting, training and solutions>
>> > >
>> > > Sociedad Andaluza para el Desarrollo de la Sociedad
>> > > de la Información, S.A.U. (SADESI)
>>
>> --
>> Thorsten Scherler <thorsten.at.apache.org>
>> Open Source Java <consulting, training and solutions>
>>
>> Sociedad Andaluza para el Desarrollo de la Sociedad
>> de la Información, S.A.U. (SADESI)
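[Editor's note: a rough sketch of the default-index detection suggested at the top of this mail. Everything here is hypothetical: PageFetcher and fetchBody() stand in for whatever HTTP client the crawler actually uses, and the "more or less the same" test is reduced to plain equality.]

```java
import java.util.regex.Pattern;

/**
 * Sketch: detect whether a host serves the same document for
 * "http://host/index.*" and "http://host/", so the two URIs can be
 * treated as identical during crawling.
 */
public class DefaultIndexDetector {

    // Matches a trailing "/index.<ext>" path segment, e.g. "/index.html".
    private static final Pattern INDEX_FILE =
            Pattern.compile("/index\\.[a-z]+$", Pattern.CASE_INSENSITIVE);

    private final PageFetcher fetcher;  // hypothetical fetch abstraction

    public DefaultIndexDetector(PageFetcher fetcher) {
        this.fetcher = fetcher;
    }

    /** True if the index URL serves the same body as the host root. */
    public boolean usesDefaultIndex(String indexUrl) {
        if (!INDEX_FILE.matcher(indexUrl).find()) {
            return false;  // not an "/index.*" URI at all
        }
        // "http://host/index.html" -> "http://host/"
        String rootUrl = INDEX_FILE.matcher(indexUrl).replaceFirst("/");
        String indexBody = fetcher.fetchBody(indexUrl);
        String rootBody = fetcher.fetchBody(rootUrl);
        return indexBody != null && indexBody.equals(rootBody);
    }

    /** Hypothetical minimal fetch interface (real code would use the crawler's HTTP client). */
    public interface PageFetcher {
        String fetchBody(String url);
    }
}
```

A real implementation would replace the equality check with a fuzzier similarity test and cache the per-host result, so the extra "/" request is made only once per host.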
