On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995 ]
>
> Mingfai Ma commented on DROIDS-45:
> ----------------------------------
>
> the LinkExtractor doesn't append '/' automatically.

Hmm, I just asked Javier to have a look into this, since he was the last
one who worked with the code. I will try to find some time to debug this
at the weekend, since I need to finish a project ATM.

One small thing I found in your class is the constructor: it is not
optimal, since we would be forced to create a lot of instances (one for
each base/link). That needs rethinking so the class can be reused.

salu2

> and I think it shouldn't, as it is possible for a server to handle with and
> without '/' differently. For root domain URL, it may be ok. but for deeper
> URL, we can't just assume the last segment of the request path is a directory
>
> Apache mod_dir should append a trailing slash but unfortunately, not all web
> server on the internet have this feature enabled :-)
> http://httpd.apache.org/docs/2.2/mod/mod_dir.html
>
> > Fail to resovle outlink correctly
> > ---------------------------------
> >
> >                 Key: DROIDS-45
> >                 URL: https://issues.apache.org/jira/browse/DROIDS-45
> >             Project: Droids
> >          Issue Type: Bug
> >          Components: core
> >    Affects Versions: 0.01
> >            Reporter: Mingfai Ma
> >
> > I've encountered several cases that outlinks are not extracted correctly.
> > Most are cause by the use of URI.resolve().
> > 1. For a base URI of new URI("http://www.domain.com"), <a
> > href="test.html">test.html</a> will be resolved to
> > http://www.domain.comtest.html
> > 2. For a base URI of new URI("http://www.domain.com/index.php"), <a
> > href="?test=true">test with param</a> will be resolved to
> > http://www.domain.com/?test=true
> > 3. for <a href="http://www.yahoo.com\n">line break!</a>, URL.resolve will
> > throw exception. And in a browser, it can resolves the URI. (remarks: I
> > didn't check if this scenario affect the default Tika/NekoHTML parsing. )
> > I suspect there are many different scenarios, many of them are probably
> > caused by non-standard usage.
> > (but a crawler has to handle non-standard
> > usage in order to function) Obviously, we cannot cater every case, and I
> > suggest to consider a resolve failure as a bug if a link works in a Mozilla
> > browser but not in Droids LinkExtractor.
> > this issue is related to the LinkExtractor created in DROIDS-8
>
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad de la Información, S.A.U. (SADESI)
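
For reference, the three cases in the quoted report could be handled along the lines of the sketch below. It is only an illustration under assumptions: `LinkResolver` and its `resolve` method are hypothetical names, not part of the Droids code base or of the LinkExtractor from DROIDS-8. The method is static and stateless, so no instance has to be created per base/link pair. It appends '/' only when the base has no path at all (the root domain case) and leaves deeper URLs untouched.

```java
import java.net.URI;

/** Hypothetical helper for illustration only; not part of Droids. */
final class LinkResolver {

    private LinkResolver() {
        // Stateless utility: no instance per base/link pair is needed.
    }

    /**
     * Resolves href against base. Throws IllegalArgumentException if the
     * trimmed href is still not a syntactically valid URI.
     */
    static URI resolve(URI base, String href) {
        // Case 3: drop surrounding whitespace such as a trailing line break.
        String link = href.trim();

        // Case 1: give a base with an authority but an empty path a root
        // path, so a relative link is never glued onto the host name.
        URI effectiveBase = base;
        String path = base.getRawPath();
        if (base.isAbsolute() && base.getRawAuthority() != null
                && path != null && path.isEmpty()) {
            String query = base.getRawQuery() == null ? "" : "?" + base.getRawQuery();
            effectiveBase = URI.create(
                    base.getScheme() + "://" + base.getRawAuthority() + "/" + query);
        }

        // Case 2: a query-only link keeps the base path and replaces only
        // the query (the RFC 3986 behaviour that browsers follow).
        if (link.startsWith("?")) {
            String stem = effectiveBase.toString();
            int cut = firstIndexOf(stem, '?', '#');
            if (cut >= 0) {
                stem = stem.substring(0, cut);
            }
            return URI.create(stem + link);
        }

        // Everything else goes through the standard resolution.
        return effectiveBase.resolve(URI.create(link));
    }

    /** Index of the first occurrence of either character, or -1. */
    private static int firstIndexOf(String s, char a, char b) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == a || c == b) {
                return i;
            }
        }
        return -1;
    }
}
```

Called as `LinkResolver.resolve(URI.create("http://www.domain.com/index.php"), "?test=true")`, this keeps `index.php` in the result. Links that are still invalid after trimming raise an `IllegalArgumentException`, which a crawler would presumably want to catch and log rather than abort on.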
