Re: [jira] Commented: (DROIDS-45) Fail to resovle outlink correctly

Thorsten Scherler Fri, 03 Apr 2009 01:06:33 -0700

On Thu, 2009-04-02 at 06:10 -0700, Mingfai Ma (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694995#action_12694995 ]
> 
> Mingfai Ma commented on DROIDS-45:
> ----------------------------------
> 
> the LinkExtractor doesn't append '/' automatically.


Hmm, I just asked Javier to have a look into this, since he was the
last one to work on the code. I will try to find some time to debug it
this weekend, since I need to finish a project at the moment.

One small thing I found in your class is the constructor: it is not
optimal, since we would be forced to create a new instance for each
base/link pair. That needs rethinking so the class can be reused.
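
To make the reuse point concrete, here is a minimal sketch of a stateless
resolver, so that no instance is needed per base/link. It also works
around the three cases from the issue description quoted below. The names
LinkResolver and resolve are hypothetical and not part of the Droids API,
and adding "/" only to a base with an empty path is an assumption on my
side (deeper paths are left untouched).

```java
import java.net.URI;

// Hypothetical helper, not part of Droids: everything is static, so the
// same class serves every base/link pair without per-link instances.
final class LinkResolver {

    private LinkResolver() {
        // static use only
    }

    /**
     * Resolves href against base. Throws IllegalArgumentException
     * (from URI.create) if the trimmed link is still not a valid URI.
     */
    static URI resolve(URI base, String href) {
        // Case 3: drop surrounding whitespace such as a trailing "\n".
        String link = href.trim();

        // Case 1: give a base with an authority but an empty path an
        // explicit root path, so "test.html" is not glued to the host.
        URI effectiveBase = base;
        String path = base.getRawPath();
        if (base.isAbsolute() && base.getRawAuthority() != null
                && path != null && path.isEmpty()) {
            StringBuilder sb = new StringBuilder();
            sb.append(base.getScheme()).append("://")
              .append(base.getRawAuthority()).append('/');
            if (base.getRawQuery() != null) {
                sb.append('?').append(base.getRawQuery());
            }
            effectiveBase = URI.create(sb.toString());
        }

        // Case 2: a query-only link keeps the base path and only
        // replaces the query, which is what browsers do.
        if (link.startsWith("?")) {
            String s = effectiveBase.toString();
            int hash = s.indexOf('#');
            if (hash >= 0) {
                s = s.substring(0, hash);
            }
            int query = s.indexOf('?');
            if (query >= 0) {
                s = s.substring(0, query);
            }
            return URI.create(s + link);
        }

        // Everything else goes through the normal java.net.URI resolution.
        return effectiveBase.resolve(URI.create(link));
    }
}
```

With this sketch, resolve(URI.create("http://www.domain.com"), "test.html")
gives http://www.domain.com/test.html, and "?test=true" against
http://www.domain.com/index.php gives
http://www.domain.com/index.php?test=true. Links that are still invalid
after trimming will throw, so the caller has to decide whether to skip
or log them.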

salu2


> and I think it shouldn't, as it is possible for a server to handle URLs with 
> and without '/' differently. For a root domain URL it may be OK, but for a 
> deeper URL we can't just assume the last segment of the request path is a 
> directory.
> 
> Apache mod_dir should append a trailing slash but unfortunately, not all web 
> servers on the internet have this feature enabled :-)
> http://httpd.apache.org/docs/2.2/mod/mod_dir.html
> 
> > Fail to resovle outlink correctly
> > ---------------------------------
> >
> >                 Key: DROIDS-45
> >                 URL: https://issues.apache.org/jira/browse/DROIDS-45
> >             Project: Droids
> >          Issue Type: Bug
> >          Components: core
> >    Affects Versions: 0.01
> >            Reporter: Mingfai Ma
> >
> > I've encountered several cases where outlinks are not extracted correctly. 
> > Most are caused by the use of URI.resolve(). 
> > 1. For a base URI of new URI("http://www.domain.com"), <a 
> > href="test.html">test.html</a> will be resolved to 
> > http://www.domain.comtest.html
> > 2. For a base URI of new URI("http://www.domain.com/index.php"), <a 
> > href="?test=true">test with param</a> will be resolved to 
> > http://www.domain.com/?test=true
> > 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve will 
> > throw an exception, whereas a browser can resolve the URI. (Remark: I 
> > didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> > I suspect there are many different scenarios, many of them probably 
> > caused by non-standard usage (but a crawler has to handle non-standard 
> > usage in order to function). Obviously, we cannot cater for every case, and I 
> > suggest we consider a resolve failure a bug if a link works in a Mozilla 
> > browser but not in the Droids LinkExtractor. 
> > This issue is related to the LinkExtractor created in DROIDS-8.
> 
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)
