[ https://issues.apache.org/jira/browse/DROIDS-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mingfai Ma updated DROIDS-45:
-----------------------------
Attachment: LinkResolverTests.java
LinkResolver.java
There are some other cases:
- mailto: and news: links
- URL parameters containing spaces
- Unicode characters
I think this is still far from a full list of all the special scenarios.
Attached is my implementation of some custom link transformations that handle
more cases. The code could be moved to LinkExtractor if you think that's OK. I
don't use a SAX parser, so I don't use LinkExtractor. It would be good if the
URL/URI transformation and resolution could be refactored into a standalone
class.
Another thing is that I implemented some of the checks differently. I haven't
run any benchmark on a modern JDK, but I believe my approach, which uses
indexOf and avoids regex, is slightly more efficient.
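
As a rough illustration of the indexOf-based style of check described above,
a minimal sketch might look like the following. This is not the attached
LinkResolver.java; the class and method names are hypothetical, and it assumes
the crawler only wants to follow http, https and relative links.

```java
import java.util.Locale;

// Hypothetical sketch; not the attached LinkResolver.java.
class LinkFilterSketch {

    // Returns the lower-cased scheme of a link, or null if it has none.
    static String schemeOf(String link) {
        int colon = link.indexOf(':');
        if (colon <= 0) {
            return null;
        }
        // A ':' that comes after '/', '?' or '#' is part of the path,
        // query or fragment, not a scheme delimiter.
        for (int i = 0; i < colon; i++) {
            char c = link.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                return null;
            }
        }
        return link.substring(0, colon).toLowerCase(Locale.ROOT);
    }

    // Assumption: only http(s) and relative links are worth following,
    // so mailto:, news: and similar schemes are skipped.
    static boolean isCrawlable(String link) {
        String scheme = schemeOf(link.trim());
        return scheme == null || scheme.equals("http") || scheme.equals("https");
    }

    // Escapes literal spaces so that java.net.URI will accept the link.
    static String escapeSpaces(String link) {
        return link.indexOf(' ') < 0 ? link : link.replace(" ", "%20");
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("mailto:someone@example.com")); // false
        System.out.println(isCrawlable("page.html?time=10:30"));       // true
        System.out.println(escapeSpaces("search?q=a b"));              // search?q=a%20b
    }
}
```

Whether this is actually faster than a precompiled regex would need to be
measured; the sketch only shows the shape of the approach.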
> Fail to resolve outlink correctly
> ---------------------------------
>
> Key: DROIDS-45
> URL: https://issues.apache.org/jira/browse/DROIDS-45
> Project: Droids
> Issue Type: Bug
> Components: core
> Affects Versions: 0.01
> Reporter: Mingfai Ma
> Attachments: LinkResolver.java, LinkResolverTests.java
>
>
> I've encountered several cases where outlinks are not extracted correctly.
> Most are caused by the use of URI.resolve().
> 1. For a base URI of new URI("http://www.domain.com"), <a
> href="test.html">test.html</a> will be resolved to
> http://www.domain.comtest.html
> 2. For a base URI of new URI("http://www.domain.com/index.php"), <a
> href="?test=true">test with param</a> will be resolved to
> http://www.domain.com/?test=true
> 3. For <a href="http://www.yahoo.com\n">line break!</a>, URI.resolve()
> throws an exception, whereas a browser can resolve the URI. (Remark: I
> didn't check whether this scenario affects the default Tika/NekoHTML parsing.)
> I suspect there are many different scenarios, many of them probably caused
> by non-standard usage (but a crawler has to handle non-standard usage in
> order to function). Obviously, we cannot cater for every case, so I suggest
> treating a resolve failure as a bug if a link works in a Mozilla browser but
> not in the Droids LinkExtractor.
> This issue is related to the LinkExtractor created in DROIDS-8.
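
The three cases in the issue description can be worked around with a thin
wrapper around java.net.URI. The sketch below is only one possible approach,
not the attached LinkResolver.java; the class name LenientResolverSketch and
its resolve method are hypothetical. It trims whitespace (case 3), appends a
query-only reference to the base without its own query or fragment (case 2),
and gives a base with an empty path a "/" path before resolving (case 1).

```java
import java.net.URI;

// Hypothetical sketch; not the attached LinkResolver.java.
class LenientResolverSketch {

    // Throws IllegalArgumentException (from URI.create) if the link is
    // still not a valid URI after the clean-up below.
    static URI resolve(URI base, String href) {
        // Case 3: drop leading/trailing whitespace such as a line break.
        String link = href.trim();

        // Case 2: a query-only reference keeps the base path and replaces
        // only the query (and fragment) of the base.
        if (link.startsWith("?")) {
            String b = base.toString();
            int cut = b.indexOf('?');
            int hash = b.indexOf('#');
            if (cut < 0 || (hash >= 0 && hash < cut)) {
                cut = hash;
            }
            if (cut >= 0) {
                b = b.substring(0, cut);
            }
            return URI.create(b + link);
        }

        // Case 1: a hierarchical base with an empty path gets "/" so that
        // relative paths are not glued onto the host name.
        String path = base.getRawPath();
        if (path != null && path.isEmpty()) {
            base = base.resolve("/");
        }
        return base.resolve(URI.create(link));
    }

    public static void main(String[] args) {
        System.out.println(resolve(URI.create("http://www.domain.com"), "test.html"));
        System.out.println(resolve(URI.create("http://www.domain.com/index.php"), "?test=true"));
        System.out.println(resolve(URI.create("http://www.domain.com/"), "http://www.yahoo.com\n"));
    }
}
```

For case 2 this yields http://www.domain.com/index.php?test=true, which is
what the reference resolution examples in RFC 3986 expect for a query-only
reference. Links with embedded spaces or other illegal characters would still
need escaping before they reach URI.create.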