[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ]
Stephan Strittmatter updated NUTCH-20:
--------------------------------------
Description:
Some parsers return no Outlinks, e.g. the Word parser.
This class extracts (absolute) hyperlinks from a plain String
(content) and generates Outlinks from them.
This would be very useful for parsers that have no explicit extraction of
hyperlinks.
Example:
Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
will return an array of Outlinks containing the single element
"http://www.apache.org".
----
transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
submitted by: Stephan Strittmatter
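
For readers without the attachments at hand, the extraction described in this issue can be sketched roughly as follows. This is only a minimal illustration, not the attached OutlinkExtractor.java: the class name PlainTextUrlSketch, the method extractUrls, and the simplified regular expression are all assumptions made for this example, and it returns plain Strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the OutlinkExtractor attached to NUTCH-20.
public class PlainTextUrlSketch {

    // Simplified pattern (an assumption): a scheme, "://", then a run of
    // characters that are not whitespace, quotes, angle brackets or parentheses.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:https?|ftp)://[^\\s\"'<>()]+", Pattern.CASE_INSENSITIVE);

    // Returns every absolute URL found in the given plain text, in order.
    public static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // Drop sentence punctuation that directly follows the URL.
            String url = matcher.group().replaceAll("[.,;:!?]+$", "");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints [http://www.apache.org]
        System.out.println(extractUrls(
                "Nutch is located at http://www.apache.org and ..."));
    }
}
```

A real parser plugin would presumably wrap each match in an Outlink and might need a stricter URL grammar; the sketch only shows the text-scanning step.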
was:
transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
submitted by:
Stephan Strittmatter
Some parsers have no Outlinks returned. E.g. the
Word-Parser.
Environment:
> Extract urls from plain texts
> ------------------------------
>
> Key: NUTCH-20
> URL: http://issues.apache.org/jira/browse/NUTCH-20
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Trivial
> Attachments: OutlinkExtractor.java, OutlinkExtractor.java,
> OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt
>
> Some parsers return no Outlinks, e.g. the Word parser.
> This class extracts (absolute) hyperlinks from a plain String
> (content) and generates Outlinks from them.
> This would be very useful for parsers that have no explicit extraction of
> hyperlinks.
> Example:
> Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
> will return an array of Outlinks containing the single element
> "http://www.apache.org".
> ----
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
> submitted by: Stephan Strittmatter