[ http://issues.apache.org/jira/browse/NUTCH-20?page=all ]
     
Jerome Charron closed NUTCH-20:
-------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

Revision 233559 - http://svn.apache.org/viewcvs.cgi?rev=233559&view=rev

* Add utility to extract URLs from plain text (thanks to Stephan Strittmatter)
* Use the OutlinkExtractor in the parse plugins PDF, MSWord, Text, RTF, Ext

Note: The JSParseFilter still needs a look to see how the OutlinkExtractor 
can be used in it.

>  Extract urls from plain texts
> ------------------------------
>
>          Key: NUTCH-20
>          URL: http://issues.apache.org/jira/browse/NUTCH-20
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>      Fix For: 0.8-dev
>  Attachments: OutlinkExtractor.java, OutlinkExtractor.java, 
> OutlinkExtractor.java, TestOutlink.java, TestOutlink.java, patch.txt
>
> Some parsers, e.g. the Word parser, return no Outlinks.
> This class extracts (absolute) hyperlinks from a plain String 
> (content) and generates outlinks from them.
> This would be very useful for parsers that have no explicit extraction of 
> hyperlinks.
> Example:
> Outlink[] links = OutlinkExtractor.getOutlinks("Nutch is located at http://www.apache.org and ...");
> This returns an array of Outlinks containing the single element 
> "http://www.apache.org".
> ----
> transferred from: 
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109328&group_id=59548&atid=491356
> submitted  by: Stephan Strittmatter
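
To illustrate the idea described in the issue above, here is a minimal sketch of 
pulling absolute URLs out of a plain string with a regular expression. It is not 
the attached OutlinkExtractor: the class name PlainTextUrlSketch, the method 
extractUrls, and the pattern itself are assumptions made for this example, and it 
returns plain strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
class PlainTextUrlSketch {

    // Assumed pattern: an http, https, or ftp scheme followed by any run of
    // characters that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

    // Returns every absolute URL found in the text, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        if (text == null) {
            return urls;
        }
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // Drop sentence punctuation that happens to trail the URL.
            String url = matcher.group().replaceAll("[.,;:!?)\\]]+$", "");
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        // prints [http://www.apache.org]
        System.out.println(extractUrls(
                "Nutch is located at http://www.apache.org and ..."));
    }
}
```

A parse plugin without its own link extraction could run such a helper over the 
extracted text and wrap each result in an Outlink. A regular expression like this 
only finds absolute URLs, which matches the "(absolute) hyperlinks" restriction 
stated in the issue.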

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
