Hi guys,
I have been running Nutch successfully with most of the plug-ins enabled.
Lately, I have been trying to index some XML files that contain strings of
the form ftawi:xyz.
Environment: Nutch version 0.8.2-dev on MS Windows Server 2003.
During outlink extraction I get the following error:
2007-04-17 21:52:51,598 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: ftawi
    at java.net.URL.<init>(Unknown Source)
    at java.net.URL.<init>(Unknown Source)
    at java.net.URL.<init>(Unknown Source)
    at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
    at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
    at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
    at org.apache.nutch.parse.stellent.StellentParser.getParse(StellentParser.java:53)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
I get the same error with all the parser plug-ins when running over the same
XML files. Is there a way to use a regular expression to tell Nutch which
kinds of URLs should be extracted as outlinks? Also, Nutch should not crash
when a URL in an outlink is not valid. Is there any other HTML parser in
Nutch that I can try?
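To make the behaviour I am asking for concrete, here is a minimal standalone sketch in plain Java. It is not Nutch code: the class name SafeOutlinkFilter, the candidate pattern and the list of allowed protocols are all my own assumptions for illustration. It only shows the idea of checking the protocol against a regular expression first, and of skipping a string when java.net.URL rejects it instead of letting the exception propagate.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class SafeOutlinkFilter {

    // Anything that looks like "scheme:rest" in free text (assumed pattern).
    private static final Pattern CANDIDATE =
        Pattern.compile("\\b([A-Za-z][A-Za-z0-9+.-]*):[^\\s<>\"']+");

    // Protocols we are willing to treat as outlinks (assumed whitelist).
    private static final Pattern ALLOWED =
        Pattern.compile("https?|ftp|file", Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Drop strings such as "ftawi:xyz" before they reach java.net.URL.
            if (!ALLOWED.matcher(m.group(1)).matches()) {
                continue;
            }
            try {
                new URL(m.group());   // validation only; the object is discarded
                links.add(m.group());
            } catch (MalformedURLException e) {
                // Skip the bad candidate and keep processing the document.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extract("id ftawi:xyz see http://example.com/a.xml"));
    }
}
```

With a filter like this, a string such as ftawi:xyz would simply be ignored, while ordinary http links would still be collected.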
Awaiting your kind reply.
Regards,
Armel
===========================
Armel T. Nene
iDNA Solutions LTD
Tel: +44 (20) 7257 6124
Mobile: +44 (7886)950 483
Web: http://www.idna-solutions.com
Blog: http://blog.idna-solutions.com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers