Post your conf/nutch-site.xml

Aïcha wrote:
> Hi, 
>
> I have a lot of parsing problems when I try to index my directory: only about
> 50% of the files were indexed.
>
> I asked on the nutch-dev list, but I am also trying nutch-user; perhaps
> somebody has had these problems and solved them.
>
> Here is a list of the main problems the parser encountered:
>
>   -  Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be 
> handled as micrsosoft document. 
> org.apache.poi.hssf.record.RecordFormatException: Unable to construct record 
> instance, the following exception occured: null 
>
>   - Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't 
> be handled as micrsosoft document. java.util.NoSuchElementException 
>   
>   - Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't 
> be handled as micrsosoft document. java.io.IOException: Invalid header 
> signature; read 7015536635646467195, expected -2226271756974174256 
>
>   - 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: dsp
>         at java.net.URL.<init>(URL.java:574)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
>         at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
>         at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
>         at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
>         at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
>
>
> In the last error, the string after "unknown protocol: " is not always dsp;
> it seems to be different in each case, and I don't understand what this
> string means.
>
> Thanks in advance,
>
> Best regards, 
> Aïcha
>   
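
On the test.rtf error quoted above: the two numbers in "Invalid header signature" can be read as the first eight bytes of a file taken as one little-endian long. The sketch below decodes both values. The class name SignatureBytes and its helpers are made up for illustration; this is not Nutch or POI code. The value that was read decodes to the ASCII text `{\rtf1\a`, the usual start of an RTF file, and the expected value decodes to D0 CF 11 E0 A1 B1 1A E1, the OLE2 signature found at the start of binary .doc and .xls files. That suggests the RTF file itself may be fine and was simply handed to the MS Word parser, which would be worth checking against the plugin configuration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class SignatureBytes {
    // Lay out a long as the 8 bytes it would occupy at the start of a file
    // when stored in little-endian order.
    static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long read = 7015536635646467195L;      // "read" value from the quoted error
        long expected = -2226271756974174256L; // "expected" value from the quoted error

        // prints: read as ASCII: {\rtf1\a
        System.out.println("read as ASCII: "
                + new String(toLittleEndianBytes(read), StandardCharsets.US_ASCII));
        // prints: expected as hex: D0 CF 11 E0 A1 B1 1A E1
        System.out.println("expected as hex: " + toHex(toLittleEndianBytes(expected)));
    }
}
```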

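On the "unknown protocol" error: the quoted stack trace shows the exception coming from the java.net.URL constructor, which rejects any string whose scheme (the part before the first colon) has no registered protocol handler. The sketch below reproduces that behavior in isolation. The class name and the sample strings are made up for illustration. If the outlink extractor picks up arbitrary "word:text" fragments from the document text as candidate URLs (an assumption, not something confirmed from the Nutch source here), that would explain why the word after "unknown protocol: " changes from file to file.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolDemo {
    // Returns the message java.net.URL reports for a string it cannot parse,
    // or null if the string is accepted as a URL.
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated on recent JDKs
    static String tryParse(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment with a scheme the JDK has no handler for.
        System.out.println(tryParse("dsp://some/text"));   // prints: unknown protocol: dsp
        // A scheme the JDK knows about parses without error.
        System.out.println(tryParse("http://example.com/")); // prints: null
    }
}
```

If that is what is happening, the error would be logged once per rejected candidate link and would not necessarily mean the document itself failed to parse.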

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
