Hi Andrzej,
Thank you for your reply.
I am getting a lot of exceptions. Could you please have a look at them and
tell me if there is a way to solve them:
- Error parsing: file:/C:/doc to index/conges.xls: failed(2,0): Can't be
handled as micrsosoft document.
org.apache.poi.hssf.record.RecordFormatException: Unable to construct record
instance, the following exception occured: null
- Error parsing: file:/C:/docs_a_indexer/doc1/test.doc: failed(2,0): Can't
be handled as micrsosoft document. java.util.NoSuchElementException
- Error parsing: file:/C:/docs_a_indexer/doc3/test.rtf: failed(2,0): Can't
be handled as micrsosoft document. java.io.IOException: Invalid header
signature; read 7015536635646467195, expected -2226271756974174256
- 2006-10-13 17:29:42,343 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: dsp
at java.net.URL.<init>(URL.java:574)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
at org.apache.nutch.parse.ms.MSBaseParser.getParse(MSBaseParser.java:84)
at org.apache.nutch.parse.msword.MSWordParser.getParse(MSWordParser.java:43)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
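A note on the test.rtf error in the list above: the two numbers in "Invalid header signature" can be decoded back into the bytes they were read from. The sketch below assumes the 8-byte file header is read as a little-endian long (the class and method names are illustrative, not part of Nutch or POI). Under that assumption, the "expected" value is the magic number of an OLE2 compound document and the "read" value is the ASCII text at the start of a plain RTF file, which would suggest that test.rtf is a genuine RTF file being handed to a parser that only understands the binary Word format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class HeaderSignature {
    // Lay out a 64-bit value as the 8 bytes it was read from.
    // Assumption: the header was read in little-endian byte order.
    static byte[] littleEndianBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Format bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        long read = 7015536635646467195L;      // "read" value from the log
        long expected = -2226271756974174256L; // "expected" value from the log

        // prints: D0 CF 11 E0 A1 B1 1A E1
        System.out.println(hex(littleEndianBytes(expected)));
        // prints: {\rtf1\a
        System.out.println(new String(littleEndianBytes(read), StandardCharsets.US_ASCII));
    }
}
```

If this reading is right, the file is not corrupt; it simply is not in the format the Microsoft parser expects.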
In the last error, the string after "unknown protocol: " is not always dsp;
it seems to be different in each case, and I don't understand what this
string means.
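For reference, this message comes from java.net.URL itself: the text after "unknown protocol: " is whatever precedes the first colon in the string passed to the URL constructor, when no handler is registered for that scheme. The stack trace shows the exception being raised while an Outlink is built from text extracted out of the document, so a likely (but unconfirmed) explanation is that the document contains text that merely looks like a link. A minimal sketch, where "dsp:something" is a made-up stand-in for such text and the class name is illustrative:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolDemo {
    // Returns the exception message for a candidate link,
    // or null if java.net.URL accepts it.
    static String urlError(String candidate) {
        try {
            new URL(candidate); // deprecated in recent JDKs, but still available
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical text fragment that resembles a URL with scheme "dsp".
        System.out.println(urlError("dsp:something"));           // unknown protocol: dsp
        // A scheme the JDK knows about is accepted.
        System.out.println(urlError("http://www.example.com/")); // null
    }
}
```

This would explain why the string differs from case to case: it depends on the text of each document, not on Nutch.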
Thank you very much.
Best regards,
Aïcha
Aisha wrote:
> Hi,
>
> I tried the latest releases, nutch-2006-10-13.tar.gz and
> nutch-2006-10-19.tar.gz,
> but the NPE doesn't seem to be fixed. I always get the same exception
> message for a lot of documents and a lot of formats: Excel, but Word and
> PowerPoint too:
>
> 2006-10-19 16:41:09,265 WARN parse.ParseUtil - Unable to successfully
> parse
> content file://C:/docs_a_indexer/test.doc of type application/msword
> 2006-10-19 16:41:09,265 WARN fetcher.Fetcher - Error parsing:
> file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as
> Microsoft
> document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved
> files
> are unsupported at this time
>
> Could you please help me? The volume of rejected documents is
> large.
>
The failure reason means that you can't parse these files using the
lib-parsems plugins, because they use the "fast save" format, which is not
supported.
Your only option is to use some other external parser through the parse-ext
plugin.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
--
View this message in context:
http://www.nabble.com/Problem-parsing-some-MS-Excel---other-formats-%28Office-2003%29-tf2408217.html#a6911914
Sent from the Nutch - Dev mailing list archive at Nabble.com.