Aisha wrote:
Hi,
I tried the latest releases, nutch-2006-10-13.tar.gz and
nutch-2006-10-19.tar.gz,
but the NPE doesn't seem to be fixed. I always get the same exception
message for a lot of documents and a lot of formats: Excel, but Word and
PowerPoint too:
2006-10-19 16:41:09,265 WARN parse.ParseUtil - Unable to successfully parse
content file://C:/docs_a_indexer/test.doc of type application/msword
2006-10-19 16:41:09,265 WARN fetcher.Fetcher - Error parsing:
file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as Microsoft
document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved files
are unsupported at this time
Could you please help me, because the volume of rejected documents is
large.
The failure reason in the log means that you can't parse these files using
the lib-parsems plugins, because the files use a "fast save" format, which
is not supported.
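To get a feel for how many of the rejected documents are affected, a rough check like the one below may help. It is only a sketch: it assumes you have already extracted the "WordDocument" stream from the OLE2 container with some other tool (the function name is made up for illustration), and it relies on the fComplex bit (0x0004) in the 16-bit flags field at offset 0x0A of the file header, which Word sets on an incremental ("fast") save.

```python
import struct

FIB_FLAGS_OFFSET = 0x0A  # 16-bit little-endian flags field in the FIB header
F_COMPLEX = 0x0004       # set when the last save was incremental ("fast save")


def is_fast_saved(word_document_stream: bytes) -> bool:
    """Return True if the WordDocument stream is flagged as fast-saved.

    The argument must be the raw bytes of the "WordDocument" stream,
    not the whole .doc file (hypothetical helper, not part of Nutch).
    """
    if len(word_document_stream) < FIB_FLAGS_OFFSET + 2:
        raise ValueError("stream too short to contain a FIB header")
    (flags,) = struct.unpack_from("<H", word_document_stream, FIB_FLAGS_OFFSET)
    return bool(flags & F_COMPLEX)
```

Files for which this returns True are the ones that need a different parser, or need to be re-saved from Word with fast saves turned off.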
Your only option is to use some other external parser through the parse-ext
plugin.
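As a rough sketch of what such an external parser could look like: as far as I recall, parse-ext pipes the raw document to the command's standard input and takes whatever the command prints on standard output as the extracted text. The wrapper below assumes that behaviour, and it assumes a converter such as antiword is installed and can read your files; both points are assumptions you should verify. The names convert and main are made up for illustration.

```python
import os
import subprocess
import sys
import tempfile

# Assumption: "antiword" is on PATH. Any converter that takes a file path
# and prints plain text on stdout could be substituted here.
DEFAULT_CONVERTER = ("antiword",)


def convert(data: bytes, converter=DEFAULT_CONVERTER) -> bytes:
    """Write the document to a temp file, run the converter, return its stdout."""
    fd, path = tempfile.mkstemp(suffix=".doc")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        result = subprocess.run(
            list(converter) + [path],
            stdout=subprocess.PIPE,
            check=True,  # a non-zero exit from the converter raises an error
        )
        return result.stdout
    finally:
        os.remove(path)


def main() -> None:
    """Read the document from stdin and print the extracted text on stdout."""
    sys.stdout.buffer.write(convert(sys.stdin.buffer.read()))
```

Saved as a script that ends with a call to main(), this would then have to be registered as the command for application/msword in the parse-ext plugin's plugin.xml; check the copy shipped with your release for the exact attribute names.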
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com