Aisha wrote:
Hi,

I tried the latest releases, nutch-2006-10-13.tar.gz and
nutch-2006-10-19.tar.gz,
but the NPE doesn't seem to be fixed. I always get the same exception
message for a lot of documents in a lot of formats: not only Excel, but Word and
PowerPoint too:

2006-10-19 16:41:09,265 WARN  parse.ParseUtil - Unable to successfully parse
content file://C:/docs_a_indexer/test.doc of type application/msword
2006-10-19 16:41:09,265 WARN  fetcher.Fetcher - Error parsing:
file:/C:/docs_a_indexer/test.doc: failed(2,0): Can't be handled as Microsoft
document. org.apache.nutch.parse.msword.FastSavedException: Fast-saved files
are unsupported at this time

Could you please help me? The volume of rejected documents is
large.

This failure means that you can't parse these files using the lib-parsems plugins, because they were saved in the "fast save" format, which is not supported.

Your only option is to use some other external parser through the parse-ext plugin.
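
As a rough illustration of what such an external parser could look like, here is a minimal Python sketch of a wrapper command. It rests on assumptions that are not confirmed in this thread: that parse-ext hands the raw document to the command on stdin and reads the extracted text from stdout, and that some converter is installed which takes a file path and prints plain text. The `antiword` default is only a placeholder, and whether it copes with fast-saved files has not been verified here. The names `extract_text`, `main` and `DEFAULT_CONVERTER` are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Placeholder converter (an assumption): any tool that accepts a file path
# as its last argument and prints the document's plain text to stdout.
DEFAULT_CONVERTER = ["antiword"]


def extract_text(data, converter=DEFAULT_CONVERTER):
    """Write the raw document bytes to a temp file, run the converter on it,
    and return whatever the converter printed, decoded as UTF-8."""
    fd, path = tempfile.mkstemp(suffix=".doc")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The temp file is closed here, so the converter can open it freely.
        result = subprocess.run(
            converter + [path], stdout=subprocess.PIPE, check=True
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.remove(path)


def main():
    """Entry point for use as an external command: document bytes arrive on
    stdin (assumed), extracted text goes to stdout."""
    data = sys.stdin.buffer.read()
    sys.stdout.write(extract_text(data))
    return 0
```

A script used as the external command would end with a call such as `sys.exit(main())`. If the converter exits with an error, `check=True` raises an exception, so the wrapper itself fails instead of returning empty text. Registering the command for application/msword in the parse-ext plugin configuration is not shown here.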

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

