Problem parsing MS Word document. It is fetched properly, but parsing does
not work. Please show me how I can parse it.
-----------------------------------------------------------------------------------------------------------------------------------
Key: NUTCH-157
URL: http://issues.apache.org/jira/browse/NUTCH-157
Project: Nutch
Type: Bug
Versions: 0.7
Environment: Windows
Reporter: karamjit
MS Word document is not being parsed.
Error messages:
Page from url Path in fetch
====file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
060301 173204 fetching file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL PROTECTED]
060301 173204 fetch of file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc failed with: java.lang.NoSuchMethodError: org.apache.poi.hpsf.SummaryInformation.getEditTime()J
060301 173204 Could not clean the content-type [], Reason is [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty]. Using its raw version...
060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL PROTECTED]
060301 173205 status: segment 20060301173203, 1 pages, 1 errors, 35840 bytes, 1000 ms
060301 173205 status: 1.0 pages/s, 280.0 kb/s, 35840.0 bytes/page
--
This message is automatically generated by JIRA.