WordML files are not converted by MSOffice2Text
-----------------------------------------------
Key: NXP-5590
URL: https://jira.nuxeo.org/browse/NXP-5590
Project: Nuxeo Enterprise Platform
Issue Type: Bug
Affects Versions: 5.3.2
Environment: Nuxeo 5.3.2 running from Tomcat. Windows 7 64 bit.
Reporter: Richard Louapre
Priority: Major
Attachments: 00000001.rar
Microsoft Office Word 2003 XML files (aka WordML) are not converted to plain
text by MSOffice2Text. Here is the full stack trace logged when I try to import
the attached file from the Files tab:
2010-09-08 12:03:17,782 ERROR [org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener] Error during MSOffice2Text conversion
org.nuxeo.ecm.core.convert.api.ConversionException: Error during MSOffice2Text conversion
    at org.nuxeo.ecm.core.convert.plugins.text.extractors.MSOffice2TextConverter.convert(MSOffice2TextConverter.java:59)
    at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:171)
    at org.nuxeo.ecm.core.convert.plugins.text.extractors.FullTextConverter.convert(FullTextConverter.java:72)
    at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:171)
    at org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener.blobsToText(BinaryTextListener.java:173)
    at org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener.handleEvent(BinaryTextListener.java:140)
    at org.nuxeo.ecm.core.event.impl.AsyncEventExecutor$Job.run(AsyncEventExecutor.java:137)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
    at java.lang.Thread.run(Thread.java:595)
Caused by: org.nuxeo.ecm.core.api.WrappedException: Exception: java.lang.IllegalArgumentException. message: Your InputStream was neither an OLE2 stream, nor an OOXML stream
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:88)
    at org.nuxeo.ecm.core.convert.plugins.text.extractors.MSOffice2TextConverter.convert(MSOffice2TextConverter.java:47)
    ... 9 more
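The root-cause message indicates that POI's ExtractorFactory accepts only OLE2
and OOXML input, whereas a WordML file is a single flat XML document. As a
rough illustration, the three kinds of input can be told apart from their
leading bytes. This is only a sketch under stated assumptions: the class and
method names (FormatSniffer, sniff) are hypothetical and are not part of Nuxeo
or POI, and the WordML check assumes the file is encoded in UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not Nuxeo or POI API.
public class FormatSniffer {
    // OLE2 compound document signature (legacy binary .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header "PK\3\4"; OOXML packages such as .docx are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };
    // Markers found near the top of a Word 2003 XML file.
    private static final String WORDML_PI = "<?mso-application progid=\"Word.Document\"?>";
    private static final String WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Classifies a file from its first bytes (a few hundred are enough). */
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return "OLE2";
        if (startsWith(head, ZIP_MAGIC)) return "ZIP";
        // Assumption: UTF-8 text; a UTF-16 WordML file would not match here.
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.contains(WORDML_PI) || text.contains(WORDML_NS)) return "WORDML";
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        String sample = "<?xml version=\"1.0\"?>\n" + WORDML_PI + "\n<w:wordDocument/>";
        System.out.println(sniff(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A check of this kind would let a converter route WordML input to an XML-based
extractor instead of handing it to ExtractorFactory.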
This issue prevents full-text search on Arabic documents that have been
processed by OCR when the OCR output is generated in this format.
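Since WordML is plain XML, its text can in principle be extracted without POI
by collecting the content of the w:t (text run) elements. The following is a
minimal sketch of that idea using only the JDK's SAX parser. It is not the
actual fix applied in Nuxeo: the class name WordMLTextSketch and the method
extractText are hypothetical, and the sketch ignores tables, tabs, headers and
other WordML constructs.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the Nuxeo converter.
public class WordMLTextSketch {
    // Namespace of the Word 2003 XML (WordML) vocabulary.
    static final String WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    /** Returns the text of all w:t elements, with a newline after each w:p. */
    static String extractText(InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        final StringBuilder out = new StringBuilder();
        parser.parse(in, new DefaultHandler() {
            private boolean inText; // true while inside a w:t element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (WORDML_NS.equals(uri) && "t".equals(localName)) inText = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!WORDML_NS.equals(uri)) return;
                if ("t".equals(localName)) inText = false;
                else if ("p".equals(localName)) out.append('\n'); // paragraph break
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<w:wordDocument xmlns:w=\"" + WORDML_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
            + "</w:body></w:wordDocument>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.print(extractText(in));
    }
}
```

Because the parser decodes the stream according to the XML declaration,
Arabic text inside w:t elements comes through as ordinary Java strings that a
full-text indexer could consume.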