WordML files are not converted by MSOffice2Text
-----------------------------------------------

                 Key: NXP-5590
                 URL: https://jira.nuxeo.org/browse/NXP-5590
             Project: Nuxeo Enterprise Platform
          Issue Type: Bug
    Affects Versions: 5.3.2
         Environment: Nuxeo 5.3.2 running from Tomcat. Windows 7 64 bit.
            Reporter: Richard Louapre
            Priority: Major
         Attachments: 00000001.rar

Microsoft Office Word 2003 XML files (aka WordML) are not converted to plain 
text by MSOffice2Text. Here is the full stack trace logged when I try to import 
the attached file from the Files tab:

2010-09-08 12:03:17,782 ERROR [org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener] Error during MSOffice2Text conversion
org.nuxeo.ecm.core.convert.api.ConversionException: Error during MSOffice2Text conversion
        at org.nuxeo.ecm.core.convert.plugins.text.extractors.MSOffice2TextConverter.convert(MSOffice2TextConverter.java:59)
        at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:171)
        at org.nuxeo.ecm.core.convert.plugins.text.extractors.FullTextConverter.convert(FullTextConverter.java:72)
        at org.nuxeo.ecm.core.convert.service.ConversionServiceImpl.convert(ConversionServiceImpl.java:171)
        at org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener.blobsToText(BinaryTextListener.java:173)
        at org.nuxeo.ecm.core.storage.sql.coremodel.BinaryTextListener.handleEvent(BinaryTextListener.java:140)
        at org.nuxeo.ecm.core.event.impl.AsyncEventExecutor$Job.run(AsyncEventExecutor.java:137)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
        at java.lang.Thread.run(Thread.java:595)
Caused by: org.nuxeo.ecm.core.api.WrappedException: Exception: java.lang.IllegalArgumentException. message: Your InputStream was neither an OLE2 stream, nor an OOXML stream
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:88)
        at org.nuxeo.ecm.core.convert.plugins.text.extractors.MSOffice2TextConverter.convert(MSOffice2TextConverter.java:47)
        ... 9 more

This issue prevents full-text search on Arabic documents that have been OCRed, 
when the OCR output is generated in this format.
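
The "neither an OLE2 stream, nor an OOXML stream" message suggests that POI 
rejects the file because WordML is a single flat XML document rather than an 
OLE2 or zipped OOXML container. As a possible workaround, the text could be read 
directly from the XML. The following is only a minimal sketch using the JDK's 
DOM parser, not Nuxeo's or POI's actual code: the class name WordMLTextSketch 
and the method extractText are hypothetical, and it assumes the file uses the 
Word 2003 namespace with text stored in w:t elements inside w:p paragraphs.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nuxeo or Apache POI.
final class WordMLTextSketch {

    // Namespace of Word 2003 XML (WordML) documents.
    static final String WORDML_NS =
            "http://schemas.microsoft.com/office/word/2003/wordml";

    // Returns the text of all w:t elements, one line per w:p paragraph.
    static String extractText(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations to avoid external entity attacks.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(in);

        StringBuilder text = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(WORDML_NS, "t");
        Node lastParagraph = null;
        for (int i = 0; i < runs.getLength(); i++) {
            Node run = runs.item(i);
            Node paragraph = enclosingParagraph(run);
            // Start a new line whenever the enclosing paragraph changes.
            if (lastParagraph != null && paragraph != lastParagraph) {
                text.append('\n');
            }
            text.append(run.getTextContent());
            lastParagraph = paragraph;
        }
        return text.toString();
    }

    // Walks up the tree to the nearest w:p ancestor, or null if none exists.
    private static Node enclosingParagraph(Node node) {
        Node current = node.getParentNode();
        while (current != null) {
            if (WORDML_NS.equals(current.getNamespaceURI())
                    && "p".equals(current.getLocalName())) {
                return current;
            }
            current = current.getParentNode();
        }
        return null;
    }
}
```

A converter could try something like this when the blob is XML with a 
w:wordDocument root and keep using POI for everything else. Reading the 
document into a DOM keeps the sketch short but loads the whole file in memory; 
a streaming parser may be a better fit for large OCR output.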

