Bernhard Geisberger created TIKA-3627:
-----------------------------------------

             Summary: OOXML parsing is not working as intended using multiple threads
                 Key: TIKA-3627
                 URL: https://issues.apache.org/jira/browse/TIKA-3627
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.2.0
            Reporter: Bernhard Geisberger


In the latest version (2.2.0), parsing of OOXML files is broken when multiple 
threads are used. I compared the call stacks of 2.1.0 and 2.2.0 and concluded 
that the cause is [this 
commit|https://github.com/apache/tika/commit/10d925439cd862f74679ec5fa9a9b5863f50ce2c],
 specifically line 86 of OOXMLExtractorFactory.

In version 2.1.0, `ExtractorFactory.setThreadPrefersEventExtractors(true)` is 
called in every `parse` call, so the thread-local property is set on every 
thread that parses. In version 2.2.0, the call is made in the static block, 
which runs only once, on the thread that first initializes the class. Every 
other thread keeps the default value (false). Effectively, this breaks the 
parsing of macros in OOXML files.
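
The effect can be reproduced without Tika or POI. The following is a minimal 
sketch, not the actual Tika code: `PREFERS_EVENT` and `Factory` are 
hypothetical stand-ins for the thread-local property and for a class that sets 
it in its static block.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalStaticInitDemo {

    // Hypothetical stand-in for the thread-local property; defaults to false.
    static final ThreadLocal<Boolean> PREFERS_EVENT =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Hypothetical stand-in for a factory that sets the flag in its static block.
    static class Factory {
        static {
            // Runs once, on whichever thread first initializes this class.
            PREFERS_EVENT.set(true);
        }

        static boolean flagSeenByCaller() {
            return PREFERS_EVENT.get();
        }
    }

    // Reads the flag from a newly created worker thread.
    static boolean flagOnFreshThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = Factory::flagSeenByCaller;
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Touch Factory on the main thread first so that its static block runs here.
        System.out.println("initializing thread: " + Factory.flagSeenByCaller());
        System.out.println("worker thread:       " + flagOnFreshThread());
    }
}
```

The order matters in this sketch: the static block runs on whichever thread 
touches `Factory` first, so only that thread sees the flag set.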

An easy workaround in version 2.2.0 is to call 
`ExtractorFactory.setAllThreadsPreferEventExtractors(true)` once, before Tika 
is first used.
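
Why a process-wide setting helps can be sketched as follows. This is a 
hypothetical model, not POI's implementation: it assumes that a global 
override, when set, takes precedence over the per-thread value. All names 
(`GlobalOverrideDemo`, `setAllThreadsPrefer`, `effectivePreference`) are 
invented for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GlobalOverrideDemo {

    // Hypothetical per-thread flag; defaults to false on every thread.
    private static final ThreadLocal<Boolean> PER_THREAD =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Hypothetical process-wide override; null means "not set".
    private static volatile Boolean allThreads = null;

    static void setAllThreadsPrefer(Boolean value) {
        allThreads = value;
    }

    // Assumption: the override wins when present, else the per-thread value applies.
    static boolean effectivePreference() {
        Boolean override = allThreads;
        return override != null ? override : PER_THREAD.get();
    }

    // Reads the effective preference from a newly created worker thread.
    static boolean preferenceOnFreshThread() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = GlobalOverrideDemo::effectivePreference;
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("before override: " + preferenceOnFreshThread());
        setAllThreadsPrefer(true); // done once at startup, before any parsing
        System.out.println("after override:  " + preferenceOnFreshThread());
    }
}
```

In a real application the equivalent step would be the single call named 
above, made during startup before any parser is created.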



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
