[ https://issues.apache.org/jira/browse/TIKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marichi Gupta updated TIKA-2908:
--------------------------------
Description:
I am using Apache Tika on Windows 10 with JRE 1.8.0_181, and I've imported Tika
using Maven with the following dependencies:
    <dependencies>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
      </dependency>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.21</version>
      </dependency>
    </dependencies>
I have the code below for performing OCR using Tesseract (which I have
independently tested and know to be working):
    public static void OCRTest() {
        try {
            BufferedImage im = ImageIO.read(new File(OCR_IMAGE));
            TesseractOCRConfig config = new TesseractOCRConfig();
            // Backslashes are doubled; a lone \t in a Java string would be a tab.
            config.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");
            config.setTesseractPath("C:\\Program Files\\Tesseract-OCR");
            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);
            TesseractOCRParser parser = new TesseractOCRParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            try {
                parser.parse(im, handler, metadata, parseContext);
                System.out.println(handler.toString());
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
I run into the following exception:
    org.apache.tika.exception.TikaException: Failed to close temporary resources
        at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:174)
        at org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:251)
        at test.test.App.OCRTest(App.java:46)
        at test.test.App.main(App.java:30)
    Caused by: java.nio.file.FileSystemException:
        C:\Users\m\AppData\Local\Temp\apache-tika-2643805894084124300.tmp: The process
        cannot access the file because it is being used by another process.
The tmp file is present in the Temp folder. I have the source code downloaded
and have stepped through it with the debugger; the error comes from attempting
to close the tmp file. There is another report in the Tika JIRA
(https://issues.apache.org/jira/browse/TIKA-1732) where someone else has run
into the same exception, although with the AutoDetectParser and not Tesseract.
Their issue seemed to be a conflict in their imported jars, but I run into this
issue even with only the Apache Tika libraries installed. I have a feeling this
is a concurrency issue, but I can't pinpoint the conflict.
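For reference, the "being used by another process" message is what Windows typically reports when a file is deleted while some handle to it is still open. The sketch below is not Tika code and is not meant as a diagnosis of this issue; it only shows that behaviour in isolation using plain JDK classes. The class name TempFileLockDemo and its methods are hypothetical, made up for this illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of Tika.
final class TempFileLockDemo {

    /** Tries to delete the file and returns false if the OS refuses. */
    static boolean tryDelete(Path file) {
        try {
            Files.delete(file);
            return true;
        } catch (IOException e) {
            System.out.println("delete failed: " + e);
            return false;
        }
    }

    /**
     * Creates a temp file, attempts to delete it while a java.io stream is
     * still open on it, then retries after the stream is closed.
     * Returns true if the file is gone at the end.
     */
    static boolean deleteAfterClose() {
        try {
            Path tmp = Files.createTempFile("tika-lock-demo-", ".tmp");
            boolean deletedWhileOpen;
            try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
                out.write(42);
                // The handle is still open here.
                deletedWhileOpen = tryDelete(tmp);
            }
            System.out.println("deleted while open: " + deletedWhileOpen);
            // With the handle closed, the delete is retried if it was refused.
            boolean gone = deletedWhileOpen || tryDelete(tmp);
            return gone && Files.notExists(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("gone after close: " + deleteAfterClose());
    }
}
```

On Windows the first delete attempt is expected to fail with a FileSystemException similar to the one above and succeed only after the stream is closed; on Linux or macOS the first attempt usually succeeds. If something comparable happens inside the parser, the question would be which handle on the tmp file is still open when dispose() runs.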
I don't run into this issue when using Tika's AutoDetectParser, only with
the TesseractOCRParser. This is an important part of an application I'm working
on, so I would really appreciate any insights on how to proceed.
> TikaException: Failed to close temporary resource - how to fix?
> ---------------------------------------------------------------
>
> Key: TIKA-2908
> URL: https://issues.apache.org/jira/browse/TIKA-2908
> Project: Tika
> Issue Type: Bug
> Components: ocr, parser
> Affects Versions: 1.21
> Reporter: Marichi Gupta
> Priority: Blocker
> Labels: ocr, tesseract, tika