[jira] [Created] (TIKA-1213) Parsing (extracting content) a single 5Mb pdf file takes 3minutes

Clemens Wyss (JIRA) Sun, 22 Dec 2013 06:19:14 -0800

Clemens Wyss created TIKA-1213:
----------------------------------

             Summary: Parsing (extracting content) a single 5Mb pdf file takes 
3minutes
                 Key: TIKA-1213
                 URL: https://issues.apache.org/jira/browse/TIKA-1213
             Project: Tika
          Issue Type: Bug
          Components: parser
         Environment: I guess not relevant (except for the pdf file)
+ Win7 (8G memory)
+ java 6
+ jira 1.5 (and 1.5 snapshot)
            Reporter: Clemens Wyss
            Priority: Critical



When I parse the attached PDF (extracting all of its content for Lucene), the 
extraction takes 3 minutes. The slowness is specific to this file, though I 
have other files that misbehave in the same way.

My (unit test) code looks like this:
...
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler( -1 );
ParseContext context = new ParseContext();
context.set( Parser.class, parser );
parser.parse( is, handler, metadata, context );
returnValue = handler.toString();
...
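
To put a number on the slowdown, or to stop a unit test from blocking for minutes, the extraction call could be run on a worker thread with a time limit. The following is only a minimal sketch using the standard java.util.concurrent classes. The class name TimedExtraction, the method runWithTimeout and the stand-in task in main are hypothetical and not part of Tika; the real task would wrap the parser.parse(...) call shown above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of Tika): runs an extraction task with a time limit.
public class TimedExtraction {

    // Runs the task on a worker thread and returns its result,
    // or null if the task does not finish within timeoutMillis.
    public static String runWithTimeout(Callable<String> task, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that never checks
            // the interrupt flag may keep running in the background.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // Stand-in task (assumption): replace the body with the parse call
        // from the report and return handler.toString().
        String text = runWithTimeout(() -> "extracted text", 60000);
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println((text == null ? "timed out" : "finished")
                + " after " + elapsedMillis + " ms");
    }
}
```

A null return signals that the limit was exceeded, which lets a test fail fast and log the offending file instead of hanging.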



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
