Out of Memory Error with DIH and TikaEntityProcessor
----------------------------------------------------

                 Key: SOLR-2886
                 URL: https://issues.apache.org/jira/browse/SOLR-2886
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
    Affects Versions: 4.0
            Reporter: Tricia Williams
             Fix For: 4.0


I recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to 
apache-solr-4.0-2011-10-14_08-56-59.war, and then to 
apache-solr-4.0-2011-10-30_09-00-00.war, to index ~5300 PDFs of various sizes 
using the TikaEntityProcessor.  Under the June build, indexing ran to 
completion and was completely successful.  The only problem was poor 
readability of the full text in highlighting, which was fixed in Tika 0.10 
(TIKA-611).  I chose the October 14 build of Solr because Tika 0.10 had 
recently been included (SOLR-2372).

On the same machine, without changing any memory settings, the first problem I 
hit was a PermGen error.  Fine, I increased the PermGen space.

I've set the "onError" parameter to "skip" for the TikaEntityProcessor.  Now I 
get several (6) pairs of errors like this:

SEVERE: Exception thrown while getting data
java.net.SocketTimeoutException: Read timed out
SEVERE: Exception in entity : tika:org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url <url removed> # 2975

After ~3881 documents, with auto commit set unreasonably frequently, I 
consistently get an Out of Memory Error:

SEVERE: Exception while processing: f document : 
null:org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.OutOfMemoryError: Java heap space

The stack trace points to 
org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
and 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).

The October 30 build performs identically.

Funny thing is that monitoring via JConsole doesn't reveal any memory issues.
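
One way to cross-check the JConsole reading from inside the JVM is a small 
sketch like the one below.  It uses only the standard java.lang.management 
beans.  The class name HeapSnapshot and its two methods are hypothetical and 
are not part of Solr, DIH, or Tika.  It reports peak usage per memory pool as 
well as current usage, on the assumption that a short-lived spike could fall 
between the samples a monitoring tool takes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Solr: prints JVM memory figures.
public class HeapSnapshot {

    // Bytes of heap currently in use, as reported by the MemoryMXBean.
    public static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // One line per memory pool: current used, peak used, and max bytes.
    public static String describePools() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            MemoryUsage peak = pool.getPeakUsage();
            if (usage == null || peak == null) {
                continue; // the pool is no longer valid
            }
            sb.append(pool.getName())
              .append(" used=").append(usage.getUsed())
              .append(" peak=").append(peak.getUsed())
              .append(" max=").append(usage.getMax()) // -1 means undefined
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("heap used=" + usedHeapBytes());
        System.out.print(describePools());
    }
}
```

Run inside the same JVM as the import (for example from a test harness), the 
peak figures might show whether heap or PermGen briefly approaches its limit 
even when the sampled graph looks flat.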

Because the Out of Memory Error did not occur with the June build, I believe 
that a bug has been introduced into the code since then.
