[
https://issues.apache.org/jira/browse/SOLR-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15539674#comment-15539674
]
Alexandre Rafalovitch commented on SOLR-2886:
---------------------------------------------
Does this happen with the latest version of Solr/Tika? If not, or if it
cannot be reproduced, I suggest closing the case.
> Out of Memory Error with DIH and TikaEntityProcessor
> ----------------------------------------------------
>
> Key: SOLR-2886
> URL: https://issues.apache.org/jira/browse/SOLR-2886
> Project: Solr
> Issue Type: Bug
> Components: contrib - DataImportHandler, contrib - Solr Cell (Tika
> extraction)
> Affects Versions: 4.0-ALPHA
> Reporter: Tricia Jenkins
>
> I've recently upgraded from apache-solr-4.0-2011-06-14_08-33-23.war to
> apache-solr-4.0-2011-10-14_08-56-59.war and then to
> apache-solr-4.0-2011-10-30_09-00-00.war to index ~5300 PDFs of various
> sizes using the TikaEntityProcessor. Under the June build, indexing ran to
> completion and was completely successful. The only problem was that the
> full text was unreadable in highlighting, which was fixed in Tika 0.10
> (TIKA-611). I chose to use the October 14 build of Solr because Tika 0.10
> had recently been included (SOLR-2372).
> On the same machine, without changing any memory settings, my initial
> problem is a PermGen error. Fine, I increase the PermGen space.
> I've set the "onError" parameter to "skip" for the TikaEntityProcessor. Now
> I get several (6) pairs of these errors:
> SEVERE: Exception thrown while getting data
> java.net.SocketTimeoutException: Read timed out
> SEVERE: Exception in entity :
> tika:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception in invoking url <url removed> # 2975
> And after ~3881 documents, with auto commit set unreasonably frequently, I
> consistently get an Out of Memory Error:
> SEVERE: Exception while processing: f document :
> null:org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap space
> The stack trace points to
> org.apache.pdfbox.io.RandomAccessBuffer.expandBuffer(RandomAccessBuffer.java:151)
> and
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718).
> The October 30 build performs identically.
> The funny thing is that monitoring via JConsole doesn't reveal any memory
> issues. Because the Out of Memory Error did not occur with the June build,
> I believe a bug has been introduced into the code since then.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)