Manuel,

We've run into a similar "java.lang.OutOfMemoryError" when larger PDFs 
(usually greater than 10MB in size) are stored in DSpace. In our case, 
at least, the problematic PDFs also tend to be ones that were 
originally scanned and OCRed.
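
If it helps to spot likely candidates ahead of time, here is a rough 
sketch (not part of DSpace) that lists PDFs above a size threshold. It 
assumes you still have the original files in an ordinary directory, such 
as a staging area before ingest, with ".pdf" file names; the class and 
method names are made up for illustration, and the 10MB cutoff is only 
the rule of thumb mentioned above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a DSpace class.
final class LargePdfFinder {

    /** Returns every *.pdf file under root that is larger than minBytes, sorted by path. */
    static List<Path> findLargePdfs(Path root, long minBytes) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".pdf"))
                .filter(p -> {
                    try {
                        return Files.size(p) > minBytes;
                    } catch (IOException e) {
                        return false; // unreadable file: leave it out of the report
                    }
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long tenMb = 10L * 1024 * 1024; // assumed threshold, adjust as needed
        for (Path p : findLargePdfs(root, tenMb)) {
            System.out.println(Files.size(p) + "\t" + p);
        }
    }
}
```

Size alone is only a hint. A file on this list is not necessarily one 
that will fail, and a smaller scanned PDF could still cause trouble.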

The problem doesn't actually reside within DSpace itself, but within 
PDFBox (www.pdfbox.org), the open source tool that DSpace uses to 
extract the full text of PDFs. I've already contacted the person in 
charge of PDFBox and logged this as a bug in PDFBox:
http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

Unfortunately, I have NOT yet received any specific fix for this issue 
from PDFBox. I'm hoping that a new version of PDFBox will be released 
soon which fixes the problem, so that DSpace can be updated quickly. 
(As soon as I hear of a fix, I'm planning to push it into the next 
release of DSpace and get word out to this dspace-tech list.)

The only way to stop this error from recurring is to *skip* the 
problematic PDF(s) during 'filter-media' processing. The ability to 
skip files will be available in DSpace 1.5 (by passing a new '-s' flag 
to filter-media). However, I realize this won't help with DSpace 1.4.2. 
If you need a fix for 1.4.2 ASAP, I can create a quick patch for you 
that would provide this "skip" option in 'filter-media' for 
DSpace 1.4.2.

I know this is a *very* frustrating issue, as it essentially blocks all 
new DSpace content from being full-text indexed (since filter-media 
always processes bitstreams in the same order, it will always error out 
on that same problematic PDF).
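
To make the idea concrete, here is a toy sketch of why one bad file 
blocks everything after it, and how a skip list gets around that. This 
is NOT the actual filter-media code or the planned patch; every name in 
it is invented for illustration, and the failure is simulated with an 
ordinary exception rather than a real OutOfMemoryError.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntConsumer;

// Hypothetical illustration only; none of these names come from DSpace.
final class SkipListSketch {

    /**
     * Visits ids in their fixed order and hands each one to the extractor,
     * bypassing anything listed in skipIds. The first failure ends the run,
     * so every id after the failing one is left unprocessed.
     * Returns the ids that were processed successfully.
     */
    static List<Integer> filterAll(List<Integer> ids, Set<Integer> skipIds,
                                   IntConsumer extractor) {
        List<Integer> done = new ArrayList<>();
        try {
            for (int id : ids) {
                if (skipIds.contains(id)) {
                    continue; // on the skip list: never handed to the extractor
                }
                extractor.accept(id);
                done.add(id);
            }
        } catch (RuntimeException e) {
            // Simulated crash: the run stops here, as it would on a real error.
        }
        return done;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5);
        // Stand-in extractor that always fails on id 3 (the "problem PDF").
        IntConsumer extractor = id -> {
            if (id == 3) {
                throw new IllegalStateException("simulated failure on " + id);
            }
        };

        // No skip list: every run dies at 3, so 4 and 5 are never reached.
        System.out.println(filterAll(ids, Collections.emptySet(), extractor));
        // prints [1, 2]

        // With 3 on the skip list, the rest of the queue gets processed.
        System.out.println(filterAll(ids, new HashSet<>(Arrays.asList(3)), extractor));
        // prints [1, 2, 4, 5]
    }
}
```

Because the order never changes, rerunning without the skip list just 
hits the same file again; that is the whole reason skipping it is the 
only practical workaround for now.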

- Tim

Manuel Antonio Echeverry Uribe wrote:
> Hello everybody.
> 
> During an indexing process on one of our DSpace instances we are getting
> this particular error: "java.lang.OutOfMemoryError".
> Let me show you a summarized example of this.
> Currently DSpace is running on a server with 2GB of RAM, the Java VM is
> allowed to use 500MB of that RAM, and the server is not heavily loaded.
> 
> 
> 
> 
> Applying Media Filters
> Using configuration in /home/dspace/dspace/config/mediafilter.cfg
> Format: 'Adobe PDF' Filtering Class: 'org.dspace.app.mediafilter.PDFFilter'
> Format: 'HTML' Filtering Class: 'org.dspace.app.mediafilter.HTMLFilter'
> Format: 'Microsoft Word' Filtering Class:
> 'org.dspace.app.mediafilter.WordFilter'
> Format: 'Text' Filtering Class: 'org.dspace.app.mediafilter.HTMLFilter'
> Format: 'GIF' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> Format: 'JPEG' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> Format: 'image/png' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> SKIPPED: bitstream 140 because 'gbell_prioridad-etica.pdf.txt' already
> exists
> SKIPPED: bitstream 141 because
> 'agarcia_globalizacion-sist-monetario.pdf.txt' already exists
> SKIPPED: bitstream 142 because 'alopez_ruta-sostenibilidad.pdf.txt' already
> exists
> .
> .
> .
> .
> .
> SKIPPED: bitstream 1456 because
> 'Factores_exito_negocios_artesanias_mexico.pdf.txt' already exists
> SKIPPED: bitstream 1458 because 'Diseno_sistema_experto_difuso.pdf.txt'
> already exists
> SKIPPED: bitstream 1460 because 'Tasa_cambio_gerenciable.pdf.txt' already
> exists
> SKIPPED: bitstream 1462 because
> 'Sociedad_colombiana_juegos_apuestas.pdf.txt' already exists
> SKIPPED: bitstream 1464 because 'Vol.23_No.104_TC.pdf.txt' already exists
> 2007-12-19 11:57:47,087 INFO  org.dspace.content.Bundle @
> anonymous::create_bundle:bundle_id=1467
> 2007-12-19 11:57:47,092 INFO  org.dspace.content.Bundle @
> anonymous::update_bundle:bundle_id=1467
> 2007-12-19 11:57:47,094 INFO  org.dspace.content.Item @
> anonymous::add_bundle:item_id=7,bundle_id=1467
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> 
> ________________________________________
> Manuel Echeverry
> Dirección de servicios y recursos de información
> Soporte a Biblioteca
> Ext 747
> 
> 
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> 

-- 

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web:   http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax:   (217) 244-7764
========================================
