Manuel,

We've run into a similar "java.lang.OutOfMemoryError" when larger PDFs (usually greater than 10MB in size) are stored in DSpace. Usually (at least in our case) these problematic PDFs were also originally scanned and OCRed.

The problem doesn't actually reside within DSpace itself, but within PDFBox (www.pdfbox.org), the open source tool that DSpace uses to extract the full text of PDFs. I've already contacted the person in charge of PDFBox and logged this as a bug in PDFBox:

http://sourceforge.net/tracker/index.php?func=detail&aid=1805929&group_id=78314&atid=552832

Unfortunately, at this point in time, I have NOT received any specific fix for this issue from PDFBox. I'm hoping that a new version of PDFBox will be released soon which fixes the problem, and that DSpace can be updated quickly. (As soon as I hear of a fix, I'm planning to push it into the next release of DSpace and get word out to this dspace-tech list.)

The only way to stop this error from happening continually is to *skip* the problematic PDF(s) during the 'filter-media' processing. The ability to skip files will be available in DSpace 1.5 (by passing a new '-s' flag to filter-media). However, I realize this won't be of help with DSpace 1.4.2. If you need a fix for 1.4.2 ASAP, I can create a quick patch for you that would provide this "skip" option in 'filter-media' for DSpace 1.4.2.

I know this is a *very* frustrating issue, as it essentially blocks all new DSpace content from being full-text indexed (since DSpace always does filter-media processing of bitstreams in the same order...and would always error out on that same problematic PDF).

- Tim

Manuel Antonio Echeverry Uribe wrote:
> Hello everybody.
>
> During an indexing process on one of our DSpace instances we are getting
> this particular error: "java.lang.OutOfMemoryError".
> Let me show you a summarized example of this.
> Currently DSpace is running on a server with 2GB of RAM, the Java VM is
> allowed to use 500MB of that RAM, and the server is not heavily loaded.
>
> Applying Media Filters
> Using configuration in /home/dspace/dspace/config/mediafilter.cfg
> Format: 'Adobe PDF' Filtering Class: 'org.dspace.app.mediafilter.PDFFilter'
> Format: 'HTML' Filtering Class: 'org.dspace.app.mediafilter.HTMLFilter'
> Format: 'Microsoft Word' Filtering Class: 'org.dspace.app.mediafilter.WordFilter'
> Format: 'Text' Filtering Class: 'org.dspace.app.mediafilter.HTMLFilter'
> Format: 'GIF' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> Format: 'JPEG' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> Format: 'image/png' Filtering Class: 'org.dspace.app.mediafilter.JPEGFilter'
> SKIPPED: bitstream 140 because 'gbell_prioridad-etica.pdf.txt' already exists
> SKIPPED: bitstream 141 because 'agarcia_globalizacion-sist-monetario.pdf.txt' already exists
> SKIPPED: bitstream 142 because 'alopez_ruta-sostenibilidad.pdf.txt' already exists
> .
> .
> .
> .
> .
> SKIPPED: bitstream 1456 because 'Factores_exito_negocios_artesanias_mexico.pdf.txt' already exists
> SKIPPED: bitstream 1458 because 'Diseno_sistema_experto_difuso.pdf.txt' already exists
> SKIPPED: bitstream 1460 because 'Tasa_cambio_gerenciable.pdf.txt' already exists
> SKIPPED: bitstream 1462 because 'Sociedad_colombiana_juegos_apuestas.pdf.txt' already exists
> SKIPPED: bitstream 1464 because 'Vol.23_No.104_TC.pdf.txt' already exists
> 2007-12-19 11:57:47,087 INFO org.dspace.content.Bundle @ anonymous::create_bundle:bundle_id=1467
> 2007-12-19 11:57:47,092 INFO org.dspace.content.Bundle @ anonymous::update_bundle:bundle_id=1467
> 2007-12-19 11:57:47,094 INFO org.dspace.content.Item @ anonymous::add_bundle:item_id=7,bundle_id=1467
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> ________________________________________
> Manuel Echeverry
> Dirección de servicios y recursos de información
> Soporte a Biblioteca
> Ext 747
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>

--
========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign
email: [EMAIL PROTECTED]
web: http://www.ideals.uiuc.edu
phone: (217) 333-4648
fax: (217) 244-7764
========================================
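
To illustrate the reply above, here is a minimal sketch of why a skip list unblocks a run that always processes bitstreams in the same order. This is not DSpace's filter-media code and not the planned 1.4.2 patch: the class SkipListSketch, the method filterAll, the extractor callback, and bitstream ID 1466 are all hypothetical names and values chosen for the example, and a simulated exception stands in for the real OutOfMemoryError.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.IntConsumer;

/** Hypothetical sketch only; not DSpace's actual filter-media implementation. */
final class SkipListSketch {

    /**
     * Runs the extractor over the bitstream IDs in their fixed order,
     * passing over any ID that appears on the skip list.
     * Returns the IDs that were actually processed.
     */
    static List<Integer> filterAll(List<Integer> bitstreamIds,
                                   Set<Integer> skipIds,
                                   IntConsumer extractor) {
        List<Integer> processed = new ArrayList<>();
        for (int id : bitstreamIds) {
            if (skipIds.contains(id)) {
                System.out.println("SKIPPED: bitstream " + id + " (on skip list)");
                continue;
            }
            // If this throws, the whole run stops here, and every later
            // bitstream stays unprocessed on each subsequent run.
            extractor.accept(id);
            processed.add(id);
        }
        return processed;
    }

    public static void main(String[] args) {
        // Stand-in for text extraction: the assumed ID 1466 plays the
        // role of the problematic PDF and always fails.
        IntConsumer extractor = id -> {
            if (id == 1466) {
                throw new IllegalStateException("simulated extraction failure");
            }
        };
        List<Integer> ids = List.of(1464, 1466, 1468);
        // With 1466 on the skip list, the bitstreams after it are reached.
        System.out.println(filterAll(ids, Set.of(1466), extractor));
    }
}
```

Without the skip list, the same call would fail at the problematic ID on every run and never reach the bitstreams after it, which is the blocking behaviour described in the reply.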

