We had tons of problems with filter-media until we switched from PDFBox to XPDF. With PDFBox, ours used to hang too and took forever to run. Since we switched over, filter-media takes a fraction of the time to complete and all of our documents filter, except for those that truly are corrupt.
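As a rough illustration of the idea (a sketch only, not our actual setup): the snippet below runs xpdf's pdftotext on a single PDF under a time limit, so a problem file gets reported instead of hanging the whole run. The function name check_pdf, the PDFTOTEXT and LIMIT variables and the 60-second default are made up for illustration and are not DSpace settings. It also assumes the coreutils timeout command is installed, which may not be the case on older systems.

```shell
#!/bin/sh
# Hypothetical helper: try to extract text from one PDF with a time limit.
# PDFTOTEXT and LIMIT are illustrative names, not DSpace configuration.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"
LIMIT="${LIMIT:-60}"

check_pdf() {
    # "-" as the output file makes pdftotext write the text to stdout;
    # the text is discarded because only the exit status matters here.
    timeout "$LIMIT" "$PDFTOTEXT" -q "$1" - >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "OK $1"
    elif [ "$status" -eq 124 ]; then
        # coreutils timeout exits with 124 when the time limit is hit
        echo "TIMEOUT $1"
    else
        echo "FAILED($status) $1"
    fi
}
```

To try it on a folder of PDFs (the path is a placeholder): for f in /path/to/pdfs/*.pdf; do check_pdf "$f"; done. Anything reported as TIMEOUT or FAILED is a candidate for a corrupt or problematic file.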
Take a look at http://www.foolabs.com/xpdf/index.html. Also Google "xpdf AND dspace" and you'll find detailed instructions on how to implement it. Btw, we are currently running DSpace 1.5.1.

Good luck,
Sue

-----Original Message-----
From: Sean Carte [mailto:[email protected]]
Sent: Wednesday, July 14, 2010 2:17 AM
To: dspace-tech
Subject: [Dspace-tech] filter-media hanging

I have a problem with filter-media apparently getting stuck processing a file. It ends up pegging the CPU at 100% until I kill the process. I've tried leaving it for a few days to complete, but it never does.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+  COMMAND
21853 dspace    20   0  411m 292m 8364 S  100  7.2   1782:59  java
27008 dspace    20   0  418m 299m 8368 S  100  7.4 343:01.51  java

r...@ir:~# ps -ef | grep 21853
dspace 21853 21847 99 Jul13 ? 1-05:43:53 java -Xmx256m -classpath :/dspace/lib/activation-1.1.jar:/dspace/lib/bcmail-jdk14-136.jar:/dspace/lib/bcprov-jdk14-136.jar:/dspace/lib/commons-cli-1.0.jar:/dspace/lib/commons-codec-1.3.jar:/dspace/lib/commons-collections-3.2.jar:/dspace/lib/commons-dbcp-1.2.2.jar:/dspace/lib/commons-fileupload-1.2.1.jar:/dspace/lib/commons-io-1.4.jar:/dspace/lib/commons-lang-2.2.jar:/dspace/lib/commons-logging-1.0.4.jar:/dspace/lib/commons-logging-1.0.jar:/dspace/lib/commons-pool-1.4.jar:/dspace/lib/dom4j-1.6.1.jar:/dspace/lib/dspace-api-1.5.3-20090716.011317-5.jar:/dspace/lib/dspace-api-1.5.3-SNAPSHOT.jar:/dspace/lib/dspace-api-lang-1.5.2.1.jar:/dspace/lib/embargo-api-1.0.3.jar:/dspace/lib/embargo-dspace-1.0.3.jar:/dspace/lib/fontbox-0.1.0.jar:/dspace/lib/handle-5.3.4.jar:/dspace/lib/handle-6.2.5.02.jar:/dspace/lib/icu4j-3.4.4.jar:/dspace/lib/jargon-1.4.25.jar:/dspace/lib/jaxen-1.1.jar:/dspace/lib/jdom-1.0.jar:/dspace/lib/jempbox-0.2.0.jar:/dspace/lib/log4j-1.2.14.jar:/dspace/lib/lucene-analyzers-2.3.0.jar:/dspace/lib/lucene-core-2.3.0.jar:/dspace/lib/mail-1.4.jar:/dspace/lib/mets-1.5.2.jar:/dspace/lib/oro-2.0.8.jar:/dspace/lib/pdfbox-0.7.3.jar:/dspace/lib/poi-2.5.1-final-20040804.jar:/dspace/lib/postgresql-8.1-408.jdbc3.jar:/dspace/lib/rome-0.8.jar:/dspace/lib/tm-extractors-0.4.jar:/dspace/lib/xalan-2.7.0.jar:/dspace/lib/xercesImpl-2.8.1.jar:/dspace/lib/xml-apis-1.3.02.jar:/dspace/lib/xmlParserAPIs-2.0.2.jar:/dspace/config org.dspace.app.mediafilter.MediaFilterManager
root 28484 18209 0 07:43 pts/1 00:00:00 grep 21853

r...@ir:~# ps -ef | grep 27008
dspace 27008 27002 99 02:00 ? 05:44:04 java -Xmx256m -classpath :/dspace/lib/activation-1.1.jar:/dspace/lib/bcmail-jdk14-136.jar:/dspace/lib/bcprov-jdk14-136.jar:/dspace/lib/commons-cli-1.0.jar:/dspace/lib/commons-codec-1.3.jar:/dspace/lib/commons-collections-3.2.jar:/dspace/lib/commons-dbcp-1.2.2.jar:/dspace/lib/commons-fileupload-1.2.1.jar:/dspace/lib/commons-io-1.4.jar:/dspace/lib/commons-lang-2.2.jar:/dspace/lib/commons-logging-1.0.4.jar:/dspace/lib/commons-logging-1.0.jar:/dspace/lib/commons-pool-1.4.jar:/dspace/lib/dom4j-1.6.1.jar:/dspace/lib/dspace-api-1.5.3-20090716.011317-5.jar:/dspace/lib/dspace-api-1.5.3-SNAPSHOT.jar:/dspace/lib/dspace-api-lang-1.5.2.1.jar:/dspace/lib/embargo-api-1.0.3.jar:/dspace/lib/embargo-dspace-1.0.3.jar:/dspace/lib/fontbox-0.1.0.jar:/dspace/lib/handle-5.3.4.jar:/dspace/lib/handle-6.2.5.02.jar:/dspace/lib/icu4j-3.4.4.jar:/dspace/lib/jargon-1.4.25.jar:/dspace/lib/jaxen-1.1.jar:/dspace/lib/jdom-1.0.jar:/dspace/lib/jempbox-0.2.0.jar:/dspace/lib/log4j-1.2.14.jar:/dspace/lib/lucene-analyzers-2.3.0.jar:/dspace/lib/lucene-core-2.3.0.jar:/dspace/lib/mail-1.4.jar:/dspace/lib/mets-1.5.2.jar:/dspace/lib/oro-2.0.8.jar:/dspace/lib/pdfbox-0.7.3.jar:/dspace/lib/poi-2.5.1-final-20040804.jar:/dspace/lib/postgresql-8.1-408.jdbc3.jar:/dspace/lib/rome-0.8.jar:/dspace/lib/tm-extractors-0.4.jar:/dspace/lib/xalan-2.7.0.jar:/dspace/lib/xercesImpl-2.8.1.jar:/dspace/lib/xml-apis-1.3.02.jar:/dspace/lib/xmlParserAPIs-2.0.2.jar:/dspace/config org.dspace.app.mediafilter.MediaFilterManager
root 28486 18209 0 07:43 pts/1 00:00:00 grep 27008

I've tried running it manually with the -v switch, but that doesn't offer me any clues as to the problem bitstream:

dsp...@ir:~$ /dspace/bin/filter-media -v
Applying Media Filters
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.HTMLFilter org.dspace.app.mediafilter.HTMLFilter
Full Filter Name: org.dspace.app.mediafilter.WordFilter org.dspace.app.mediafilter.WordFilter
Full Filter Name: org.dspace.app.mediafilter.JPEGFilter org.dspace.app.mediafilter.JPEGFilter
Full Filter Name: org.dspace.app.mediafilter.PDFFilter org.dspace.app.mediafilter.PDFFilter
SKIPPED: bitstream 3640 (item: 10321/287) because 'Matkovich_2004.pdf.txt' already exists
...
SKIPPED: bitstream 2164 (item: 10321/180) because 'TITLE PAGE.pdf.txt' already exists
ERROR filtering, skipping bitstream:
    Item Handle: 10321/460
    Bundle Name: ORIGINAL
    File Size: 567170
    Checksum: 9e17b9fd124ac43b34390203fb164f9c (MD5)
    Asset Store: 0
java.io.EOFException: Unexpected end of ZLIB input stream
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
    at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
    at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
    at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
    at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
    at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:141)
    at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:668)
    at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:570)
    at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:520)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:488)
    at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:427)
    at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
SKIPPED: bitstream 3831 (item: 10321/19) because 'Adam_2005.pdf.txt' already exists
...
SKIPPED: bitstream 3744 (item: 10321/20) because 'Vaithilingam_2005.pdf.txt' already exists
SKIPPED: bitstream 1198 (item: 10321/91) because 'Zulu_2006.txt' already exists
SKIPPED: bitstream 3768 (item: 10321/21) because 'Nijland_2005.pdf.txt' already exists

These are my settings in dspace.cfg:

#### Media Filter / Format Filter plugins (through PluginManager) ####
# Media/Format Filters help to full-text index content or
# perform automated format conversions

#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = PDF Text Extractor, HTML Text Extractor, \
                 Word Text Extractor, JPEG Thumbnail
# [To enable Branded Preview]: remove last line above, and uncomment 2 lines below
#                Word Text Extractor, JPEG Thumbnail, \
#                Branded Preview JPEG

#Assign 'human-understandable' names to each filter
plugin.named.org.dspace.app.mediafilter.FormatFilter = \
  org.dspace.app.mediafilter.PDFFilter = PDF Text Extractor, \
  org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
  org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
  org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
  org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG

#Configure each filter's input format(s)
filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.HTMLFilter.inputFormats = HTML, Text
filter.org.dspace.app.mediafilter.WordFilter.inputFormats = Microsoft Word
filter.org.dspace.app.mediafilter.JPEGFilter.inputFormats = BMP, GIF, JPEG, image/png
filter.org.dspace.app.mediafilter.BrandedPreviewJPEGFilter.inputFormats = BMP, GIF, JPEG, image/png

#Custom settings for PDFFilter
# If true, all PDF extractions are written to temp files as they are indexed...this
# is slower, but helps ensure that PDFBox software DSpace uses doesn't eat up
# all your memory
pdffilter.largepdfs = true
# If true, PDFs which still result in an Out of Memory error from PDFBox
# are skipped over...these problematic PDFs will never be indexed until
# memory usage can be decreased in the PDFBox software
pdffilter.skiponmemoryexception = true

I'm using DSpace 1.5.3 on Ubuntu 8.04.

Any ideas on how I can fix this?

Sean
--
Sean Carte
esAL Library Systems Manager
+27 72 898 8775
+27 31 373 2490
fax: 0866741254
http://esal.dut.ac.za/

------------------------------------------------------------------------------
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

