I have a problem with filter-media apparently getting stuck processing
a file. It ends up pegging the CPU at 100% until I kill the process.
I've tried leaving it for a few days to complete, but it never does.
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
21853 dspace  20   0  411m 292m 8364 S  100  7.2   1782:59 java
27008 dspace  20   0  418m 299m 8368 S  100  7.4 343:01.51 java
r...@ir:~# ps -ef | grep 21853
dspace 21853 21847 99 Jul13 ? 1-05:43:53 java -Xmx256m
-classpath
:/dspace/lib/activation-1.1.jar:/dspace/lib/bcmail-jdk14-136.jar:/dspace/lib/bcprov-jdk14-136.jar:/dspace/lib/commons-cli-1.0.jar:/dspace/lib/commons-codec-1.3.jar:/dspace/lib/commons-collections-3.2.jar:/dspace/lib/commons-dbcp-1.2.2.jar:/dspace/lib/commons-fileupload-1.2.1.jar:/dspace/lib/commons-io-1.4.jar:/dspace/lib/commons-lang-2.2.jar:/dspace/lib/commons-logging-1.0.4.jar:/dspace/lib/commons-logging-1.0.jar:/dspace/lib/commons-pool-1.4.jar:/dspace/lib/dom4j-1.6.1.jar:/dspace/lib/dspace-api-1.5.3-20090716.011317-5.jar:/dspace/lib/dspace-api-1.5.3-SNAPSHOT.jar:/dspace/lib/dspace-api-lang-1.5.2.1.jar:/dspace/lib/embargo-api-1.0.3.jar:/dspace/lib/embargo-dspace-1.0.3.jar:/dspace/lib/fontbox-0.1.0.jar:/dspace/lib/handle-5.3.4.jar:/dspace/lib/handle-6.2.5.02.jar:/dspace/lib/icu4j-3.4.4.jar:/dspace/lib/jargon-1.4.25.jar:/dspace/lib/jaxen-1.1.jar:/dspace/lib/jdom-1.0.jar:/dspace/lib/jempbox-0.2.0.jar:/dspace/lib/log4j-1.2.14.jar:/dspace/lib/lucene-analyzers-2.3.0.jar:/dspace/lib/lucene-core-2.3.0.jar:/dspace/lib/mail-1.4.jar:/dspace/lib/mets-1.5.2.jar:/dspace/lib/oro-2.0.8.jar:/dspace/lib/pdfbox-0.7.3.jar:/dspace/lib/poi-2.5.1-final-20040804.jar:/dspace/lib/postgresql-8.1-408.jdbc3.jar:/dspace/lib/rome-0.8.jar:/dspace/lib/tm-extractors-0.4.jar:/dspace/lib/xalan-2.7.0.jar:/dspace/lib/xercesImpl-2.8.1.jar:/dspace/lib/xml-apis-1.3.02.jar:/dspace/lib/xmlParserAPIs-2.0.2.jar:/dspace/config
org.dspace.app.mediafilter.MediaFilterManager
root 28484 18209 0 07:43 pts/1 00:00:00 grep 21853
r...@ir:~# ps -ef | grep 27008
dspace 27008 27002 99 02:00 ? 05:44:04 java -Xmx256m
-classpath
:/dspace/lib/activation-1.1.jar:/dspace/lib/bcmail-jdk14-136.jar:/dspace/lib/bcprov-jdk14-136.jar:/dspace/lib/commons-cli-1.0.jar:/dspace/lib/commons-codec-1.3.jar:/dspace/lib/commons-collections-3.2.jar:/dspace/lib/commons-dbcp-1.2.2.jar:/dspace/lib/commons-fileupload-1.2.1.jar:/dspace/lib/commons-io-1.4.jar:/dspace/lib/commons-lang-2.2.jar:/dspace/lib/commons-logging-1.0.4.jar:/dspace/lib/commons-logging-1.0.jar:/dspace/lib/commons-pool-1.4.jar:/dspace/lib/dom4j-1.6.1.jar:/dspace/lib/dspace-api-1.5.3-20090716.011317-5.jar:/dspace/lib/dspace-api-1.5.3-SNAPSHOT.jar:/dspace/lib/dspace-api-lang-1.5.2.1.jar:/dspace/lib/embargo-api-1.0.3.jar:/dspace/lib/embargo-dspace-1.0.3.jar:/dspace/lib/fontbox-0.1.0.jar:/dspace/lib/handle-5.3.4.jar:/dspace/lib/handle-6.2.5.02.jar:/dspace/lib/icu4j-3.4.4.jar:/dspace/lib/jargon-1.4.25.jar:/dspace/lib/jaxen-1.1.jar:/dspace/lib/jdom-1.0.jar:/dspace/lib/jempbox-0.2.0.jar:/dspace/lib/log4j-1.2.14.jar:/dspace/lib/lucene-analyzers-2.3.0.jar:/dspace/lib/lucene-core-2.3.0.jar:/dspace/lib/mail-1.4.jar:/dspace/lib/mets-1.5.2.jar:/dspace/lib/oro-2.0.8.jar:/dspace/lib/pdfbox-0.7.3.jar:/dspace/lib/poi-2.5.1-final-20040804.jar:/dspace/lib/postgresql-8.1-408.jdbc3.jar:/dspace/lib/rome-0.8.jar:/dspace/lib/tm-extractors-0.4.jar:/dspace/lib/xalan-2.7.0.jar:/dspace/lib/xercesImpl-2.8.1.jar:/dspace/lib/xml-apis-1.3.02.jar:/dspace/lib/xmlParserAPIs-2.0.2.jar:/dspace/config
org.dspace.app.mediafilter.MediaFilterManager
root 28486 18209 0 07:43 pts/1 00:00:00 grep 27008
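As a sanity check on the figures above, the CPU times can be converted to hours. This is only a minimal Python sketch: the function name cpu_seconds is made up for illustration, and it assumes ps prints TIME as [DD-]HH:MM:SS and top prints TIME+ as minutes:seconds[.hundredths]. Process 27008 started at 02:00 and had accumulated 05:44:04 of CPU by the time of the 07:43 grep, so it appears to have been spinning more or less since launch.

```python
def cpu_seconds(value: str) -> float:
    """Convert a ps TIME or top TIME+ field to seconds.

    Assumed formats: ps -> [DD-]HH:MM:SS, top -> minutes:seconds[.hundredths].
    """
    days = 0
    if "-" in value:
        # ps prefixes whole days, e.g. "1-05:43:53"
        day_part, value = value.split("-", 1)
        days = int(day_part)

    parts = value.split(":")
    if len(parts) == 3:
        # ps style: hours, minutes, seconds
        hours, minutes, seconds = int(parts[0]), int(parts[1]), float(parts[2])
    elif len(parts) == 2:
        # top style: minutes may exceed 59, e.g. "1782:59"
        hours, minutes, seconds = 0, int(parts[0]), float(parts[1])
    else:
        raise ValueError(f"unrecognised CPU time: {value!r}")

    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the top and ps output above.
for label, field in [
    ("21853 (top)", "1782:59"),
    ("21853 (ps)", "1-05:43:53"),
    ("27008 (top)", "343:01.51"),
    ("27008 (ps)", "05:44:04"),
]:
    print(f"{label}: {cpu_seconds(field) / 3600:.2f} h of CPU")
```

The top and ps readings for each PID should roughly agree, since they were taken a short time apart.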
I've tried running it manually with the -v switch, but that doesn't
offer me any clues as to the problem bitstream:
dsp...@ir:~$ /dspace/bin/filter-media -v
Applying Media Filters
The following MediaFilters are enabled:
Full Filter Name: org.dspace.app.mediafilter.HTMLFilter
org.dspace.app.mediafilter.HTMLFilter
Full Filter Name: org.dspace.app.mediafilter.WordFilter
org.dspace.app.mediafilter.WordFilter
Full Filter Name: org.dspace.app.mediafilter.JPEGFilter
org.dspace.app.mediafilter.JPEGFilter
Full Filter Name: org.dspace.app.mediafilter.PDFFilter
org.dspace.app.mediafilter.PDFFilter
SKIPPED: bitstream 3640 (item: 10321/287) because 'Matkovich_2004.pdf.txt' already exists
...
SKIPPED: bitstream 2164 (item: 10321/180) because 'TITLE PAGE.pdf.txt' already exists
ERROR filtering, skipping bitstream:
Item Handle: 10321/460
Bundle Name: ORIGINAL
File Size: 567170
Checksum: 9e17b9fd124ac43b34390203fb164f9c (MD5)
Asset Store: 0
java.io.EOFException: Unexpected end of ZLIB input stream
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:223)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:141)
at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:141)
at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:668)
at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:570)
at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:520)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:488)
at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:427)
at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
SKIPPED: bitstream 3831 (item: 10321/19) because 'Adam_2005.pdf.txt' already exists
...
SKIPPED: bitstream 3744 (item: 10321/20) because 'Vaithilingam_2005.pdf.txt' already exists
SKIPPED: bitstream 1198 (item: 10321/91) because 'Zulu_2006.txt' already exists
SKIPPED: bitstream 3768 (item: 10321/21) because 'Nijland_2005.pdf.txt' already exists
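Because the verbose output above only shows bitstreams that were skipped or that failed, one way to narrow the search is to pull out the handles that do appear and note the last one logged before the run stalls. A rough Python sketch follows. The regular expressions are based only on the line shapes shown above, and summarize and SAMPLE are illustrative names, not part of DSpace.

```python
import re

# One pattern with two alternatives so matches come back in log order.
EVENT = re.compile(
    r"SKIPPED: bitstream (?P<bitstream>\d+) \(item: (?P<skipped>[^)\s]+)\)"
    r"|Item Handle: (?P<failed>\S+)"
)


def summarize(log_text):
    """Collect skipped and failed item handles from filter-media -v output."""
    skipped, failed, last_handle = [], [], None
    for match in EVENT.finditer(log_text):
        if match.group("skipped"):
            handle = match.group("skipped")
            skipped.append((int(match.group("bitstream")), handle))
        else:
            handle = match.group("failed")
            failed.append(handle)
        last_handle = handle  # most recent handle mentioned in the log
    return {"skipped": skipped, "failed": failed, "last_handle": last_handle}


# A few lines copied from the output above (wrapped lines are tolerated).
SAMPLE = """\
SKIPPED: bitstream 3640 (item: 10321/287) because
'Matkovich_2004.pdf.txt' already exists
ERROR filtering, skipping bitstream:
Item Handle: 10321/460
Bundle Name: ORIGINAL
SKIPPED: bitstream 3768 (item: 10321/21) because
'Nijland_2005.pdf.txt' already exists
"""

result = summarize(SAMPLE)
print("failed items:", result["failed"])
print("last handle logged:", result["last_handle"])
```

Whatever item comes after last_handle would be the first suspect, though that assumes the run visits items in a stable order, which is not confirmed here.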
These are my settings in dspace.cfg:
#### Media Filter / Format Filter plugins (through PluginManager) ####
# Media/Format Filters help to full-text index content or
# perform automated format conversions
#Names of the enabled MediaFilter or FormatFilter plugins
filter.plugins = PDF Text Extractor, HTML Text Extractor, \
Word Text Extractor, JPEG Thumbnail
# [To enable Branded Preview]: remove last line above, and uncomment 2 lines below
# Word Text Extractor, JPEG Thumbnail, \
# Branded Preview JPEG
#Assign 'human-understandable' names to each filter
plugin.named.org.dspace.app.mediafilter.FormatFilter = \
org.dspace.app.mediafilter.PDFFilter = PDF Text Extractor, \
org.dspace.app.mediafilter.HTMLFilter = HTML Text Extractor, \
org.dspace.app.mediafilter.WordFilter = Word Text Extractor, \
org.dspace.app.mediafilter.JPEGFilter = JPEG Thumbnail, \
org.dspace.app.mediafilter.BrandedPreviewJPEGFilter = Branded Preview JPEG
#Configure each filter's input format(s)
filter.org.dspace.app.mediafilter.PDFFilter.inputFormats = Adobe PDF
filter.org.dspace.app.mediafilter.HTMLFilter.inputFormats = HTML, Text
filter.org.dspace.app.mediafilter.WordFilter.inputFormats = Microsoft Word
filter.org.dspace.app.mediafilter.JPEGFilter.inputFormats = BMP, GIF, JPEG, image/png
filter.org.dspace.app.mediafilter.BrandedPreviewJPEGFilter.inputFormats = BMP, GIF, JPEG, image/png
#Custom settings for PDFFilter
# If true, all PDF extractions are written to temp files as they are indexed...this
# is slower, but helps ensure that PDFBox software DSpace uses doesn't eat up
# all your memory
pdffilter.largepdfs = true
# If true, PDFs which still result in an Out of Memory error from PDFBox
# are skipped over...these problematic PDFs will never be indexed until
# memory usage can be decreased in the PDFBox software
pdffilter.skiponmemoryexception = true
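Going by their comments, the two PDFFilter settings above deal with memory use rather than run time, so a file that keeps extraction busy indefinitely would not be caught by either. One possible workaround, sketched here in Python under stated assumptions, is to run a command under a wall-clock limit and kill it once the limit passes. run_with_timeout is a hypothetical helper, the command is left as a parameter, and the sketch is POSIX-only because it kills the whole process group (in case the command is a wrapper script that starts the JVM as a child).

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return "ok", "failed", or "timeout".

    The child gets its own session, so its process group id equals its pid
    and the whole group (wrapper script plus JVM) can be killed together.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill every process in the group
        proc.wait()                          # reap the child
        return "timeout"
    return "ok" if returncode == 0 else "failed"


# Stand-in command for demonstration; a real run would pass whatever
# command processes the items of interest, with a much larger limit.
status = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=10)
print("status:", status)
```

A timeout only contains the symptom. The file that triggers the loop would still need to be identified and fixed or excluded.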
I'm using DSpace 1.5.3 on Ubuntu 8.04.
Any ideas on how I can fix this?
Sean
--
Sean Carte
esAL Library Systems Manager
+27 72 898 8775
+27 31 373 2490
fax: 0866741254
http://esal.dut.ac.za/
_______________________________________________
DSpace-tech mailing list
https://lists.sourceforge.net/lists/listinfo/dspace-tech