Tim,

Thanks for the information; I had come across it as well.  I forgot to 
mention that we are running DSpace 1.8.2.

We were using the HTML extractor, but it runs into Java heap space errors on 
larger plain text documents, and then filter-media just hangs.  That is what 
started this process.  The log message for the error is below.  The file is 
29MB, which is not large for a file in general but is large for a plain text 
file.  It took several hours of processing (or failing to process) before it 
finally failed with an out-of-memory error.

Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at javax.swing.text.DefaultStyledDocument.create(DefaultStyledDocument.java:143)
        at javax.swing.text.html.HTMLDocument.create(HTMLDocument.java:472)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flushBuffer(HTMLDocument.java:3719)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flush(HTMLDocument.java:2523)
        at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:263)
        at javax.swing.text.DefaultEditorKit.read(DefaultEditorKit.java:149)
        at org.dspace.app.mediafilter.HTMLFilter.getDestinationStream(HTMLFilter.java:71)
        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:750)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:574)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:511)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:479)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:457)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCommunity(MediaFilterManager.java:441)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:347)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:183)

I will look into creating a MediaFilter extension that does nothing more than 
copy the ORIGINAL text document to the TEXT bundle for searching purposes.
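
As a starting point, here is a rough sketch of what such a filter might look 
like; it has not yet been tried against DSpace.  It is written as a standalone 
class so that it compiles without DSpace on the classpath; in a real 
deployment it would presumably extend org.dspace.app.mediafilter.MediaFilter.  
The class name PlainTextCopyFilter is hypothetical, and the method names and 
return values are my assumption of what the filter interface expects, modelled 
on HTMLFilter.  The idea is simply to hand the source stream back unchanged 
rather than building an in-memory document from it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical sketch. In DSpace this would extend
// org.dspace.app.mediafilter.MediaFilter; that is omitted here so the
// example compiles on its own.
class PlainTextCopyFilter {

    // Name for the derived bitstream (assumed ".txt" suffix convention).
    public String getFilteredName(String oldFilename) {
        return oldFilename + ".txt";
    }

    // Bundle that receives the copy; only the TEXT bundle is indexed.
    public String getBundleName() {
        return "TEXT";
    }

    // Bitstream format name for the copy (assumed value).
    public String getFormatString() {
        return "Text";
    }

    // Description attached to the copy (assumed value).
    public String getDescription() {
        return "Extracted text";
    }

    // Return the source stream untouched. Nothing is parsed or buffered
    // here, so memory use in this method does not grow with file size.
    public InputStream getDestinationStream(InputStream source) {
        return source;
    }

    // Small demonstration with an in-memory stream standing in for a bitstream.
    public static void main(String[] args) throws Exception {
        PlainTextCopyFilter filter = new PlainTextCopyFilter();
        InputStream in = new ByteArrayInputStream("plain text body".getBytes("UTF-8"));
        InputStream out = filter.getDestinationStream(in);

        byte[] buf = new byte[4096];
        int n;
        int total = 0;
        while ((n = out.read(buf)) != -1) {
            total += n;
        }
        System.out.println(filter.getFilteredName("report.txt")
                + " -> " + filter.getBundleName() + " bundle, " + total + " bytes");
    }
}
```

If this approach works, the filter would also need to be registered in 
dspace.cfg for the Text format in place of the HTMLFilter; I have not worked 
out those settings yet.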


Thanks for all the information and help; it is greatly appreciated.

Tom Autry
Coffing Corporation
3136 Presidential Drive
Fairborn, Ohio 45324
Office: 937-458-6100
Cell: 937-361-4680
Email: tom.au...@coffingco.com


-----Original Message-----
From: Tim Donohue [mailto:tdono...@duraspace.org]
Sent: Tuesday, September 18, 2012 9:58 AM
To: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Filter-media on TEXT files

Currently, only the TEXT bundle is indexed.

See:
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/search/DSIndexer.java#L1244

Obviously the ideal scenario here is to also index plain text files directly, 
but it doesn't look like it works that way.

HOWEVER, it is worth noting that plain text files should have their text 
"extracted" by the HTMLFilter.  See the dspace.cfg default settings:
https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L410

(Notice that the HTMLFilter is configured to run for HTML format and Text 
format)

So, currently, Text files should be indexed...but the full text of a plain text 
file is first duplicated to the TEXT bundle before indexing.
(Not ideal, but it should work)

Does that make sense?

- Tim

On 9/18/2012 8:50 AM, helix84 wrote:
> On Tue, Sep 18, 2012 at 3:46 PM, Mark H. Wood <mw...@iupui.edu> wrote:
>> I don't understand:  why would there be any need to extract plain
>> text from a bitstream that's already plain text?  Just index it.  The
>> point of text extraction is to create a plain-text bitstream for the
>> indexer to digest.
>
> Mark, does the indexer index text from plain text files in the ORIGINAL 
> bundle?
>
> Regards,
> ~~helix84
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond.
> Discussions will include endpoint security, mobile security and the
> latest in malware threats.
> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>
