Just an FYI: I created and implemented the following class, along with the needed changes in dspace.cfg, and now the text file *extractor* is working. It just copies the input to another stream, using the current FormatFilter examples and borrowing code from XPDFtoText. Hopefully someone else will find this useful as well. I haven't uploaded any code to DSpace before, so if this is something that would be useful, let me know.
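For anyone skimming, the core idea of "copies input to another stream" can be shown without any DSpace classes. What follows is only a minimal sketch in plain Java, not the filter itself. The names StreamCopySketch, copy and copyToMemory are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical, standalone illustration; not part of DSpace.
final class StreamCopySketch {

    // Copy every byte from 'in' to 'out' through a fixed-size buffer
    // and return the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    // Read the whole source into memory, close it, and hand the bytes
    // back as a fresh InputStream.
    static InputStream copyToMemory(InputStream source) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            copy(source, sink);
        } finally {
            source.close();
        }
        return new ByteArrayInputStream(sink.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "plain text stays plain text\n".getBytes("UTF-8");
        InputStream copied = copyToMemory(new ByteArrayInputStream(original));
        ByteArrayOutputStream check = new ByteArrayOutputStream();
        long count = copy(copied, check);
        System.out.println(count + " bytes copied, identical="
                + java.util.Arrays.equals(original, check.toByteArray()));
    }
}
```

Note that this keeps the whole file in memory as byte arrays and does no parsing at all; the sketch makes no attempt to handle files larger than the available heap.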
dspace.cfg updates:

Add to plugin.named.org.dspace.app.mediafilter.FormatFilter:

    org.dspace.app.mediafilter.DocxFilter = DocX Extractor, \

Add under the "configure each filter's input format" section:

    filter.org.dspace.app.mediafilter.TextFilter.inputFormats = Text

Added TextFilter.java to the mediafilter directory and built/deployed:

    package org.dspace.app.mediafilter;

    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.log4j.Logger;
    import org.dspace.core.Utils;

    /**
     * Media filter that copies a plain-text bitstream unchanged so it
     * can be stored in the TEXT bundle.
     *
     * @author Tom Autry
     */
    public class TextFilter extends MediaFilter
    {
        private static Logger log = Logger.getLogger(TextFilter.class);

        public String getFilteredName(String oldFilename)
        {
            return oldFilename + ".txt";
        }

        /**
         * @return String bundle name
         */
        public String getBundleName()
        {
            return "TEXT";
        }

        /**
         * @return String bitstreamformat
         */
        public String getFormatString()
        {
            return "Text";
        }

        /**
         * @return String description
         */
        public String getDescription()
        {
            return "Extracted text";
        }

        /**
         * @param sourceStream
         *            source input stream
         *
         * @return InputStream the resulting input stream
         */
        public InputStream getDestinationStream(InputStream sourceStream)
                throws Exception
        {
            File sourceTmp = File.createTempFile("DSfilt", ".txt");
            sourceTmp.deleteOnExit(); // extra insurance, we'll delete it here.
            int status = -1;

            try
            {
                // make a local temp copy of the source text file (this step
                // is carried over from the XPDFtoText code, where the PDF
                // tools require a file for random access).
                // XXX fixme could optimize if we ever get an interface to grab asset *files*
                OutputStream sto = new FileOutputStream(sourceTmp);
                Utils.copy(sourceStream, sto);
                sto.close();
                sourceStream.close();

                // read the temp copy back into memory
                InputStream bin = new BufferedInputStream(new FileInputStream(sourceTmp));
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                Utils.copy(bin, baos);
                bin.close();
                baos.close();

                // mark success so the finally block does not log a failure
                status = 0;
                return new ByteArrayInputStream(baos.toByteArray());
            }
            catch (IOException e)
            {
                log.error("Failed to copy text: ", e);
                throw e;
            }
            finally
            {
                if (!sourceTmp.delete())
                {
                    log.error("Unable to delete temporary file");
                }
                if (status != 0)
                {
                    log.error("Text copy failed, returns=" + status + ", file=" + sourceTmp);
                }
            }
        }
    }

Thanks.

Tom Autry
Coffing Corporation
3136 Presidential Drive
Fairborn, Ohio 45324
Office: 937-458-6100
Cell: 937-361-4680
Email: tom.au...@coffingco.com

-----Original Message-----
From: Tom Autry [mailto:tom.au...@coffingco.com]
Sent: Tuesday, September 18, 2012 10:18 AM
To: DSpace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Filter-media on TEXT files

Tim,

Thanks for the information; I had come across it as well. I forgot to mention that we are using 1.8.2. We were using the HTML extractor, but it runs into Java heap issues on larger TEXT documents and then the FilterMedia just hangs. This is what started this process in the first place.

Here is the log message for the error. The file is 29MB, which isn't large for a file but is large for a TEXT file. It took several hours of processing (or failing to process) before it failed with an out-of-memory error.
Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at javax.swing.text.DefaultStyledDocument.create(DefaultStyledDocument.java:143)
        at javax.swing.text.html.HTMLDocument.create(HTMLDocument.java:472)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flushBuffer(HTMLDocument.java:3719)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flush(HTMLDocument.java:2523)
        at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:263)
        at javax.swing.text.DefaultEditorKit.read(DefaultEditorKit.java:149)
        at org.dspace.app.mediafilter.HTMLFilter.getDestinationStream(HTMLFilter.java:71)
        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:750)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:574)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:511)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:479)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:457)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCommunity(MediaFilterManager.java:441)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:347)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:183)

I will look into creating a MediaFilter extension that does nothing more than copy the ORIGINAL text document to the TEXT bundle for searching purposes.

Thanks for all the information and help; it is greatly appreciated.

Tom Autry

-----Original Message-----
From: Tim Donohue [mailto:tdono...@duraspace.org]
Sent: Tuesday, September 18, 2012 9:58 AM
To: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Filter-media on TEXT files

Currently, only the TEXT bundle is indexed. See:
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/search/DSIndexer.java#L1244

Obviously the ideal scenario here is to also index plain text files directly, but it doesn't look like it works that way.

HOWEVER, it is worth noting that plain text files should have their text "extracted" by the HTMLFilter. See the dspace.cfg default settings:
https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L410

(Notice that the HTMLFilter is configured to run for HTML format and Text format)

So, currently, Text files should be indexed...but the full text of a plain text file is first duplicated to the TEXT bundle before indexing. (Not ideal, but it should work)

Does that make sense?

- Tim

On 9/18/2012 8:50 AM, helix84 wrote:
> On Tue, Sep 18, 2012 at 3:46 PM, Mark H. Wood <mw...@iupui.edu> wrote:
>> I don't understand: why would there be any need to extract plain
>> text from a bitstream that's already plain text? Just index it. The
>> point of text extraction is to create a plain-text bitstream for the
>> indexer to digest.
>
> Mark, does the indexer index text from plain text files in the ORIGINAL
> bundle?
>
> Regards,
> ~~helix84
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond.
> Discussions will include endpoint security, mobile security and the
> latest in malware threats.
> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>

This e-mail message and any attachments may contain legally privileged, confidential or proprietary information. If you are not the intended recipient(s), or the employee or agent responsible for delivery of this message to the intended recipient(s), you are hereby notified that any dissemination, distribution or copying of this e-mail message is strictly prohibited. If you have received this message in error, please immediately notify the sender and delete this e-mail message from your computer. Any views expressed in this message are those of the individual sender and may not necessarily reflect the views of the company.