Just an FYI: I created and implemented the following class, along with the
needed changes in dspace.cfg, and now the text file *extractor* is working.  It
just copies the input to another stream, following the existing FormatFilter
examples and borrowing code from XPDF2Text.  Hopefully someone else will find
this useful as well.  I haven't uploaded any code to DSpace before, so if this
is something that would be useful, let me know.

dspace.cfg updates:

Add to plugin.named.org.dspace.app.mediafilter.FormatFilter:
                org.dspace.app.mediafilter.TextFilter = Text Extractor, \

Add under "configure each filter's input format" section:

filter.org.dspace.app.mediafilter.TextFilter.inputFormats = Text

Added TextFilter.java to mediafilter directory and built/deployed:

package org.dspace.app.mediafilter;

import java.io.FileInputStream;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.log4j.Logger;
import org.dspace.core.Utils;

/**
 *
 * @author Tom Autry
 */
public class TextFilter extends MediaFilter
{
    private static Logger log = Logger.getLogger(TextFilter.class);

    public String getFilteredName(String oldFilename)
    {
        return oldFilename + ".txt";
    }

    /**
     * @return String bundle name
     *
     */
    public String getBundleName()
    {
        return "TEXT";
    }

    /**
     * @return String bitstreamformat
     */
    public String getFormatString()
    {
        return "Text";
    }

    /**
     * @return String description
     */
    public String getDescription()
    {
        return "Extracted text";
    }


    /**
     * @param sourceStream
     *            source input stream
     *
     * @return InputStream the resulting input stream
     */
    public InputStream getDestinationStream(InputStream sourceStream)
            throws Exception
    {

        File sourceTmp = File.createTempFile("DSfilt",".txt");
        sourceTmp.deleteOnExit();  // extra insurance, we'll delete it here.
        int status = -1;
        try
        {
            // make a local temp copy of the source text file
            // (approach borrowed from the PDF filter, which needs a file
            // for random access; a plain copy does not strictly need one).
            // XXX fixme could optimize if we ever get an interface to grab asset *files*
            OutputStream sto = new FileOutputStream(sourceTmp);
            Utils.copy(sourceStream, sto);
            sto.close();
            sourceStream.close();

            // read the temp copy back through a BufferedInputStream
            InputStream stdout = new FileInputStream(sourceTmp);
            BufferedInputStream bin = new BufferedInputStream(stdout);

            // copy the buffered contents into memory
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            Utils.copy(bin, baos);
            bin.close();
            baos.close();

            status = 0;  // copy completed; otherwise the finally block logs a failure
            return new ByteArrayInputStream(baos.toByteArray());
        }
        catch (IOException e)
        {
            log.error("Failed in text copy: ", e);
            throw e;
        }
        finally
        {
            if (!sourceTmp.delete())
            {
                log.error("Unable to delete temporary file");
            }
            if (status != 0)
            {
                log.error("Text copy failed, returns=" + status + ", file=" + sourceTmp);
            }
        }

    }

}
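
For anyone adapting this: the temp file is probably not needed for plain text,
since nothing here requires random access to the source.  Below is a minimal
sketch of the same copy done purely in memory.  It uses only java.io so it can
be tried outside DSpace, and the names InMemoryCopy and copyToMemory are made
up for illustration; they are not part of DSpace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper, not part of DSpace: copies a stream into memory. */
public class InMemoryCopy
{
    /**
     * Reads sourceStream to the end in fixed-size chunks and returns a new
     * stream over the copied bytes.  The source stream is closed afterwards.
     */
    public static InputStream copyToMemory(InputStream sourceStream)
            throws IOException
    {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try
        {
            int bytesRead;
            while ((bytesRead = sourceStream.read(buffer)) != -1)
            {
                baos.write(buffer, 0, bytesRead);
            }
        }
        finally
        {
            sourceStream.close();
        }
        return new ByteArrayInputStream(baos.toByteArray());
    }

    public static void main(String[] args) throws IOException
    {
        byte[] sample = "plain text sample".getBytes("UTF-8");
        InputStream copy = copyToMemory(new ByteArrayInputStream(sample));
        System.out.println("copied bytes available: " + copy.available());
    }
}
```

In a filter like the one above, getDestinationStream could presumably just
return copyToMemory(sourceStream).  Note that the copied file is still held in
memory as a byte array, so heap use still grows with the file size.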

Thanks.

Tom Autry
Coffing Corporation
3136 Presidential Drive
Fairborn, Ohio 45324
Office: 937-458-6100
Cell: 937-361-4680
Email: tom.au...@coffingco.com


-----Original Message-----
From: Tom Autry [mailto:tom.au...@coffingco.com]
Sent: Tuesday, September 18, 2012 10:18 AM
To: DSpace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Filter-media on TEXT files

Tim,

Thanks for the information; I had also come across it.  I forgot to mention
that we are using 1.8.2.

We were using the HTML extractor, but it runs into Java heap issues on larger
TEXT documents and then FilterMedia just hangs.  This is what initially
started this process.  Here is the log message for the error.  The file is
29MB, which isn't large for a file but is large for a TEXT file.  It actually
took several hours of processing (or failure to process) before it failed with
an out-of-memory error.

Exception: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2894)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:532)
        at java.lang.StringBuffer.append(StringBuffer.java:323)
        at javax.swing.text.DefaultStyledDocument.create(DefaultStyledDocument.java:143)
        at javax.swing.text.html.HTMLDocument.create(HTMLDocument.java:472)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flushBuffer(HTMLDocument.java:3719)
        at javax.swing.text.html.HTMLDocument$HTMLReader.flush(HTMLDocument.java:2523)
        at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:263)
        at javax.swing.text.DefaultEditorKit.read(DefaultEditorKit.java:149)
        at org.dspace.app.mediafilter.HTMLFilter.getDestinationStream(HTMLFilter.java:71)
        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:750)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:574)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:511)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:479)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:457)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCommunity(MediaFilterManager.java:441)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:347)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:183)
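
The top frames of this trace (Arrays.copyOf called from
AbstractStringBuilder.expandCapacity) are the array behind a StringBuffer being
regrown: when an append does not fit, a larger array is allocated and the old
contents are copied over, so both arrays are briefly live at once.  Below is a
small sketch of that behaviour using a plain StringBuilder, which shares the
same base class.  The class name BufferGrowthDemo is made up for illustration,
and the exact capacities depend on the JDK version.

```java
public class BufferGrowthDemo
{
    /**
     * Appends count characters one at a time and returns how many times
     * the builder's internal capacity changed along the way.
     */
    public static int countGrowths(int count)
    {
        StringBuilder sb = new StringBuilder();
        int growths = 0;
        int lastCapacity = sb.capacity();
        for (int i = 0; i < count; i++)
        {
            sb.append('x');
            if (sb.capacity() != lastCapacity)
            {
                // the backing array was reallocated and copied
                growths++;
                lastCapacity = sb.capacity();
            }
        }
        return growths;
    }

    public static void main(String[] args)
    {
        System.out.println("growths for 1,000,000 chars: "
                + countGrowths(1000000));
    }
}
```

This only illustrates the regrowth pattern seen in the trace; it does not
reproduce what HTMLFilter or the Swing document classes do beyond that.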

I will look into creating a MediaFilter extension that does nothing more than
copy the ORIGINAL text document to the TEXT bundle for searching purposes.


Thanks for all the information and help as it is greatly appreciated.

Tom Autry


-----Original Message-----
From: Tim Donohue [mailto:tdono...@duraspace.org]
Sent: Tuesday, September 18, 2012 9:58 AM
To: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] Filter-media on TEXT files

Currently, only the TEXT bundle is indexed.

See:
https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/search/DSIndexer.java#L1244

Obviously the ideal scenario here is to also index plain text files directly, 
but it doesn't look like it works that way.

HOWEVER, it is worth noting that plain text files should have their text 
"extracted" by the HTMLFilter.  See the dspace.cfg default settings:
https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L410

(Notice that the HTMLFilter is configured to run for HTML format and Text 
format)

So, currently, Text files should be indexed...but the full text of a plain text 
file is first duplicated to the TEXT bundle before indexing.
(Not ideal, but it should work)

Does that make sense?

- Tim

On 9/18/2012 8:50 AM, helix84 wrote:
> On Tue, Sep 18, 2012 at 3:46 PM, Mark H. Wood <mw...@iupui.edu> wrote:
>> I don't understand:  why would there be any need to extract plain
>> text from a bitstream that's already plain text?  Just index it.  The
>> point of text extraction is to create a plain-text bitstream for the
>> indexer to digest.
>
> Mark, does the indexer index text from plain text files in the ORIGINAL 
> bundle?
>
> Regards,
> ~~helix84
>
> ----------------------------------------------------------------------
> --------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond.
> Discussions will include endpoint security, mobile security and the
> latest in malware threats.
> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>

This e-mail message and any attachments may contain legally privileged, 
confidential or proprietary information. If you are not the intended 
recipient(s),or the employee or agent responsible for delivery of this message 
to the intended recipient(s), you are hereby notified that any dissemination, 
distribution or copying of this e-mail message is strictly prohibited. If you 
have received this message in error, please immediately notify the sender and 
delete this e-mail message from your computer. Any views expressed in this 
message are those of the individual sender and may not necessarily reflect the 
views of the company.
