In fact, the Apache Jakarta POI project already focuses on reading Microsoft format files. poi-scratchpad-3.0-alpha2 includes functions for parsing ppt files. We already use them in our project to parse Word, PowerPoint, and Excel files.

I have attached a sample PPTFilter.java, which you can put under the org.dspace.app.mediafilter folder. (You also need to download the poi-3.0-alpha2, poi-contrib-3.0-alpha2, and poi-scratchpad-3.0-alpha2 jar files from the POI web site and put them in the lib folder.)

Thanks

Guang

Pan Family wrote:
Hi,

I submitted a MS ppt file to my collection, but filter-media
does not want to index this ppt file.  I tried to shut down
the database (PostgreSQL) and restarted it, and ran
filter-media several times, but it did not help.  I made
sure that this ppt file is indeed in the collection by opening
it using View/Open.

I have no problem indexing MS Word, text, html, or pdf
files.  Do I need to do anything special for ppt files?

Thanks a lot!

-Pan



_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

package org.dspace.app.mediafilter;

import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.poi.hslf.extractor.PowerPointExtractor;
import org.dspace.app.mediafilter.MediaFilter;
import org.dspace.app.mediafilter.MediaFilterManager;

/**
 * Media filter that extracts text from PowerPoint (PPT) files.
 * 
 * @author Guang Huang
 *
 */
public class PPTFilter extends MediaFilter
{

    public String getBundleName()
    {
        return "TEXT";
    }

    public String getDescription()
    {
        return "Extracted text";
    }

    public InputStream getDestinationStream(InputStream source)
            throws Exception
    {
        // Note (Guang Huang): as far as I can tell, the PowerPoint
        // extractor does not need to be closed here; closing the input
        // stream <code>source</code> should take care of it.
        String extractedText = new PowerPointExtractor(source).getText();

        // if verbose flag is set, print out extracted text
        // to STDOUT
        if (MediaFilterManager.isVerbose)
        {
            System.out.println(extractedText);
        }

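        // generate an input stream with the extracted text
        // (getBytes() with no argument uses the platform default charset)
        byte[] textBytes = extractedText.getBytes();
        ByteArrayInputStream bais = new ByteArrayInputStream(textBytes);

        // Safe to return: the stream holds its own reference to the byte
        // array, so the data stays reachable after this method returns.
        return bais;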
    }

    public String getFilteredName(String sourceName)
    {
        return sourceName + ".txt";
    }

    public String getFormatString()
    {
        return "Text";
    }

}
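
On the question in the comment at the end of getDestinationStream: the small standalone sketch below may help show that returning a ByteArrayInputStream built inside a method is fine, because the stream keeps a reference to the byte array. It uses only the JDK, not POI or DSpace. TextStreamDemo, toStream and readAll are hypothetical names made up for this illustration, and UTF-8 is an assumption chosen so the round trip behaves the same on every platform (the filter above uses the platform default charset instead).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of DSpace or POI.
class TextStreamDemo {

    // Wraps the text in a stream. The stream holds a reference to the
    // byte array, so the array is still reachable after this method returns.
    static InputStream toStream(String text) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8); // charset is an assumption
        return new ByteArrayInputStream(textBytes);
    }

    // Reads the stream to the end and decodes it with the same charset.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The local byte array inside toStream is out of scope here,
        // but the stream can still be read in full.
        InputStream stream = toStream("Slide 1: Introduction");
        System.out.println(readAll(stream));
    }
}
```

The local variable going out of scope only removes one name for the array; the array itself lives as long as the returned stream refers to it.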