[Dspace-tech] OpenOffice.org MediaFilter - WAS: DSpace not indexing MS Powerpoint files?

Tim Donohue Thu, 01 Feb 2007 07:46:36 -0800

I definitely appreciate the excitement around the "OpenOffice.org 
MediaFilter" (both here and at OR2007) that I've built for DSpace.
The basis of this custom MediaFilter is a small script I've written 
which uses the OpenOffice.org API (along with a local installation of 
OpenOffice.org software) to automate batch format conversions.

One thing to note, though:
Unfortunately, OpenOffice.org cannot yet convert PowerPoint directly to 
plain text (for text extraction).  Strangely enough, it can convert 
PowerPoint to either HTML or PDF, and you can then extract the text from 
either of those formats for full-text indexing.
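
To illustrate the second half of that route (HTML to plain text), here is 
a minimal sketch in Python. It assumes the HTML has already been produced 
by an OpenOffice.org conversion. The names SlideTextExtractor and 
html_to_text are hypothetical and are not taken from the MediaFilter code; 
this only shows the general idea, using the standard library's HTML parser.

```python
from html.parser import HTMLParser


class SlideTextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # one entry per non-empty run of text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace so the indexer sees clean text.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    parser = SlideTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string is what would be handed to the full-text indexer. 
Going through PDF instead would need a separate PDF text extractor, which 
this sketch does not cover.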

For those waiting on the code to be available for the OpenOffice.org 
MediaFilter, as mentioned at OR2007 it will be released completely open 
source.   Unfortunately, since others outside of the DSpace community 
have also expressed interest, the non-DSpace-specific code will likely 
be released under its own open source license.  This makes things a 
little complex, since technically UIUC "owns" this code, and I'm going 
to have to jump through the necessary hoops to make it freely available 
to all. :)

I'll also be posting more background details on this work to the DSpace 
Wiki in the next few days (including a link to my OR2007 poster, once I 
have a free moment to submit it to our IR).  Hopefully I can also get 
the code up there in the next week or two.

I'll keep everyone posted!  In the meantime, feel free to contact me if 
you want a version to "play with" :)

- Tim

-- 

========================================
Tim Donohue
Research Programmer, Illinois Digital Environment for
Access to Learning and Scholarship (IDEALS)
135 Grainger Engineering Library
University of Illinois at Urbana-Champaign

email: [EMAIL PROTECTED]
web:   http://ideals.uiuc.edu
phone: (217) 333-4648
fax:   (217) 244-7764
========================================

Mark Diggory wrote:
> Tim Donohue had a very elegant project/poster on attaching a 
> MediaFilter to an "Open Office" service at OR2007.
> 
> -Mark
> 
> On Feb 1, 2007, at 8:38 AM, Dorothea Salo wrote:
> 
>> Scott Yeadon wrote:
>>> Pan,
>>>
>>> You'll need to write your own media filter class to handle the
>>> extraction of text from PowerPoint files as ppt text extraction isn't
>>> currently supported by the default set of media filters. Hopefully
>>> someone may have already done this and will share, but if not you'll
>>> have to write your own using OpenOffice or some other means.
>>
>>     A quick-and-dirty method might be to save the PPT as a PDF and
>> ingest the PDF along with it.
>>
>> Dorothea
>>
>> --Dorothea Salo, Digital Repository Services Librarian
>> (703)993-3742     [EMAIL PROTECTED]     AIM: gmumars
>> MSN 2FL, Fenwick Library
>> George Mason University
>> 4400 University Drive, Fairfax VA 22031
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job 
>> easier.
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache 
>> Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> DSpace-tech mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> 
> Mark R. Diggory
> ~~~~~~~~~~~~~
> DSpace Systems Manager
> MIT Libraries, Systems and Technology Services
> Massachusetts Institute of Technology
> 
> 
> 

