On Tue, Sep 03, 2002 at 07:20:42PM +0530, Rajesh Parekh wrote:
> Hi, 
>  
> I have a requirement to convert hundreds of unstructured documents in
> WORD/PDF/TXT/EMAIL formats into a structured repository of XML Metadata
> of the document and the documents itself. 
>  
> I need to parse each of these documents and extract the relevant
> information to build a XML metadata document for each document. 
>  
> The XML structured metadata of the underlying document will contain
> fields like Keywords, Category, Doc Name, Author etc. 

The simpler metadata could be extracted with regular tools:
 email (rfc822): 'formail' and 'procmail' are all you'll ever need.
 txt: perl
 pdf: pdf2html
 doc: mswordview (or a .vbs?)
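For the email case, a rough sketch of the same idea in Python rather than
formail (my assumption, not something the tools above give you). The element
names and the email_to_metadata_xml helper are made up for illustration,
loosely following the Keywords/Doc Name/Author fields you listed:

```python
from email import message_from_string
import xml.etree.ElementTree as ET


def email_to_metadata_xml(raw_message, doc_name):
    """Build a small XML metadata record from RFC 822 headers.

    The element names are hypothetical; adapt them to your own schema.
    """
    msg = message_from_string(raw_message)
    root = ET.Element("metadata")
    ET.SubElement(root, "docname").text = doc_name
    ET.SubElement(root, "author").text = msg.get("From", "")
    ET.SubElement(root, "title").text = msg.get("Subject", "")
    ET.SubElement(root, "date").text = msg.get("Date", "")
    # "Keywords" is a standard but rarely used header; split on commas.
    keywords = ET.SubElement(root, "keywords")
    for word in msg.get("Keywords", "").split(","):
        if word.strip():
            ET.SubElement(keywords, "keyword").text = word.strip()
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "From: a@example.org\nSubject: Hi\nKeywords: xml, cocoon\n\nbody\n"
    print(email_to_metadata_xml(sample, "m1.eml"))
```

Category is left out on purpose: it is not in the headers, so it has to come
from the body text or from a person.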

Given text files and optionally some basic metadata (dublin-core
marked-up HTML for example), there is software that tries to infer
additional metadata by analysing the text with some cunning algorithm:

http://www.topic.com.au/products/klarity.html

Not open source, but cool software anyway. A site using it:

http://www.womens.gateway.nsw.gov.au/

(incidentally that site's search backend uses Cocoon 1's LDAP taglib and
XSPs to integrate results)
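If you do end up with Dublin Core marked-up HTML as the intermediate format,
reading the basic fields back out is easy. A minimal sketch in Python,
assuming the common <meta name="DC.xxx" content="..."> convention; the class
and function names are my own, not part of any of the tools mentioned:

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name.startswith("dc."):
            # A field such as DC.Subject may repeat, so keep a list per field.
            self.fields.setdefault(name[3:], []).append(attr.get("content") or "")


def dublin_core_fields(html):
    """Return a dict mapping lower-cased DC field names to lists of values."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.fields


if __name__ == "__main__":
    page = '<head><meta name="DC.Creator" content="Jane"></head>'
    print(dublin_core_fields(page))
```

Anything beyond these explicit fields (keywords inferred from the text, for
instance) is where the analysis software comes in.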

> Is it possible to use Cocoon and or POI to do this.  And if yes how to
> use Cocoon to do the extraction.
>
> I am new to Cocoon, and trying to understand the world of
> transformers/generators etc. 
>  
> Also could I use Lucene to index the XML documents and build a search
> engine around it. 
>  
> I would like to know about the possible ways to do this. 

Cocoon is a publishing framework. Metadata extraction and searching are
outside its domain, so it is probably best to sort those out first.


--Jeff


> regards
>  
> rajesh. 
> 

