On Tue, Sep 03, 2002 at 07:20:42PM +0530, Rajesh Parekh wrote:
> Hi,
>
> I have a requirement to convert hundreds of unstructured documents in
> WORD/PDF/TXT/EMAIL formats into a structured repository of XML Metadata
> of the document and the documents itself.
>
> I need to parse each of these documents and extract the relevant
> information to build a XML metadata document for each document.
>
> The XML structured metadata of the underlying document will contain
> fields like Keywords, Category, Doc Name, Author etc.
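
To make the target concrete, here is a minimal sketch of building one such
record from an rfc822 message, using only the Python standard library. The
element names, the function name email_to_metadata_xml and the default
category value are illustrative assumptions, not an agreed schema; Category
in particular is not something a mail header gives you, so it is passed in.

```python
from email import message_from_string
from xml.etree import ElementTree as ET


def email_to_metadata_xml(raw_message, doc_name, category="uncategorised"):
    """Build a small XML metadata record from an RFC 822 message string."""
    msg = message_from_string(raw_message)

    # Element names below are hypothetical; adapt them to your own schema.
    root = ET.Element("document")
    ET.SubElement(root, "docname").text = doc_name
    ET.SubElement(root, "author").text = msg.get("From", "")
    ET.SubElement(root, "title").text = msg.get("Subject", "")
    ET.SubElement(root, "category").text = category

    # The optional "Keywords" header holds a comma-separated list.
    keywords = ET.SubElement(root, "keywords")
    for word in msg.get("Keywords", "").split(","):
        if word.strip():
            ET.SubElement(keywords, "keyword").text = word.strip()

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "From: A. Author <author@example.com>\n"
        "Subject: Quarterly report\n"
        "Keywords: xml, metadata\n"
        "\n"
        "Body text goes here.\n"
    )
    print(email_to_metadata_xml(sample, "msg001.eml"))
```

The other formats would need their own extraction step to produce the same
fields before a record like this can be written.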

The simpler metadata could be extracted with regular tools:

email (rfc822): 'formail' and 'procmail' are all you'll ever need.
txt:            perl
pdf:            pdf2html
doc:            mswordview (or a .vbs?)

Given text files and optionally some basic metadata (Dublin Core marked-up
HTML, for example), there is software that tries to infer additional
metadata by analysing the text with some cunning algorithm:

http://www.topic.com.au/products/klarity.html

Not open source, but cool software anyway. A site using it:

http://www.womens.gateway.nsw.gov.au/

(Incidentally, that site's search backend uses Cocoon 1's LDAP taglib and
XSPs to integrate results.)

> Is it possible to use Cocoon and or POI to do this. And if yes how to
> use Cocoon to do the extraction.
>
> I am new to Cocoon, and trying to understand the world of
> transformers/generators etc.
>
> Also could I use Lucene to index the XML documents and build a search
> engine around it.
>
> I would like to know about the possible ways to do this.

Cocoon is a publishing framework. The jobs of metadata extraction and
searching are outside its domain, so perhaps start with those first.

--Jeff

> regards
>
> rajesh.