Dmitry Goldenberg wrote:
Awesome stuff. A few questions: is your Excel extractor somehow
better than POI's? And what do you see as the timeframe for adding
WordPerfect support? Are you considering supporting any other sources
such as MS Project, FrameMaker, etc.?

I just committed a WordPerfectExtractor ;)

It's based on code developed in-house at Aduna and it seems to work quite well on my test collection of WordPerfect documents. The only issue so far is that words are occasionally split in the middle; I'm still looking into that.

The test set is biased towards older WordPerfect documents, though. I'm trying to get my hands on a recent copy of WordPerfect to see whether the latest format is also supported and to create unit tests for it.

To interactively test the extractor(s) yourselves:

- check out Aperture from CVS (see http://sourceforge.net/cvs/?group_id=150969)
- do "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) to see what MIME type Aperture thinks it is and to execute the corresponding Extractor, if available. The two tabs show the extracted full-text and an RDF dump of the metadata. For WordPerfect, only full-text extraction is currently supported.

Our ExcelExtractor is basically nothing more than glue code between POI and the rest of our framework, meaning that an application using the framework can request an Extractor implementation for "application/vnd.ms-excel", feed it an InputStream and get the text and metadata back.
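To make the lookup-by-MIME-type flow a bit more concrete, here is a minimal sketch of the pattern. All names in it (ExtractorRegistry, TextExtractor, Extracted, PlainTextExtractor) are hypothetical and made up for illustration; they are not Aperture's actual interfaces. A trivial plain-text extractor stands in for the POI-backed one, which would be registered under "application/vnd.ms-excel" in the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ExtractorLookupSketch {

    /** Hypothetical result holder: full text plus simple key/value metadata. */
    static final class Extracted {
        final String text;
        final Map<String, String> metadata;

        Extracted(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Hypothetical extractor contract (not Aperture's real interface). */
    interface TextExtractor {
        Extracted extract(InputStream in) throws IOException;
    }

    /** Hypothetical registry that maps a MIME type to an extractor. */
    static final class ExtractorRegistry {
        private final Map<String, TextExtractor> byMimeType = new HashMap<>();

        void register(String mimeType, TextExtractor extractor) {
            byMimeType.put(mimeType, extractor);
        }

        /** Returns null when no extractor is registered for the MIME type. */
        TextExtractor get(String mimeType) {
            return byMimeType.get(mimeType);
        }
    }

    /** Stand-in extractor: decodes the stream as UTF-8 and counts characters. */
    static final class PlainTextExtractor implements TextExtractor {
        @Override
        public Extracted extract(InputStream in) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            Map<String, String> metadata = new HashMap<>();
            metadata.put("characterCount", String.valueOf(text.length()));
            return new Extracted(text, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("text/plain", new PlainTextExtractor());

        // The application only knows the MIME type and has an InputStream.
        TextExtractor extractor = registry.get("text/plain");
        InputStream in = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        Extracted result = extractor.extract(in);

        System.out.println(result.text);
        System.out.println(result.metadata);
    }
}
```

The point of the indirection is that the calling application never touches POI (or any other format library) directly; it only deals with a MIME type, a stream and the extracted result.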

The only advantage of our ExcelExtractor over direct use of POI is that, when POI throws an Exception on a particular document, the extractor falls back to a heuristic string extraction algorithm, which is often able to extract full-text of reasonable quality, i.e. good enough for indexing.
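As a rough illustration of this fallback idea, the sketch below tries a structured parser first and, if that throws, scans the raw bytes for runs of printable ASCII characters (similar in spirit to the Unix "strings" tool). This is an assumption about what such a heuristic could look like, not the algorithm the ExcelExtractor actually uses. PrimaryParser is a hypothetical stand-in for the POI-based code path and MIN_RUN is an arbitrary threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackExtractionSketch {

    /** Hypothetical stand-in for a structured parser such as the POI-based path. */
    interface PrimaryParser {
        String parse(byte[] document) throws Exception;
    }

    /** Assumed threshold: shorter runs are treated as binary noise. */
    static final int MIN_RUN = 4;

    /** Collects runs of printable ASCII of at least MIN_RUN characters. */
    static String extractPrintableRuns(byte[] data) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c <= 0x7E) {
                current.append((char) c);
            } else {
                if (current.length() >= MIN_RUN) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        if (current.length() >= MIN_RUN) {
            runs.add(current.toString());
        }
        return String.join(" ", runs);
    }

    /** Uses the structured parser when it works, the heuristic when it throws. */
    static String extractWithFallback(PrimaryParser parser, byte[] document) {
        try {
            return parser.parse(document);
        } catch (Exception e) {
            // Lower quality than a real parse, but usually still indexable.
            return extractPrintableRuns(document);
        }
    }

    public static void main(String[] args) {
        byte[] doc = {0x01, 'H', 'e', 'l', 'l', 'o', 0x00, 'a', 'b', 0x02,
                      'w', 'o', 'r', 'l', 'd'};
        PrimaryParser failing = d -> {
            throw new IllegalStateException("simulated parse failure");
        };
        System.out.println(extractWithFallback(failing, doc));
    }
}
```

A real implementation would also have to deal with text stored in 16-bit encodings, which a pure ASCII scan like this one misses.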

We are definitely considering supporting more formats. Which ones we will work on depends on a number of factors, e.g. availability of open source libs for that format, complexity of the file format (we did WordPerfect by ourselves), customer demand, code contributions from others, etc. In any case, if you need support for format XYZ, you can always send me some example files and I'll take a look at how hard it would be to add support for it.


Chris
--
