Dmitry Goldenberg wrote:
> Awesome stuff. A few questions: is your Excel extractor somehow
> better than POI's? And what do you see as the timeframe for adding
> WordPerfect support? Are you considering supporting any other sources
> such as MS Project, FrameMaker, etc.?

I just committed a WordPerfectExtractor ;)
It's based on code developed in-house at Aduna and it seems to work
quite well on my test collection of WordPerfect documents. Occasionally
words are split in the middle; I'm still looking into that.
The test set is biased toward older WordPerfect documents, though, so I'm
trying to get my hands on a recent copy of WordPerfect to see whether the
latest format is also supported and to create unit tests for it.
To interactively test the extractor(s) yourselves:
- check out Aperture from CVS (see
http://sourceforge.net/cvs/?group_id=150969)
- run "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) to see what MIME type
Aperture thinks it is and to execute the corresponding Extractor, if
available. The two tabs show the extracted full-text and an RDF dump of
the metadata. For WordPerfect, only full-text extraction is currently
supported.
Our ExcelExtractor is basically nothing more than glue code between POI
and the rest of our framework, meaning that an application using the
framework can request an Extractor implementation for
"application/vnd.ms-excel", feed it an InputStream and get the text and
metadata back.
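
As a rough illustration of that lookup-by-MIME-type idea, here is a small self-contained sketch. All names in it (ExtractorRegistry, Extractor, Extracted, PlainTextStandIn) are made up for this example and are not Aperture's actual API, and the stand-in extractor just decodes UTF-8 where a real one would call POI. It assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; this is NOT Aperture's actual API.
class ExtractorLookupSketch {

    /** Result of an extraction: the full text plus simple key/value metadata. */
    static final class Extracted {
        final String text;
        final Map<String, String> metadata;

        Extracted(String text, Map<String, String> metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    /** Turns a stream in one particular format into text and metadata. */
    interface Extractor {
        Extracted extract(InputStream in) throws IOException;
    }

    /** Maps MIME types to extractor implementations. */
    static final class ExtractorRegistry {
        private final Map<String, Extractor> byMimeType = new HashMap<>();

        void register(String mimeType, Extractor extractor) {
            byMimeType.put(mimeType, extractor);
        }

        /** Returns the extractor for the MIME type, or null if none is registered. */
        Extractor get(String mimeType) {
            return byMimeType.get(mimeType);
        }
    }

    /** Stand-in that decodes the stream as UTF-8; a real Excel extractor would delegate to POI. */
    static final class PlainTextStandIn implements Extractor {
        @Override
        public Extracted extract(InputStream in) throws IOException {
            byte[] bytes = in.readAllBytes();
            Map<String, String> metadata = new HashMap<>();
            metadata.put("byteCount", Integer.toString(bytes.length));
            return new Extracted(new String(bytes, StandardCharsets.UTF_8), metadata);
        }
    }

    /** Convenience wrapper: look up the extractor for a MIME type and run it on in-memory content. */
    static Extracted extract(ExtractorRegistry registry, String mimeType, byte[] content) {
        Extractor extractor = registry.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        try (InputStream in = new ByteArrayInputStream(content)) {
            return extractor.extract(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("application/vnd.ms-excel", new PlainTextStandIn());

        byte[] content = "Budget 2006".getBytes(StandardCharsets.UTF_8);
        Extracted result = extract(registry, "application/vnd.ms-excel", content);
        System.out.println(result.text);
        System.out.println(result.metadata);
    }
}
```

The point of the registry is that the calling application only needs to know the MIME type; which library does the actual parsing stays hidden behind the Extractor interface.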
The only advantage of our ExcelExtractor over direct use of POI is that,
when POI throws an Exception on a particular document, it falls back to a
heuristic string extraction algorithm which is often able to extract
full-text from a document with reasonable quality, i.e. good enough for
indexing.
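
To make the fallback idea concrete, here is a minimal sketch of the pattern: try the format-aware parser first and, if it throws, scan the raw bytes for runs of printable characters. This is only an assumption about what such a heuristic could look like (similar in spirit to the Unix "strings" tool), not the algorithm the ExcelExtractor actually uses. PrimaryParser is a hypothetical stand-in for the POI call, and MIN_RUN is an arbitrary threshold chosen for the example.

```java
// Hypothetical sketch of "parser first, heuristic on failure"; not the actual ExcelExtractor code.
class FallbackExtractionSketch {

    /** Stand-in for a format-aware parser such as POI; may throw on damaged input. */
    interface PrimaryParser {
        String parse(byte[] document) throws Exception;
    }

    /** Minimum length for a run of printable characters to count as text (arbitrary choice). */
    static final int MIN_RUN = 4;

    /** Collects runs of printable ASCII characters, separated by single spaces. */
    static String heuristicStrings(byte[] document) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : document) {
            char c = (char) (b & 0xFF);
            if (c >= 0x20 && c <= 0x7E) {
                run.append(c);
            } else {
                flush(run, out);
            }
        }
        flush(run, out);
        return out.toString();
    }

    /** Appends the current run to the output if it is long enough, then resets the run. */
    private static void flush(StringBuilder run, StringBuilder out) {
        if (run.length() >= MIN_RUN) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    /** Uses the primary parser and falls back to the heuristic only when it throws. */
    static String extract(byte[] document, PrimaryParser parser) {
        try {
            return parser.parse(document);
        } catch (Exception e) {
            return heuristicStrings(document);
        }
    }

    public static void main(String[] args) {
        byte[] damaged = {0x00, 0x01, 'B', 'u', 'd', 'g', 'e', 't', 0x00,
                          'Q', '1', 0x02, 'T', 'o', 't', 'a', 'l', 0x7F};
        PrimaryParser alwaysFails = doc -> {
            throw new IllegalStateException("corrupt workbook");
        };
        System.out.println(extract(damaged, alwaysFails)); // prints "Budget Total"
    }
}
```

The short run "Q1" is dropped because it is below MIN_RUN, which shows the trade-off: a higher threshold filters out binary noise but also loses short tokens, so the result is fine for indexing but not for faithful display.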
We are certainly considering supporting more formats. Which ones we will
work on depends on a number of factors, e.g. availability of open source
libs for that format, complexity of the file format (we did WordPerfect
by ourselves), customer demand, code contributions from others, etc. In
any case, if you need support for format XYZ, you can always send me
some example files and I'll take a look at how hard it is to add support
for it.
Chris