Re: Index MSOffice Documents

Rainer Klute Mon, 28 Jun 2004 08:25:17 -0700

On Fri, 25.06.2004 at 14:42, Sergiu Gordea wrote:
> We want our search to be able to index MSOffice documents, so I was
> looking for ways to extract the text from these documents. I found
> some examples based on the POI library (http://jakarta.apache.org/poi)
> and adapted them to our needs.
> I think the extraction of text from XLS files is reliable (the POI
> development community did a great job with the package that works
> with XLS files). The examples that extract text from DOC and PPT
> files are not very general. I think they have problems with
> documents written in special charsets, but they work fine on the
> documents I use. I hope someone with more experience than I have
> will improve this and provide better source code.


Hm, in PPTConverterImpl.java you try to create a property set from the
stream you have encountered:

PropertySetFactory.create(event.getStream());

However, since you have already read bytes from the stream, this attempt
will always fail with an HPSFException. Second, even if creating the
property set had succeeded, you don't assign the PropertySet instance
created by PropertySetFactory.create() to a variable and thus could not
deal with it any further. Third, you don't even try to read the
properties.
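
To illustrate the first point without pulling in POI at all, here is a
minimal sketch using only java.io. The two-byte signature and the names
ConsumedStreamDemo, startsWithSignature and readAll are made up for this
example; startsWithSignature merely stands in for any parser that, like
PropertySetFactory.create(), expects to start at the first byte of the
stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ConsumedStreamDemo {

    // Hypothetical signature, chosen for this demo only.
    static final int[] SIGNATURE = {0xFE, 0xFF};

    /** Stand-in for a parser that must see the stream from byte 0. */
    static boolean startsWithSignature(InputStream in) throws IOException {
        for (int expected : SIGNATURE) {
            if (in.read() != expected) {
                return false;
            }
        }
        return true;
    }

    /** Copies the whole stream into memory so it can be parsed repeatedly. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x00, 0x41, 0x42};

        // Case 1: bytes were read before the parser got the stream,
        // so the parser no longer sees the signature.
        InputStream consumed = new ByteArrayInputStream(data);
        consumed.read(new byte[4]);
        System.out.println("already consumed: " + startsWithSignature(consumed));

        // Case 2: buffer the content once, then hand every consumer
        // its own fresh stream over the buffered bytes.
        byte[] copy = readAll(new ByteArrayInputStream(data));
        System.out.println("fresh stream:     "
                + startsWithSignature(new ByteArrayInputStream(copy)));
    }
}
```

Whether buffering is the right fix for your converter depends on how
large the streams are; simply not touching the stream before handing it
to the property set code would do as well.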

I suggest either dropping the HPSF code fragments from your code or
reading and indexing the properties. The latter might provide value of
its own.

Best regards
Rainer Klute

                           Rainer Klute IT-Consulting GmbH
  Dipl.-Inform.
  Rainer Klute             E-Mail:  [EMAIL PROTECTED]
  Körner Grund 24          Telefon: +49 172 2324824
D-44143 Dortmund           Telefax: +49 231 5349423


