Following up on my previous email, I have just confirmed that swish-e
does index OpenOffice.org files. An article on it came out in the German
Linux-Magazin:
http://www.linux-magazin.de/Artikel/ausgabe/2004/04/swish/swish.html
I don't know German, but with Google Translate you can understand some of
what it says about configuring swish-e to index OpenOffice files:
http://www.google.com/translate?u=http%3A%2F%2Fwww.linux-magazin.de%2FArtikel%2Fausgabe%2F2004%2F04%2Fswish%2Fswish.html&langpair=de%7Cen&hl=en&ie=UTF8
==============================
Scanning OpenOffice documents
OpenOffice stores its files as Zip archives, in which the actual content
is always held in the XML file "content.xml". Indexing these documents
takes somewhat more effort. First, the filter has to apply to all kinds
of OpenOffice files, and therefore to several suffixes.
The "IndexContents" directive in listing 5 (line 3) assigns text
documents, spreadsheets and presentations to the XML format. The
"FileFilterMatch" instruction turns out to be somewhat trickier. It
defines the file types with the regular expression "/\.(sxw|sxc|sxi)$/i"
and assigns the unzip program to them, including the call parameters
"-p \"%p\" content.xml". unzip thus extracts the file "content.xml" and
writes it to standard output.
Listing 5: Filter for OpenOffice
01 # Open Office
02 FileFilterMatch "/usr/bin/unzip" "-p \"%p\" content.xml" /\.(sxw|sxc|sxi)$/i
03 IndexContents XML* .sxw .sxc .sxi
04 StoreDescription XML* <text:p>
A peculiarity here is the "StoreDescription" line. This directive is
actually meant for adding short description texts to the index, which
Swish-e displays with an extended search. Among other things, it
specifies the tag that contains the description. The length of the
description can optionally be limited as well. In principle this has
nothing to do with the normal indexing of an XML document. Practice
shows, however, that Swish-e indexes OpenOffice documents correctly only
if this option is given. Otherwise the parser frequently stops too early
and leaves a large part of the text unindexed.
==============================
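To see what the filter in listing 5 actually hands to the indexer, here is
a rough Python sketch of the same two steps: the suffix test from the
"FileFilterMatch" regular expression and the extraction of "content.xml"
that "unzip -p" performs. The function names and the in-memory sample
document are my own inventions for illustration; this is not part of
swish-e or of the article.

```python
import io
import re
import zipfile

# Same suffix test as the FileFilterMatch regex /\.(sxw|sxc|sxi)$/i
OOO_SUFFIX = re.compile(r"\.(sxw|sxc|sxi)$", re.IGNORECASE)


def is_openoffice_file(path):
    """Return True if the file name ends in .sxw, .sxc or .sxi (any case)."""
    return OOO_SUFFIX.search(path) is not None


def extract_content_xml(source):
    """Return the raw bytes of content.xml, like `unzip -p FILE content.xml`.

    `source` may be a file path or a file-like object holding the archive.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml")


def make_sample_document():
    """Build a tiny in-memory Zip archive standing in for a real .sxw file.

    Hypothetical test data only: a real document contains more members
    and a much larger content.xml.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "content.xml",
            "<office:document-content>"
            "<text:p>Hello swish-e</text:p>"
            "</office:document-content>",
        )
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(is_openoffice_file("report.SXW"))
    print(extract_content_xml(make_sample_document()).decode("utf-8"))
```

With a real document you would pass its path to extract_content_xml()
instead of the sample buffer. The output is the XML that swish-e then
parses, with the body text inside <text:p> elements.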
The advantage of swish-e is that you can use it to index everything on
your network: old MS Office files, PDFs, emails, HTML files, images,
etc. (read the article).