Hi Martin,
Edgar Poce wrote:
Martin Chalupka wrote:
What is the best practice for managing searchable binary content (like Word or PDF documents) in Jackrabbit?
I am thinking about extracting the text with tools like Apache Jakarta POI and writing it as text content to the repository, with some structure like
would that be the right way?
Apparently, indexing binary values with known mime types is in the todo list.
quote from o.a.j.core.search.lucene.NodeIndexer:
"todo add support for indexing of nt:resource. e.g. when mime type is text"
I recently added support for custom text filter implementations. See org.apache.jackrabbit.core.query.TextFilterService and the interface TextFilter in the same package. The class documentation of the service class describes how you can write your own TextFilter implementation. As an example I have implemented a simple filter that knows how to extract text from a resource with mime-type text/plain ;) see class TextPlainTextFilter (same package as the others).
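Just to give an idea, here is a rough sketch of what a custom filter could look like. The method names, parameters and the Map-of-Readers return value are assumptions based on the description above, and the class and package names are made up; please compare with the TextFilter interface and the TextPlainTextFilter example for the actual contract.

    package com.example.jackrabbit; // hypothetical package

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import javax.jcr.RepositoryException;

    import org.apache.jackrabbit.core.query.TextFilter;
    import org.apache.jackrabbit.core.state.PropertyState;

    // Hypothetical filter for application/pdf resources (a sketch, not the
    // actual implementation shipped with Jackrabbit).
    public class PdfTextFilter implements TextFilter {

        public boolean canFilter(String mimeType) {
            // only handle PDF resources
            return "application/pdf".equalsIgnoreCase(mimeType);
        }

        public Map doFilter(PropertyState data, String encoding) throws RepositoryException {
            // extract the plain text from the binary jcr:data value here,
            // e.g. with a PDF library; omitted in this sketch
            String text = extractText(data, encoding);

            // hand the extracted text back to the indexer as a Reader; the
            // exact field name to use is shown in the TextPlainTextFilter example
            Map result = new HashMap();
            result.put("FULLTEXT", (Reader) new StringReader(text));
            return result;
        }

        private String extractText(PropertyState data, String encoding) {
            return ""; // placeholder for the real extraction code
        }
    }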
I think that a configurable way to map text extractors to mime types would be useful. Mime types other than text/plain could be supported. WDYT?
The extension mechanism in place for the text filter functionality does not need configuration. The jar file with the filter class just needs to be on the classpath and declared in META-INF/services/org.apache.jackrabbit.core.query.TextFilterService inside that jar.
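For example, with the hypothetical filter class from the sketch above, the jar would contain a text file

    META-INF/services/org.apache.jackrabbit.core.query.TextFilterService

listing the implementation classes, typically one fully qualified class name per line:

    com.example.jackrabbit.PdfTextFilter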
Please note that text filters are currently only triggered for nt:resource nodes (or nodes of subtypes of nt:resource). That means your own node type must either use an nt:resource (or one of its subtypes) child node or extend it.
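To illustrate the structure that triggers the filters, a minimal snippet using the plain JCR API (node and file names are only placeholders; imports: javax.jcr.Node, java.io.FileInputStream, java.util.Calendar):

    // assuming you already have a session and a parent node
    Node file = parent.addNode("report.txt", "nt:file");
    Node content = file.addNode("jcr:content", "nt:resource");
    content.setProperty("jcr:mimeType", "text/plain");
    content.setProperty("jcr:encoding", "UTF-8");
    content.setProperty("jcr:data", new FileInputStream("report.txt"));
    content.setProperty("jcr:lastModified", Calendar.getInstance());
    parent.save();

The jcr:content child of nt:file is an nt:resource, so the text/plain filter will pick it up when the node gets indexed.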
hope this helps
regards marcel
