Hi Martin,
Edgar Poce wrote:
Martin Chalupka wrote:
What is the best practice for managing searchable binary content (like Word or PDF documents) in Jackrabbit?
I am thinking about extracting the text with tools like Apache Jakarta POI and writing it as text content to the repository, with some structure like
would that be the right way?
Apparently, indexing binary values with known mime types is in the todo list.
quote from o.a.j.core.search.lucene.NodeIndexer:
"todo add support for indexing of nt:resource. e.g. when mime type is text"
I recently added support for custom text filter implementations. See org.apache.jackrabbit.core.query.TextFilterService and the interface TextFilter in the same package. The class documentation of the service class describes how you can write your own TextFilter implementation. As an example I have implemented a simple filter that knows how to extract text from a resource with mime-type text/plain ;) see class TextPlainTextFilter (same package as the others).
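Just to give an idea, here is a rough sketch of what a custom filter could look like. The method names, parameters and the Map-of-Readers return value are assumptions based on the description above, and the class and package names are made up; please compare with the TextFilter interface and the TextPlainTextFilter example for the actual contract.

    package com.example.jackrabbit; // hypothetical package

    import java.io.Reader;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import javax.jcr.RepositoryException;

    import org.apache.jackrabbit.core.query.TextFilter;
    import org.apache.jackrabbit.core.state.PropertyState;

    // Hypothetical filter for application/pdf resources (a sketch, not the
    // actual implementation shipped with Jackrabbit).
    public class PdfTextFilter implements TextFilter {

        public boolean canFilter(String mimeType) {
            // only handle PDF resources
            return "application/pdf".equalsIgnoreCase(mimeType);
        }

        public Map doFilter(PropertyState data, String encoding) throws RepositoryException {
            // extract the plain text from the binary jcr:data value here,
            // e.g. with a PDF library; omitted in this sketch
            String text = extractText(data, encoding);

            // hand the extracted text back to the indexer as a Reader; the
            // exact field name to use is shown in the TextPlainTextFilter example
            Map result = new HashMap();
            result.put("FULLTEXT", (Reader) new StringReader(text));
            return result;
        }

        private String extractText(PropertyState data, String encoding) {
            return ""; // placeholder for the real extraction code
        }
    }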
I think that a configurable way to map text extractors to mime types would be useful. Mime types other than text/plain could be supported. WDYT?
The extension mechanism in place for the text filter functionality does not need configuration. The jar file with the filter class just needs to be on the classpath and declared in META-INF/services/org.apache.jackrabbit.core.query.TextFilterService inside that jar.
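For example, with the hypothetical filter class from the sketch above, the jar would contain a text file

    META-INF/services/org.apache.jackrabbit.core.query.TextFilterService

listing the implementation classes, typically one fully qualified class name per line:

    com.example.jackrabbit.PdfTextFilter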
Please note that text filters are currently only triggered for nt:resource nodes (or nodes of subtypes of nt:resource). That means your own node type must either use an nt:resource (or one of its subtypes) child node or extend it.
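To illustrate the structure that triggers the filters, a minimal snippet using the plain JCR API (node and file names are only placeholders; imports: javax.jcr.Node, java.io.FileInputStream, java.util.Calendar):

    // assuming you already have a session and a parent node
    Node file = parent.addNode("report.txt", "nt:file");
    Node content = file.addNode("jcr:content", "nt:resource");
    content.setProperty("jcr:mimeType", "text/plain");
    content.setProperty("jcr:encoding", "UTF-8");
    content.setProperty("jcr:data", new FileInputStream("report.txt"));
    content.setProperty("jcr:lastModified", Calendar.getInstance());
    parent.save();

The jcr:content child of nt:file is an nt:resource, so the text/plain filter will pick it up when the node gets indexed.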
hope this helps
regards marcel
