Hi,

On Tue, Nov 11, 2008 at 10:06 AM, Thomas Müller <[EMAIL PROTECTED]> wrote:
> It's an interesting use case, and probably quite common. It would be
> good if the text extraction would be run only once for each binary.
> However I'm not sure how this should be implemented... One solution is
> to extract the text in the data store, but that would be in the
> 'wrong' level.

An alternative would be to add an extra stream to binary InternalValues.
That stream (if present) would contain the result of text extraction on
the binary value and could then be used for indexing. In fact, last week
at ApacheCon I was discussing with the Lucene people a way to store the
analyzed token stream to further optimize the re-indexing case.
Apparently that should be possible with little effort.

The problem with this is that we'd need to move the text extraction
functionality down to the persistence or item state layer. The current
configuration mechanism isn't well suited to this, and things like using
the value of the jcr:mimeType property to guide text extraction might
become quite tricky. But I don't see any fundamental reason why those
issues could not be resolved.

BR,

Jukka Zitting
