[jira] Commented: (JCR-415) Enhance indexing of binary content

Marcel Reutegger (JIRA) Tue, 11 Jul 2006 05:27:28 -0700

    [ http://issues.apache.org/jira/browse/JCR-415?page=comments#action_12420293 ]

Marcel Reutegger commented on JCR-415:
--------------------------------------

Jukka wrote:
> I think it would make more design sense to try to postpone the creation of
> the Document instances instead of delaying text extraction. But I'm not too
> familiar with the details, so I'm OK with adding lazy reading to the mix. In
> any case I think it's best to layer the lazy reading on top of the
> TextExtractor interface instead of below it. A utility class like the
> following could achieve this as long as the given InputStream remains valid
> until the document has been read.

Yes, you are right. I thought I could get away with the dirty solution ;)
While going through your patch I was actually also thinking about a design
that should create the document only when it is really added to the index.
For now we can maybe use the TextExtractorReader you proposed and then in a
next step change the design to create the Document in a later stage of the
indexing process.
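
For illustration, the lazy-reading idea discussed above could be sketched
roughly as follows. This is not the TextExtractorReader from the patch: the
class name LazyExtractingReader is hypothetical, and the TextExtractor
interface is redeclared locally with an assumed
extractText(InputStream, String, String) signature. The wrapper layers on top
of the extractor and only calls it on the first read, so it relies on the
given InputStream remaining valid until then.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

/** Local stand-in for the TextExtractor interface (signature is an assumption). */
interface TextExtractor {
    Reader extractText(InputStream stream, String type, String encoding)
            throws IOException;
}

/** Hypothetical Reader that defers text extraction until the first read. */
class LazyExtractingReader extends Reader {
    private final TextExtractor extractor;
    private final InputStream stream;
    private final String type;
    private final String encoding;
    private Reader delegate; // created on first read, null until then

    LazyExtractingReader(TextExtractor extractor, InputStream stream,
                         String type, String encoding) {
        this.extractor = extractor;
        this.stream = stream;
        this.type = type;
        this.encoding = encoding;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (delegate == null) {
            // The potentially expensive extraction is triggered only here.
            delegate = extractor.extractText(stream, type, encoding);
        }
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        if (delegate != null) {
            delegate.close();
        } else {
            // Never read: release the underlying stream without extracting.
            stream.close();
        }
    }
}

class LazyExtractionDemo {
    public static void main(String[] args) throws IOException {
        // Trivial extractor for illustration: decodes the bytes as text.
        TextExtractor plain = (in, type, enc) -> new InputStreamReader(in, enc);
        InputStream data =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        try (Reader reader =
                     new LazyExtractingReader(plain, data, "text/plain", "UTF-8")) {
            char[] buf = new char[16];
            int n = reader.read(buf, 0, buf.length); // extraction happens here
            System.out.println(new String(buf, 0, n));
        }
    }
}
```

If the reader is closed without ever being read, the extractor is never
invoked, which is the case the issue description calls out.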

> Enhance indexing of binary content
> ----------------------------------
>
>          Key: JCR-415
>          URL: http://issues.apache.org/jira/browse/JCR-415
>      Project: Jackrabbit
>         Type: Improvement

>   Components: indexing
>     Versions: 1.0, 1.0.1, 0.9
>     Reporter: Marcel Reutegger
>     Priority: Minor
>      Fix For: 1.1
>  Attachments: jackrabbit-extractor-r420472.patch, 
> jackrabbit-query-r420472.patch, 
> org.apache.jackrabbit.core.query-extractor.jpg, 
> org.apache.jackrabbit.core.query.lucene-extractor.jpg, 
> org.apache.jackrabbit.extractor.jpg
>
> Indexing of binary content should be enhanced to allow either configuring
> which fields are indexed or better support for custom NodeIndexer
> implementations.
> The current design has a couple of flaws that should be addressed at the
> same time:
> - Reader instances are requested from the text filters even though the
>   reader might never be used
> - Only jcr:data properties of nt:resource nodes are fulltext indexed
> - It is up to the text filter implementation to decide the Lucene field
>   name for the text representation; this responsibility should be moved to
>   the NodeIndexer. A text filter should only provide a Reader instance.
> With those changes a custom NodeIndexer can then decide if a binary property
> has one or more representations in the index.

