[
https://issues.apache.org/jira/browse/SOLR-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443409#comment-13443409
]
Greg Bowyer commented on SOLR-3763:
-----------------------------------
I guess my next step is to get caching working, I am not sure quite how to take
baby steps with this beyond getting to feature parity.
> Make solr use lucene filters directly
> -------------------------------------
>
> Key: SOLR-3763
> URL: https://issues.apache.org/jira/browse/SOLR-3763
> Project: Solr
> Issue Type: Improvement
> Affects Versions: 4.0, 4.1, 5.0
> Reporter: Greg Bowyer
> Assignee: Greg Bowyer
> Attachments: SOLR-3763-Make-solr-use-lucene-filters-directly.patch
>
>
> Presently solr uses bitsets, queries and collectors to implement the concept
> of filters. This has proven to be very powerful, but does come at the cost of
> introducing a large body of code into solr making it harder to optimise and
> maintain.
> Another issue here is that filters currently cache sub-optimally given the
> changes in lucene towards atomic readers.
> Rather than patch these issues, this is an attempt to rework the filters in
> solr to leverage the Filter subsystem from lucene as much as possible.
> In good time the aim is to get this to do the following:
> ∘ Handle setting up filter implementations that are able to cache correctly
> with reference to the AtomicReader that they are caching for, rather than for
> the entire index at large
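> A minimal, JDK-only sketch of the per-segment idea (the names and structure
> here are my own, not Lucene's CachingWrapperFilter API): the cache is keyed
> per (filter, segment), so each AtomicReader's bits are computed and stored
> independently, and unchanged segments survive a reader reopen:

```java
import java.util.BitSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy sketch (hypothetical, not the Lucene API): a filter that caches one
// bitset per segment rather than one bitset for the whole index.
class SegmentCachedFilter {
    // One cache entry per (this filter, segment) pair.
    private final Map<Object, BitSet> perSegmentCache = new IdentityHashMap<>();
    int computations = 0; // exposed only so the sketch can show cache behaviour

    // "segmentKey" stands in for an AtomicReader's core key; "matches" stands
    // in for whatever the filter actually computes over that segment's docs.
    BitSet bits(Object segmentKey, int maxDoc, IntPredicate matches) {
        BitSet cached = perSegmentCache.get(segmentKey);
        if (cached != null) return cached; // unchanged segments are reused as-is
        computations++;
        BitSet bits = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) bits.set(doc);
        }
        perSegmentCache.put(segmentKey, bits);
        return bits;
    }
}
```

> The point of caching at this granularity is that a reopen only invalidates
> the segments that actually changed, instead of throwing away the whole entry.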
> ∘ Get the post filters working; I am thinking that this can be done via
> Lucene's chained filter, with the "expensive" filters being put towards the
> end of the chain - this has different semantics internally to the original
> implementation but IMHO should have the same result for end users
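> The intended semantics can be sketched with plain JDK code (hypothetical
> names; this is not Lucene's ChainedFilter API): filters run in cost order,
> and each filter only sees documents that survived the cheaper filters before
> it, so the expensive ones inspect the fewest candidates:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

// Toy sketch of chained post-filter semantics: cheap filters first,
// expensive filters last, each narrowing the surviving doc set.
class FilterChain {
    static final class CostedFilter {
        final int cost;
        final IntPredicate matches;
        int calls = 0; // exposed only so the sketch can show evaluation counts
        CostedFilter(int cost, IntPredicate matches) { this.cost = cost; this.matches = matches; }
        boolean test(int doc) { calls++; return matches.test(doc); }
    }

    static BitSet run(int maxDoc, List<CostedFilter> filters) {
        List<CostedFilter> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingInt(f -> f.cost)); // expensive go last
        BitSet survivors = new BitSet(maxDoc);
        survivors.set(0, maxDoc);
        for (CostedFilter f : ordered) {
            for (int doc = survivors.nextSetBit(0); doc >= 0; doc = survivors.nextSetBit(doc + 1)) {
                if (!f.test(doc)) survivors.clear(doc);
            }
        }
        return survivors;
    }
}
```

> The final doc set is the same as intersecting everything up front; only the
> number of times the expensive filter is consulted changes.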
> ∘ Learn how to create filters that are potentially more efficient; at present
> Solr basically runs a simple query that gathers a DocSet of the documents
> that we want filtered; it would be interesting to make use of filter
> implementations that are in theory faster than query filters (for instance,
> there are filters that are able to query the FieldCache)
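> A JDK-only sketch of the FieldCache style of filtering (hypothetical names,
> not the Lucene FieldCacheTermsFilter API): the field's per-document values
> are uninverted once into an array, after which any filter on that field is a
> plain array lookup rather than a fresh term query against the index:

```java
import java.util.BitSet;

// Toy sketch: once the field is uninverted into valueByDoc (the "cache"),
// every term filter on that field avoids index seeks entirely and just
// scans the in-memory array.
class FieldCacheFilterSketch {
    final String[] valueByDoc; // one value per document, built once per reader

    FieldCacheFilterSketch(String[] valueByDoc) {
        this.valueByDoc = valueByDoc;
    }

    BitSet termFilter(String term) {
        BitSet bits = new BitSet(valueByDoc.length);
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            if (term.equals(valueByDoc[doc])) bits.set(doc);
        }
        return bits;
    }
}
```

> The up-front cost of building the array is paid once per reader; every
> subsequent filter on the field reuses it.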
> ∘ Learn how to decompose filters so that a complex filter query can be cached
> (potentially) as its constituent parts; for example the filter below
> currently needs love, care and feeding to ensure that the filter cache is not
> unduly stressed
> {code}
> 'category:(100) OR category:(200) OR category:(300)'
> {code}
> Really there is no reason not to express this in a cached form as
> {code}
> BooleanFilter(
> FilterClause(CachedFilter(TermFilter(Term("category", 100))), SHOULD),
> FilterClause(CachedFilter(TermFilter(Term("category", 200))), SHOULD),
> FilterClause(CachedFilter(TermFilter(Term("category", 300))), SHOULD)
> )
> {code}
> This would yield better cache usage, I think, as we can reuse DocSets across
> multiple queries as well as avoid issues when filters are presented in
> differing orders
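> The reuse claim can be sketched in plain JDK code (hypothetical names, not
> the BooleanFilter/CachedFilter API above): each term filter is cached
> individually and the OR is rebuilt as a union of the cached bitsets, so
> "100 OR 200" and "200 OR 100" hit exactly the same cache entries:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy sketch of decomposing an OR filter into per-term cached filters.
class DecomposedOrFilter {
    final Map<String, BitSet> termCache = new HashMap<>();
    final String[] categoryByDoc; // stand-in for the indexed field values
    int cacheMisses = 0;          // exposed only so the sketch can show reuse

    DecomposedOrFilter(String[] categoryByDoc) {
        this.categoryByDoc = categoryByDoc;
    }

    BitSet cachedTermBits(String term) {
        return termCache.computeIfAbsent(term, t -> {
            cacheMisses++; // computed at most once per distinct term
            BitSet bits = new BitSet(categoryByDoc.length);
            for (int doc = 0; doc < categoryByDoc.length; doc++) {
                if (t.equals(categoryByDoc[doc])) bits.set(doc);
            }
            return bits;
        });
    }

    BitSet or(String... terms) {
        BitSet result = new BitSet(categoryByDoc.length);
        for (String term : terms) {
            result.or(cachedTermBits(term)); // clause order is irrelevant
        }
        return result;
    }
}
```

> A later query such as "category:300 OR category:100" then only needs to
> compute the one clause it has not seen before.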
> ∘ Instead of end users providing costing we might (and this is a big might,
> FWIW) be able to create a sort of execution plan of filters, leveraging a
> combination of what the index is able to tell us as well as sampling and
> "educated guesswork"; in essence this is what some DBMS software does - for
> example, PostgreSQL uses a genetic algorithm that attempts to solve the
> travelling salesman problem for join planning - to great effect
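> One very small version of that guesswork can be sketched with JDK-only code
> (entirely hypothetical - nothing like this exists in Solr or Lucene): run
> each filter over a random sample of documents to estimate its selectivity,
> then order the chain so the most selective filter runs first:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.IntPredicate;

// Toy sketch of a sampling-based filter "planner": fewer sample hits means
// more selective, and more selective filters should run earlier.
class FilterPlanner {
    static List<IntPredicate> plan(int maxDoc, List<IntPredicate> filters, int sampleSize) {
        Random rnd = new Random(42); // fixed seed keeps the sketch deterministic
        Map<IntPredicate, Integer> sampleHits = new IdentityHashMap<>();
        for (IntPredicate f : filters) {
            int matched = 0;
            for (int i = 0; i < sampleSize; i++) {
                if (f.test(rnd.nextInt(maxDoc))) matched++;
            }
            sampleHits.put(f, matched);
        }
        List<IntPredicate> ordered = new ArrayList<>(filters);
        ordered.sort(Comparator.comparingInt(sampleHits::get)); // fewest hits first
        return ordered;
    }
}
```

> A real planner would also have to weigh per-document cost, not just
> selectivity, which is where the "big might" comes in.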
> ∘ I am sure I will probably come up with other ambitious ideas to plug in
> here ..... :S
> Patches obviously forthcoming but the bulk of the work can be followed here
> https://github.com/GregBowyer/lucene-solr/commits/solr-uses-lucene-filters
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira