Here's a short summary of what I implemented in the end:

* I'm using an encoding scheme similar to URL-encoding to support
special characters in field names. I didn't use URL-encoding directly
because '+' (plus) and '%' (percent) have special meaning in the Solr
query syntax. Also, I didn't want to encode Unicode letters.

E.g. "Somé Spâce.Bob's Claß" is encoded as "Somé$20Spâce.Bob$27s$20Claß"
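
To make the scheme concrete, here is a minimal Python sketch of such an
encoding. It is not the actual implementation: the function names are
hypothetical, and the exact set of characters left unencoded (Unicode
letters, digits and '.') as well as the use of UTF-8 bytes for the hex
codes are assumptions inferred from the example above.

```python
import re


def encode_field_name(name: str) -> str:
    """Encode every character that is not a letter, digit or '.' as $XX."""
    out = []
    for ch in name:
        if ch.isalnum() or ch == ".":
            # Unicode letters and digits are kept as they are (assumption:
            # the dot is kept too, since it appears unencoded in the example).
            out.append(ch)
        else:
            # One $XX group per UTF-8 byte; '$' itself becomes $24,
            # so decoding stays unambiguous.
            out.extend(f"${b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# A run of consecutive $XX groups is decoded together so that
# multi-byte characters are restored correctly.
_ESCAPED = re.compile(r"(?:\$[0-9A-Fa-f]{2})+")


def decode_field_name(encoded: str) -> str:
    """Reverse encode_field_name."""
    def unescape(match):
        hex_digits = match.group(0).replace("$", "")
        return bytes.fromhex(hex_digits).decode("utf-8")

    return _ESCAPED.sub(unescape, encoded)
```

With these assumptions, encode_field_name("Somé Spâce.Bob's Claß")
gives "Somé$20Spâce.Bob$27s$20Claß", and decode_field_name restores
the original name.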

* I wanted to be able to extract the class and property reference from
a field name in order to display the location where the search text
was found. I couldn't use the default class / property reference
serialization syntax because '\' and '^' have special meaning in the
Solr query syntax. So I implemented a simple serialization syntax that
uses only '.' as the entity separator; a dot inside an entity name is
escaped by doubling it.

E.g. "wiki:Some\.Space.My\.Class^color" is serialized as
"wiki.Some..Space.My..Class.color"
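
The dot-doubling rule can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names are
hypothetical, and a reference is modelled as a plain list of entity
names (wiki, space, class, property) rather than a real reference
object.

```python
def serialize_reference(parts):
    """Join entity names with '.', doubling any dot inside a name."""
    return ".".join(part.replace(".", "..") for part in parts)


def parse_reference(serialized):
    """Split on single dots; a doubled dot is a literal dot in a name."""
    parts, current, i = [], [], 0
    while i < len(serialized):
        ch = serialized[i]
        if ch == ".":
            if i + 1 < len(serialized) and serialized[i + 1] == ".":
                # Escaped dot: keep one literal dot and skip both characters.
                current.append(".")
                i += 2
                continue
            # Single dot: end of the current entity name.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    parts.append("".join(current))
    return parts
```

For instance, serialize_reference(["wiki", "Some.Space", "My.Class",
"color"]) yields "wiki.Some..Space.My..Class.color", and
parse_reference turns that string back into the four names. Note that
this sketch does not handle names that are empty or that begin or end
with a dot, because three dots in a row would then be ambiguous.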

* I added the following fields to a document's index:

  object: all types of objects found on the indexed document
  object.Space.Class: collects values from all Space.Class properties
  property.Space.Class.propName: indexes the values of
    Space.Class^propName (multiple values if there are multiple
    objects of type Space.Class)

* object.* and property.* are multilingual fields so they are indexed
in multiple languages. I added support for dynamic aliases (for
dynamic fields) so we can write

object:Blog.BlogPostClass AND property.Blog.BlogPostClass.title:text
AND object.XWiki.TagClass:news

and it will be expanded into

object:Blog.BlogPostClass AND
(property.Blog.BlogPostClass.title_en:text OR
property.Blog.BlogPostClass.title_fr:text OR ...) AND
(object.XWiki.TagClass_en:news OR object.XWiki.TagClass_fr:news OR
...)
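
As a rough illustration of the expansion, the following Python sketch
rewrites the query string directly. This is a simplification and not
how the feature is implemented (the discussion quoted below is about
expanding parsed query objects inside Solr); the function name, the
regular expression and the explicit language list are all assumptions.

```python
import re

# Matches aliases such as "property.Blog.BlogPostClass.title:text" or
# "object.XWiki.TagClass:news". A plain "object:..." clause is left alone
# because the prefix must be followed by a dot.
_ALIAS = re.compile(r"\b((?:object|property)\.[^\s:()]+):([^\s()]+)")


def expand_aliases(query, languages):
    """Replace each alias with an OR over its per-language fields."""
    def expand(match):
        field, value = match.group(1), match.group(2)
        clauses = [f"{field}_{lang}:{value}" for lang in languages]
        return "(" + " OR ".join(clauses) + ")"

    return _ALIAS.sub(expand, query)


query = ("object:Blog.BlogPostClass AND "
         "property.Blog.BlogPostClass.title:text AND "
         "object.XWiki.TagClass:news")
print(expand_aliases(query, ["en", "fr"]))
```

The sketch only handles single-term values; phrases, ranges and
escaped characters would need a real query parser.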

NOTE: Solr doesn't support dynamic fields as default fields, i.e. as
fields that are matched when you search for free text (without
field:value in the query). This is not a problem for the search
results, as dynamic fields like object.* and property.* are copied and
aggregated in 'objcontent', which is a default field. The issue is that
we can't know exactly which XClass property was matched; we only know
that the free search text was found inside an object.

WDYT? I can still make adjustments before 5.3 final if you think
something is wrong.

Thanks,
Marius

On Fri, Nov 15, 2013 at 9:01 AM, Paul Libbrecht <[email protected]> wrote:
> Hello Marius,
>
>>> I would suggest to generate the schema and config, reloading every time 
>>> there's a class change.
>>
>> That would mean re-indexing everything, right? It would take too much time.
>
> No for most cases.
> A Lucene index is "just" a heap of "terms".
> If you change the schema in that you add a new field, the impact on the index 
> is zero.
> If you rename a field, you need to reindex.
> If you change the type of a field (or its analyzer) then you have to reindex.
> If you delete a field, you leave some dirt; you'd have to reindex if you
> reuse this field name.
>
>>> I believe that the query-expansion step, from title:x to title-en:x
>>> title-fr:x, etc… is best to be controlled early so that applications can
>>> change that somehow. In curriki, this is done with a custom query-component 
>>> which uses the query-parser (with a default-field which does not exist) 
>>> then rewrites the query objects (which is a fairly easy game).
>>
>> That's actually what I'm currently investigating. I'll try to extend
>> the ExtendedDismaxQParserPlugin, let it do its query parsing and then
>> expand the query with more query objects when the "field" name matches
>> some pattern (e.g. property_*)
>
> I am not sure it's best practice, but as an application developer, I would 
> enjoy if this code was in a Groovy page.
>
> paul
> _______________________________________________
> devs mailing list
> [email protected]
> http://lists.xwiki.org/mailman/listinfo/devs
