Re: [xwiki-devs] [Solr] What do we search for?

Marius Dumitru Florea Thu, 14 Nov 2013 21:49:29 -0800

On Fri, Nov 15, 2013 at 1:02 AM, Paul Libbrecht <[email protected]> wrote:
> Marius,
>


> I would suggest generating the schema and config, reloading every time 
> there's a class change.

That would mean re-indexing everything, right? It would take too much time.

> Alternatively, and that's how solr-drupal works, you would define the fields 
> by prefix, but I am not sure the aliasing would work.
>

> I believe that the query-expansion step, from title:x to title-en:x 
> title-ft:x, etc., is best controlled early so that applications can 
> change that somehow. In curriki, this is done with a custom query-component 
> which uses the query-parser (with a default-field which does not exist) then 
> rewrites the query objects (which is a fairly easy game).

That's actually what I'm currently investigating. I'll try to extend
the ExtendedDismaxQParserPlugin, let it do its query parsing and then
expand the query with more query objects when the "field" name matches
some pattern (e.g. property_*).
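
To make the idea concrete, here is a minimal sketch of the expansion
step. It is a string-level stand-in written in Python, not the parser
plugin itself: the real change would rewrite parsed query objects in
Java. The language list and the single-term clause pattern are
assumptions made for illustration only.

```python
import re

# Hypothetical language suffixes. "_" stands for the language-neutral
# variant, so a field "x" expands to "x__", "x_en", "x_fr".
LANGUAGES = ("_", "en", "fr")

# Matches single-term clauses such as property_Blog.BlogPostClass_title:wiki.
# Phrases, ranges and already-expanded fields are out of scope for this sketch.
CLAUSE = re.compile(r"\b(property_[\w.]+):(\w+)")


def expand_property_fields(query, languages=LANGUAGES):
    """Rewrite each property_* clause into an OR over its per-language fields."""

    def expand(match):
        field, term = match.group(1), match.group(2)
        parts = ["%s_%s:%s" % (field, lang, term) for lang in languages]
        return "(" + " OR ".join(parts) + ")"

    return CLAUSE.sub(expand, query)


print(expand_property_fields("property_Blog.BlogPostClass_title:wiki"))
```

Clauses whose field name does not start with property_ are left
untouched, so title:wiki would still go through the static alias.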

Thanks,
Marius

>
> Hope it helps.
>
> Le 14 nov. 2013 à 17:28, Marius Dumitru Florea 
> <[email protected]> a écrit :
>
>> On Wed, Nov 13, 2013 at 8:08 PM, Ludovic Dubost <[email protected]> wrote:
>>> Hi Marius,
>>>
>>> I have a quick question as I start reading your proposal. I don't see
>>> anything about multi-language indexing.
>>> I remember that in the current Solr implementation there are multiple
>>> fields for each language. Would there be a field for each language indexed
>>> for each property?
>>
>> Yes. Right now I'm struggling to find a way to define an alias for a
>> group of dynamic fields. For the document title we have this in
>> solrconfig.xml:
>>
>> <str name="f.title.qf">title__ title_ar title_bg title_ca ...</str>
>>
>> which makes 'title' an alias for all its translations and allows us to
>> write title:text in the search query. I need to do the same, but
>> dynamically, for each object property:
>>
>> property_Blog.BlogPostClass_title =
>> property_Blog.BlogPostClass_title__,
>> property_Blog.BlogPostClass_title_en,
>> property_Blog.BlogPostClass_title_fr, ...
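
One way to get the same effect without touching solrconfig.xml might be
to send the alias as a request parameter, built per property at query
time. The sketch below only builds the parameter map; the language list
is a placeholder, and whether the f.<alias>.qf syntax copes with the
dots in our property field names is an untested assumption.

```python
def alias_params(property_fields, languages=("_", "en", "fr")):
    """Build one f.<alias>.qf parameter per object property field.

    The language suffixes are hypothetical; "_" yields the
    language-neutral variant (e.g. title -> title__).
    """
    params = {}
    for field in property_fields:
        targets = ["%s_%s" % (field, lang) for lang in languages]
        params["f.%s.qf" % field] = " ".join(targets)
    return params


for name, value in alias_params(["property_Blog.BlogPostClass_title"]).items():
    print(name, "=", value)
```

The resulting map would be merged into the other query parameters, one
entry for each property the query actually mentions.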
>>
>> I'll keep you posted.
>>
>> Thanks,
>> Marius
>>
>>>
>>> Ludovic
>>>
>>>
>>> 2013/10/14 Marius Dumitru Florea <[email protected]>
>>>
>>>> I started writing
>>>> http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema . I need help
>>>> with two things:
>>>>
>>>> * test cases
>>>> http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HTestCases
>>>> * if time permits, review the proposal, especially
>>>> http://dev.xwiki.org/xwiki/bin/view/Design/SolrSchema#HAMixedApproach
>>>> .
>>>>
>>>> Thanks,
>>>> Marius
>>>>
>>>>
>>>> On Fri, Oct 11, 2013 at 12:55 PM, Marius Dumitru Florea
>>>> <[email protected]> wrote:
>>>>> Hi devs,
>>>>>
>>>>> This is a very important question so think carefully. Let me explain:
>>>>>
>>>>> In XWiki (model) we have a few entity types. There are *wikis* which
>>>>> have *spaces* which have *documents*. A document can have *objects*
>>>>> and *attachments*. A document can also define a *class*.
>>>>>
>>>>> At the same time we like to say that in XWiki "everything is a
>>>>> document" because everything revolves around documents. The document
>>>>> is the central notion.
>>>>>
>>>>> We can query the database (using HQL or XWQL) for any of the
>>>>> previously mentioned entities but what should a Solr query return
>>>>> (semantically)? In other words:
>>>>>
>>>>> * are you searching for an object without caring about the document
>>>>> that holds the object? Same for an object property.
>>>>> * how often are you searching for an attachment without caring about
>>>>> the document that holds the attachment?
>>>>> * are you searching for a class or for the document that defines that
>>>>> class?
>>>>> * are you searching for a wiki without caring about the documents it
>>>>> contains? Same for a space.
>>>>>
>>>>> IMO the result of a Solr query should be, semantically, a list of
>>>>> documents. But maybe I'm wrong.
>>>>>
>>>>> -----------------------
>>>>> Technical Details
>>>>> -----------------------
>>>>>
>>>>> Unlike a relational database, a Solr/Lucene index has a single 'table'.
>>>>> So normally you index a single entity type. Each row in the index
>>>>> represents an entity of that type. As a consequence the result of a
>>>>> Solr query is semantically a list of entities of that type. In our
>>>>> case the entity type is (naturally) *document*.
>>>>>
>>>>> If you want to index more entity types (e.g. index attachments and
>>>>> objects _separately_, not as part of a document) then, since there is
>>>>> only one 'table' in the index, you need to add a 'type' column that
>>>>> specifies the type of entity you have on each row (e.g. type=document,
>>>>> type=attachment, type=object etc.). The result of a Solr query is now,
>>>>> semantically, a list of different entity types, unless you filter by a
>>>>> specific type. It smells like a hack to me.
>>>>>
>>>>> Let's imagine what happens if we want to search for blog posts that
>>>>> have a specific tag. With the first approach this is easy because all
>>>>> the (indexed) information is on a single row. With the second approach
>>>>> this is considerably more complex because the information is spread
>>>>> across multiple rows:
>>>>>
>>>>> * one row with type=document for the blog post document
>>>>> * one row with type=object for the blog post object
>>>>> * one row with type=object for the tag object
>>>>>
>>>>> In a relational database when you have the information spread in
>>>>> multiple places (tables) you do joins. Fortunately (you would say)
>>>>> Solr supports joins. In this particular case we would have to perform
>>>>> 2 joins which means:
>>>>>
>>>>> index X index X index
>>>>>
>>>>> where X represents the Cartesian product. The document name would be
>>>>> the join key. Pretty complex even before trying to write this in Solr
>>>>> query syntax.
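
The difference between the two approaches can be illustrated with a
small in-memory model. This is plain Python, not Solr: the rows, the
column names and the use of XWiki.TagClass are made up for the example,
and the nested loops only mimic what a double join has to do.

```python
# Approach 1: one row per document, object data flattened into the row.
DOCUMENT_ROWS = [
    {"document": "Blog.Post1",
     "classes": ["Blog.BlogPostClass", "XWiki.TagClass"], "tags": ["solr"]},
    {"document": "Blog.Post2",
     "classes": ["Blog.BlogPostClass"], "tags": []},
    {"document": "Main.WebHome",
     "classes": ["XWiki.TagClass"], "tags": ["solr"]},
]

# Approach 2: one row per entity, distinguished by a 'type' column.
ENTITY_ROWS = [
    {"type": "document", "document": "Blog.Post1"},
    {"type": "object", "document": "Blog.Post1", "class": "Blog.BlogPostClass"},
    {"type": "object", "document": "Blog.Post1", "class": "XWiki.TagClass",
     "tags": ["solr"]},
    {"type": "document", "document": "Blog.Post2"},
    {"type": "object", "document": "Blog.Post2", "class": "Blog.BlogPostClass"},
    {"type": "document", "document": "Main.WebHome"},
    {"type": "object", "document": "Main.WebHome", "class": "XWiki.TagClass",
     "tags": ["solr"]},
]


def tagged_posts_flat(rows, tag):
    """Single pass: everything needed is on the document row."""
    return [row["document"] for row in rows
            if "Blog.BlogPostClass" in row["classes"] and tag in row["tags"]]


def tagged_posts_joined(rows, tag):
    """index X index X index, joined on the document name."""
    documents = [row for row in rows if row["type"] == "document"]
    objects = [row for row in rows if row["type"] == "object"]
    result = []
    for doc in documents:
        for post in objects:
            for tagged in objects:
                if (post["document"] == doc["document"] == tagged["document"]
                        and post["class"] == "Blog.BlogPostClass"
                        and tagged["class"] == "XWiki.TagClass"
                        and tag in tagged.get("tags", [])):
                    result.append(doc["document"])
    return result


print(tagged_posts_flat(DOCUMENT_ROWS, "solr"))
print(tagged_posts_joined(ENTITY_ROWS, "solr"))
```

Both functions find the same blog post, but the second one has to
combine three rows per hit where the first one reads a single row.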
>>>>>
>>>>> So basically the question becomes: is it worth indexing more entities
>>>>> _separately_ instead of indexing just documents (with info about their
>>>>> objects and attachments) considering the complexity that it brings in
>>>>> writing Solr queries? Do we search for objects and attachments alone
>>>>> as separate entities often enough to justify this complexity? My
>>>>> answer is no.
>>>>>
>>>>> Thanks,
>>>>> Marius
>>>> _______________________________________________
>>>> devs mailing list
>>>> [email protected]
>>>> http://lists.xwiki.org/mailman/listinfo/devs
>>>>
>>>
>>>
>>>
>>> --
>>> Ludovic Dubost
>>> Founder and CEO
>>> Blog: http://blog.ludovic.org/
>>> XWiki: http://www.xwiki.com
>>> Skype: ldubost GTalk: ldubost
