[ 
https://issues.apache.org/jira/browse/SOLR-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947289#comment-15947289
 ] 

Alessandro Benedetti commented on SOLR-10359:
---------------------------------------------

[~otis] :
The scope of this Jira issue is to implement an easy, out-of-the-box approach 
to evaluating Solr.
It must be as simple as possible to configure: we want to hide the complexity 
from users and offer an easy way to evaluate their search engine, reducing the 
expertise required as much as possible.
Arguably, even a Solr admin with minimal knowledge of how evaluation metrics 
work should be able to use it and get an idea of the quality they are achieving.
Given that, I absolutely agree that the strategy we use to store the collected 
data should be pluggable.
I would make an analogy with the way the Suggester component manages lookup 
algorithms (and data structures).
Solr should provide the best option out of the box, but it should be possible 
to configure it (and potentially use third-party systems).
I would say that this capability should not be in the first release of this 
functionality, but definitely later on :)
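
To make the pluggable storage idea a bit more concrete, here is a minimal 
sketch in Python (not Solr code). EventStore and InMemoryEventStore are 
hypothetical names; the fields follow the draft data model in the issue 
description, and the in-memory list only stands in for whatever the default 
backend ends up being.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class UserEvent:
    # Fields follow the draft data model from the issue description.
    query: str
    result_id: str
    result_position: int
    timestamp: float
    relevancy_rating: int  # 0 = impression, 1-5 = type of user interaction
    test_group: str


class EventStore(ABC):
    """Hypothetical storage contract that each backend would implement."""

    @abstractmethod
    def add(self, event: UserEvent) -> None: ...

    @abstractmethod
    def events_for(self, test_group: str) -> Iterable[UserEvent]: ...


class InMemoryEventStore(EventStore):
    """Stand-in for a default backend (the issue proposes a Lucene index)."""

    def __init__(self) -> None:
        self._events: List[UserEvent] = []

    def add(self, event: UserEvent) -> None:
        self._events.append(event)

    def events_for(self, test_group: str) -> List[UserEvent]:
        return [e for e in self._events if e.test_group == test_group]
```

A third-party backend would then only need to provide another EventStore 
implementation, similar in spirit to choosing a different lookup 
implementation for the Suggester.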

[~mnilsson]
1) The idea is actually not to use a separate, exposed collection, but to use 
an internal data structure (potentially an auxiliary Lucene index?). In my 
opinion it is vital to hide the complexity and give users an easy way to access 
it.
Internally, I definitely agree that we will reuse components and modules used 
in the stats and faceting areas.
I also agree that this is already achievable if we manually build the Solr 
collection, model the data in a clever way and run specific stats/faceting 
queries.
Documentation on how to do that would definitely be useful.

2) I agree, the exposed update endpoint can be used for impressions as well (in 
the draft data model, they will have relevancy_rating=0).
But I would also keep the component that does it automatically, for all the 
users who are happy to capture what Solr returns immediately; this could ease 
the client's job and reduce the volume of data transferred.
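
As a rough illustration of the client-side option, a sketch in Python of how 
a client might turn a returned result list into impression events. The 
function name, the 1-based positions and the JSON shape are assumptions; the 
update endpoint itself does not exist yet.

```python
import json
import time
from typing import Dict, List, Optional


def impressions_from_results(query: str, result_ids: List[str], test_group: str,
                             timestamp: Optional[float] = None) -> List[Dict]:
    """Build one impression event (relevancy_rating=0) per returned result."""
    ts = time.time() if timestamp is None else timestamp
    return [
        {
            "query": query,
            "result_id": result_id,
            "result_position": position,  # 1-based rank (an assumption)
            "timestamp": ts,
            "relevancy_rating": 0,  # 0 marks an impression in the draft model
            "test_group": test_group,
        }
        for position, result_id in enumerate(result_ids, start=1)
    ]


# JSON body a client could POST to the proposed (hypothetical) update endpoint.
payload = json.dumps(impressions_from_results("ipod", ["d7", "d2"], "testA", 0.0))
```

With the automatic component, none of this would have to travel from the 
client back to Solr, which is the saving mentioned above.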

3) Definitely a good idea

Regarding your last observation, I agree it is delicate.
In a cluster scenario where the aggregator instances are separate from the 
shards, it would be possible to add the User Interaction Logging components 
only to the aggregator request handlers.
In SolrCloud, what happens if we define two request handlers per collection 
(one for aggregation with user interaction tracking and one without) and then, 
in the aggregation request handler, we use shards.qt=localRequestHandler?
We would call the aggregation request handler as the SolrCloud entry point 
(with tracking), and internally it would aggregate from the local request 
handlers (not tracked).
I am just thinking out loud, so it may not work.
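
A small sketch in Python of what such a request could look like from the 
client side. The handler names /select_tracked and /select_local and the 
collection name are made up for illustration, and whether this setup really 
keeps tracking out of the per-shard requests is untested.

```python
from urllib.parse import urlencode


def tracked_query_url(base_url: str, collection: str, user_query: str) -> str:
    """Build a query URL for the tracked handler, routing shard sub-requests
    to the untracked local handler via the shards.qt parameter."""
    params = {
        "q": user_query,
        "shards.qt": "/select_local",  # hypothetical untracked handler
    }
    # /select_tracked is the hypothetical handler with the logging component.
    return f"{base_url}/{collection}/select_tracked?{urlencode(params)}"


url = tracked_query_url("http://localhost:8983/solr", "products", "ipod nano")
```

The same parameter could instead be set as a default in the tracked handler's 
configuration, so clients would not need to send it.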

> User Interactions Logger Component
> ----------------------------------
>
>                 Key: SOLR-10359
>                 URL: https://issues.apache.org/jira/browse/SOLR-10359
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Alessandro Benedetti
>              Labels: CTR, evaluation
>
> *Introduction*
> Being able to evaluate the quality of your search engine is becoming more and 
> more important day by day.
> This issue sets a milestone for integrating online evaluation metrics with 
> Solr.
> *Scope*
> The scope of this issue is to provide a set of components able to:
> 1) Collect search result impressions (results shown per query)
> 2) Collect user interactions (user interactions on the search results per 
> query, e.g. clicks, bookmarking, etc.)
> 3) Calculate evaluation metrics on demand, such as Click Through Rate, DCG ...
> *Technical Design*
> A SearchComponent can be designed:
> *UsersEventsLoggerComponent*
> A property (such as storeDir) will define where the collected data will be 
> stored.
> Different data structures can be explored; to keep it simple, a first 
> implementation can be a Lucene index.
> *Data Model*
> The user event can be modelled in the following way :
> <query> - the user query the event is related to
> <result_id> - the ID of the search result involved in the interaction
> <result_position> - the position in the ranking of the search result involved 
> in the interaction
> <timestamp> - time when the interaction happened
> <relevancy_rating> - 0 for impressions, a value between 1 and 5 to identify 
> the type of user event; the semantics will depend on the domain and use cases
> <test_group> - this can identify a variant, in A/B testing
> *Impressions Logging*
> When the SearchComponent is assigned to a request handler, every time it 
> processes a request and returns a result set for a query to the user, the 
> component will collect the impressions (results returned) and index them in 
> the auxiliary Lucene index.
> This will happen in parallel, as soon as the results are returned, to avoid 
> affecting the query time.
> Of course an impact on CPU load and memory is expected; it will be interesting 
> to minimise it.
> *User Events Logging*
> An UpdateHandler will be exposed to accept POST requests and collect user 
> events.
> Every time a request is sent, the user event will be indexed in the underlying 
> auxiliary Lucene index.
> *Stats Calculation*
> A RequestHandler will be exposed to calculate stats and aggregations for the 
> metrics:
> /evaluation?metric=ctr&stats=query&compare=testA,testB
> This request could calculate the CTR for testA and testB so that they can be 
> compared, showing stats in total and per query (to highlight the queries with 
> lower/higher CTR).
> The calculations will be done separately for each <test_group> for easy 
> comparison.
> It will be important to keep it as simple as possible for a first version, and 
> then extend it as much as we like.
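
For reference, a minimal sketch in Python of the per-query CTR comparison 
described in the issue. It assumes that CTR is the number of interaction 
events divided by the number of impressions, and that any relevancy_rating 
of 1 or more counts as a click; both are assumptions, not something the issue 
has fixed yet.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, int]  # (test_group, query, relevancy_rating)


def ctr_by_group_and_query(events: Iterable[Event]) -> Dict[Tuple[str, str], float]:
    """Return CTR keyed by (test_group, query)."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for test_group, query, rating in events:
        key = (test_group, query)
        if rating == 0:
            impressions[key] += 1  # rating 0 marks an impression
        else:
            clicks[key] += 1  # assumption: every rating >= 1 is a click
    # Keys with interactions but no recorded impressions are skipped.
    return {key: clicks[key] / count for key, count in impressions.items()}


sample = [
    ("testA", "ipod", 0), ("testA", "ipod", 0), ("testA", "ipod", 3),
    ("testB", "ipod", 0), ("testB", "ipod", 0), ("testB", "ipod", 0),
    ("testB", "ipod", 0), ("testB", "ipod", 1),
]
ctr = ctr_by_group_and_query(sample)
```

Keeping the test group in the key is what makes the testA versus testB 
comparison a simple lookup per query.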



