[ 
https://issues.apache.org/jira/browse/LUCENE-5699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090443#comment-14090443
 ] 

Tommaso Teofili commented on LUCENE-5699:
-----------------------------------------

thanks Gergő, the patch looks much better.

bq. When I first tried to use the Lucene Classification, one of the bigger 
problems was that the scores that come back mean nothing. Basically the 
classifier returns the class and a random number. If you have 2 texts and you 
push them into the classifier, the scores don't help you figure out which 
result is more trustworthy.

while the classification score of course isn't a random number, I 
agree the score should be normalized between 0 and 1, the higher the better 
(basically this amounts to a probability measure).
Regarding the implementation, I don't think the API needs to be touched for 
this: normalized scores should always be returned in _ClassificationResult_s by 
_Classifier#assignClass_ method implementations.
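
As a minimal sketch of what such a normalization could look like (an 
illustration only, not the code in the attached patches): assuming each class 
has an unnormalized log score, as a Naive Bayes classifier working in log space 
would produce, the scores can be rescaled so that they lie in [0, 1] and sum to 
1. The _ScoreNormalizer_ class and its _normalize_ method are hypothetical names 
and are not part of Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: rescales per-class log scores so they sum to 1. */
final class ScoreNormalizer {

  private ScoreNormalizer() {}

  /**
   * Takes unnormalized log scores keyed by class name and returns values in
   * [0, 1] that sum to 1. Assumes at least one score is finite.
   */
  static Map<String, Double> normalize(Map<String, Double> logScores) {
    // Subtract the maximum before exponentiating, so that very negative
    // log scores do not all underflow to zero.
    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores.values()) {
      max = Math.max(max, s);
    }
    double sum = 0d;
    for (double s : logScores.values()) {
      sum += Math.exp(s - max);
    }
    // LinkedHashMap keeps the iteration order of the input map.
    Map<String, Double> normalized = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : logScores.entrySet()) {
      normalized.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
    }
    return normalized;
  }
}
```

With scores normalized this way, the value returned for one document no longer 
depends on the document's length in the way raw log-likelihoods do, which is 
what makes scores comparable across documents.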

bq. If you can tell the users how sure you are, it's not a big step to also 
tell them what the other options are: what the 3 or 5 next most relevant 
classes are.

ok, the use case sounds reasonable; however, my only concern (which extends to 
the normalization implementation, as it's based on the generation of lists) 
is that the current implementation may not scale well if you 
have a huge number of classes.

Regarding the API, I would be in favor of introducing something like 
_Classifier#getClasses(String text)_ returning a _List<ClassificationResult>_ 
for this use case and, as an alternative or in addition, 
_Classifier#getClasses(String text, int max)_ to limit the number of classes 
returned (as the user is probably interested in the first N classes rather 
than the whole list of classes).
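
A rough sketch of how the _max_ variant could select the best N classes without 
building and sorting the full list is shown below. It keeps a min-heap of at 
most _max_ entries, so the extra memory is bounded by _max_ rather than by the 
number of classes. _ScoredClass_ and _TopClasses_ are hypothetical stand-ins, 
not Lucene's _ClassificationResult_ or any existing API, and the per-class 
scores are assumed to be already computed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a classification result: a class name and its score. */
final class ScoredClass implements Comparable<ScoredClass> {
  final String assignedClass;
  final double score;

  ScoredClass(String assignedClass, double score) {
    this.assignedClass = assignedClass;
    this.score = score;
  }

  // Natural order is ascending by score, so the lowest score sits at the heap head.
  @Override
  public int compareTo(ScoredClass other) {
    return Double.compare(score, other.score);
  }
}

/** Hypothetical helper, not part of Lucene. */
final class TopClasses {

  private TopClasses() {}

  /** Returns at most {@code max} classes, best score first. */
  static List<ScoredClass> top(Map<String, Double> scores, int max) {
    if (max <= 0) {
      return new ArrayList<>();
    }
    PriorityQueue<ScoredClass> heap = new PriorityQueue<>();
    for (Map.Entry<String, Double> e : scores.entrySet()) {
      heap.offer(new ScoredClass(e.getKey(), e.getValue()));
      if (heap.size() > max) {
        heap.poll(); // drop the current lowest score
      }
    }
    List<ScoredClass> result = new ArrayList<>(heap);
    result.sort(Collections.reverseOrder()); // highest score first
    return result;
  }
}
```

Note that this only bounds the cost of collecting and sorting the results; 
every class still has to be scored once, so it does not by itself address the 
cost of scoring a huge number of classes.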


> Lucene classification score calculation normalize and return lists
> ------------------------------------------------------------------
>
>                 Key: LUCENE-5699
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5699
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: modules/classification
>            Reporter: Gergő Törcsvári
>            Assignee: Tommaso Teofili
>         Attachments: 06-06-5699.patch, 0730.patch, 0803-base.patch
>
>
> Currently the classifiers can return only the "best matching" class. If 
> somebody wants to use them for more complex tasks, he needs to modify these 
> classes to get the second and third results too. If it is possible to return 
> a list and it does not cost many resources, why don't we do that? (We iterate 
> over a list anyway.)
> The Bayes classifier got too small return values, and there was a bug with 
> the zero floats. It was fixed with logarithms. It would be nice to scale the 
> class scores so that they sum to one; then we could compare two documents' 
> returned scores and relevance. (If we don't do this, the word count of the 
> test documents affects the result score.)
> In bullet points:
> * In the Bayes classifier, normalized score values, returned as result 
> lists.
> * In the KNN classifier, the possibility to return a result list.
> * Make the ClassificationResult Comparable for list sorting.



