[ 
https://issues.apache.org/jira/browse/JENA-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268016#comment-13268016
 ] 

laotao edited comment on JENA-242 at 5/4/12 2:09 AM:
-----------------------------------------------------

Raw Lucene scores (normalized or not) don't really reflect the absolute 
similarity between a query and its results. TF-IDF may not be an appropriate 
way to compute these similarities for RDF literals, because literals are 
usually short compared to typical (web) documents. Have you considered other 
algorithms, e.g. minimum edit distance? 
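
To illustrate the edit distance idea, here is a minimal Python sketch, not 
LARQ or Lucene code. The function names edit_distance and edit_similarity are 
made up for this example, and dividing by the longer string's length is just 
one possible way to map the distance to [0, 1].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(query: str, literal: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means equal after case folding."""
    q, lit = query.casefold(), literal.casefold()
    longest = max(len(q), len(lit))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    return 1.0 - edit_distance(q, lit) / longest
```

Unlike a TF-IDF score, this value depends only on the two strings being 
compared, so a fixed threshold means the same thing for every query.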

Another way to improve the search, I think, is to take the underlying ontology 
constructs into account. For example, when an exact basic pattern match has an 
owl:differentFrom relationship with a Lucene match, the similarity score of 
the latter should be cut significantly (even to zero, so that the Lucene match 
is discarded). This matters because many resources that are owl:differentFrom 
each other can have very similar literal values.
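
A rough sketch of that penalty in plain Python, assuming the Lucene hits, the 
exact matches, and the owl:differentFrom pairs have already been fetched. The 
function penalize_different_from and its parameters are hypothetical and not 
part of LARQ.

```python
def penalize_different_from(lucene_hits, exact_matches, different_from,
                            factor=0.0):
    """Scale down Lucene hits declared owl:differentFrom an exact match.

    lucene_hits    -- dict mapping resource URI to its Lucene score
    exact_matches  -- set of URIs returned by the exact basic pattern match
    different_from -- iterable of (a, b) URI pairs asserted owl:differentFrom
    factor         -- multiplier for penalized hits; 0.0 drops them entirely
    """
    # owl:differentFrom is symmetric, so store each pair without order.
    pairs = {frozenset(pair) for pair in different_from}
    adjusted = {}
    for uri, score in lucene_hits.items():
        conflicts = any(frozenset((uri, exact)) in pairs
                        for exact in exact_matches)
        if conflicts:
            if factor == 0.0:
                continue  # abandon this Lucene match altogether
            score *= factor
        adjusted[uri] = score
    return adjusted
```

With the default factor of 0.0 a conflicting hit disappears from the results. 
A factor such as 0.1 would keep it but rank it much lower.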
                
> LARQ scores not normalized
> --------------------------
>
>                 Key: JENA-242
>                 URL: https://issues.apache.org/jira/browse/JENA-242
>             Project: Apache Jena
>          Issue Type: Bug
>          Components: LARQ
>    Affects Versions: LARQ 1.0.0
>         Environment: Fuseki
>            Reporter: laotao
>
> In previous versions the LARQ score seemed to be normalized to the range [0, 1]. 
> In LARQ 1.0.0 some scores can be higher than 1. 
> Normalized scores are needed to filter SPARQL results (so that only items 
> above a certain quality are shown).
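
For reference, one common way to bring scores back into [0, 1] is to divide by 
the top score of the result set. The sketch below is illustrative only; 
normalize_scores is a made-up helper, and whether earlier LARQ versions did 
exactly this is an assumption.

```python
def normalize_scores(scores):
    """Rescale scores into [0, 1] by dividing by the top score.

    Scores are left untouched when the top score is already <= 1.0.
    Assumes the raw scores are non-negative.
    """
    scores = list(scores)
    top = max(scores, default=0.0)
    if top <= 1.0:
        return scores
    return [score / top for score in scores]
```

Note that scores normalized this way are relative to the best hit of each 
query, so the same threshold can mean different things for different queries.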

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
