[ https://issues.apache.org/jira/browse/JENA-242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268016#comment-13268016 ]
laotao edited comment on JENA-242 at 5/4/12 2:09 AM:
-----------------------------------------------------

Raw Lucene scores (normalized or not) don't reflect the absolute similarity between a query and its results. TF-IDF may not be appropriate for calculating these similarities over RDF literals, because literals are usually short compared to typical (web) documents. Have you considered other algorithms, e.g. minimum edit distance?

Another way to improve the search, I think, is to take the underlying ontology constructs into account. For example, when an exact basic pattern match has an owl:differentFrom relationship with a Lucene match, the similarity score of the latter should be cut significantly (even to zero, so that the Lucene match is discarded). This is important because many resources that are owl:differentFrom each other can have very similar literals.
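
To make the two suggestions above concrete, here is a minimal sketch in Python. It is not LARQ or Lucene code: the function names, the resource identifiers, and the way owl:differentFrom pairs are stored (a set of frozenset pairs) are all assumptions made for illustration. It scores each literal by edit distance scaled to [0, 1], then zeroes any candidate declared owl:differentFrom the exact match.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(query: str, literal: str) -> float:
    """Scale edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(query), len(literal))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(query, literal) / longest


def rescore(query, exact_match, candidates, different_from):
    """Score candidates and drop those declared different from the exact match.

    candidates:     {resource: literal} from the text index (hypothetical layout)
    different_from: set of frozenset({r1, r2}) pairs (hypothetical layout)
    """
    scores = {}
    for resource, literal in candidates.items():
        score = similarity(query.lower(), literal.lower())
        pair = frozenset((exact_match, resource))
        if exact_match is not None and pair in different_from:
            score = 0.0  # literally similar, but known to be a different resource
        scores[resource] = score
    return scores
```

With a query of "Paris", an exact match on a hypothetical ex:Paris_France, and a second candidate ex:Paris_Texas that carries the same literal but is declared owl:differentFrom the first, rescore keeps the first at 1.0 and cuts the second to 0.0.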
> LARQ scores not normalized
> --------------------------
>
> Key: JENA-242
> URL: https://issues.apache.org/jira/browse/JENA-242
> Project: Apache Jena
> Issue Type: Bug
> Components: LARQ
> Affects Versions: LARQ 1.0.0
> Environment: Fuseki
> Reporter: laotao
>
> In previous versions the LARQ score seemed to be normalized to the range [0, 1].
> In LARQ 1.0.0 some scores can be higher than 1.
> Normalized scores are needed to filter SPARQL results (so that only items
> above a certain quality are shown).
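
As a client-side stopgap for the filtering use case in the report, raw scores can be divided by the highest score in the result set. The sketch below assumes the hits are already available as (resource, raw score) pairs; the function names are hypothetical and this is not how LARQ computes scores. The result is relative to the best hit of each query, so it is not an absolute similarity measure.

```python
def normalize_scores(hits):
    """Divide every raw score by the maximum so the best hit scores 1.0.

    hits: list of (resource, raw_score) pairs (hypothetical layout).
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    if top <= 0:
        # No positive scores to scale against; treat every hit as 0.0.
        return [(resource, 0.0) for resource, _ in hits]
    return [(resource, score / top) for resource, score in hits]


def filter_hits(hits, threshold):
    """Keep only hits whose normalized score reaches the threshold."""
    return [(r, s) for r, s in normalize_scores(hits) if s >= threshold]
```

Because the scaling depends on the top hit, a threshold of 0.5 means "at least half as good as the best result for this query", not "at least 50% similar to the query".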