Length of the field does not affect the doc score accurately for Chinese analyzer (SmartChineseAnalyzer)

andy Wed, 15 Jan 2014 00:48:38 -0800

Hi guys,

As the subject says, it seems that the length of the field does not affect
the doc score accurately for the Chinese analyzer in my source code.


Index source code:

    private static Directory DIRECTORY;

    @BeforeClass
    public static void before() throws IOException {
        DIRECTORY = new RAMDirectory();
        Analyzer chineseanalyzer = new SmartChineseAnalyzer(Version.LUCENE_40);
        IndexWriterConfig indexWriterConfig =
                new IndexWriterConfig(Version.LUCENE_40, chineseanalyzer);
        FieldType nameType = new FieldType();
        nameType.setIndexed(true);
        nameType.setStored(true);
        nameType.setOmitNorms(false);
        try {
            IndexWriter indexWriter = new IndexWriter(DIRECTORY, indexWriterConfig);

            List<String> nameList = new ArrayList<String>();
            nameList.add("咨询公司");
            nameList.add("飞鹰咨询管理咨询公司");
            nameList.add("北京中标咨询公司");
            nameList.add("重庆咨询公司");
            nameList.add("商务咨询服务公司");
            nameList.add("法律咨询公司");
            for (int i = 0; i < nameList.size(); i++) {
                Document document = new Document();
                document.add(new Field("name", nameList.get(i), nameType));
                document.add(new Field("id", String.valueOf(i + 1), nameType));
                indexWriter.addDocument(document);
            }
            indexWriter.commit();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

Search snippet:

    @Test
    public void testChinese() throws IOException, ParseException {
        String keyword = "咨询公司";
        System.out.println("Searching for:" + keyword);
        System.out.println();
        IndexReader indexReader = DirectoryReader.open(DIRECTORY);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Query query = new QueryParser(Version.LUCENE_40, "name",
                new SmartChineseAnalyzer(Version.LUCENE_40)).parse(keyword);
        TopDocs topDocs = indexSearcher.search(query, 15);
        System.out.println("Search Result:");
        if (null != topDocs && 0 < topDocs.totalHits) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println("doc id:" + indexSearcher.doc(scoreDoc.doc).get("id"));
                String name = indexSearcher.doc(scoreDoc.doc).get("name");
                System.out.println("content of Field:" + name);
                dumpCNTokens(name);
                System.out.println("score:" + scoreDoc.score);
                System.out.println("-------------------------------------------");
            }
        } else {
            System.out.println("no results");
        }
    }


The search result is as follows:
Searching for:咨询公司

Search Result:
doc id:1
content of Field:咨询公司
Terms:咨询        公司      
score:0.74763227
-------------------------------------------
doc id:2
content of Field:飞鹰咨询管理咨询公司
Terms:飞鹰        咨询      管理      咨询      公司      
score:0.6317303
-------------------------------------------
doc id:3
content of Field:北京中标咨询公司
Terms:北京        中标      咨询      公司      
score:0.5981058
-------------------------------------------
doc id:4
content of Field:重庆咨询公司
Terms:重庆        咨询      公司      
score:0.5981058
-------------------------------------------
doc id:5
content of Field:商务咨询服务公司
Terms:商务        咨询      服务      公司      
score:0.5981058
-------------------------------------------
doc id:6
content of Field:法律咨询公司
Terms:法律        咨询      公司      
score:0.5981058
-------------------------------------------

Docs 3, 4, 5 and 6 have the same score, but I think docs 4 and 6 should
have a higher score than docs 3 and 5, because docs 4 and 6 have three
terms while docs 3 and 5 have four terms.
Am I right? Can anyone give me an explanation? And how can I get the
expected result?
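
One plausible explanation for the tie, assuming the default TF-IDF similarity (DefaultSimilarity) is in use, which the code above does not override: the length norm 1/sqrt(numTerms) is not stored exactly. It is squeezed into a single byte at index time, and that encoding is lossy. The sketch below re-implements the byte encoding in plain Java so that it runs without Lucene on the classpath. The class name NormQuantizationSketch and the helpers storedNorm and relativeScore are made up for illustration, and the encode/decode methods are written to mirror Lucene 4.x's SmallFloat.floatToByte315 / byte315ToFloat as an assumption, so they should be checked against the real class.

```java
public class NormQuantizationSketch {

    // Assumption: mirrors Lucene 4.x SmallFloat.floatToByte315 (lossy float -> byte).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow, largest representable value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Assumption: mirrors Lucene 4.x SmallFloat.byte315ToFloat (byte -> float).
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Length norm as it would come back from the index: 1/sqrt(numTerms),
    // encoded to one byte and decoded again (no field boost applied).
    static float storedNorm(int numTerms) {
        float exact = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(exact));
    }

    // Score up to a constant factor for the two-term query 咨询 + 公司.
    // Both terms occur in all six documents, so idf, queryNorm and coord
    // are the same for every hit and drop out of the comparison.
    static double relativeScore(int numTerms, int freqZixun, int freqGongsi) {
        return storedNorm(numTerms) * (Math.sqrt(freqZixun) + Math.sqrt(freqGongsi));
    }

    public static void main(String[] args) {
        // Stored norms: 1 term -> 1.0, 2 -> 0.625, 3 -> 0.5, 4 -> 0.5, 5 -> 0.4375
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " terms: exact=" + (1.0 / Math.sqrt(n))
                    + " stored=" + storedNorm(n));
        }
        // Ratios relative to doc 1 (2 terms), to compare with the reported scores.
        double doc1 = relativeScore(2, 1, 1);
        System.out.println("doc 2 / doc 1 = " + relativeScore(5, 2, 1) / doc1);
        System.out.println("doc 4 / doc 1 = " + relativeScore(3, 1, 1) / doc1);
        System.out.println("doc 3 / doc 1 = " + relativeScore(4, 1, 1) / doc1);
    }
}
```

If this is what is happening, fields with three and four terms both get a stored norm of 0.5, which would make docs 3 to 6 score identically. The reported numbers fit: 0.5981058 / 0.74763227 is 0.8, the same as 0.5 / 0.625, and 0.6317303 / 0.74763227 is about 0.845, the same as the doc 2 ratio computed above. Calling indexSearcher.explain(query, scoreDoc.doc) on the real index should show the fieldNorm value actually used for each hit.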



