Hi Grant,
Thanks for the reply. I found the problem in the code that prints
these (a silly mistake):
int docId = hits[j].doc;
Document curDoc = searcher.doc(docId);
and then I passed j instead of docId to the explain method.
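For the archives, the confusion was between the rank j and the internal docId. A tiny Lucene-free sketch (Hit stands in for ScoreDoc; the doc ids are just the ones from my output below):

```java
// Minimal illustration (no Lucene): the rank j and the docId stored in the
// hit are different numbers, so passing j to explain() instead of hits[j].doc
// explains a different document whenever the two differ.
class RankVsDocId
{
    static class Hit
    {
        final int doc; // internal Lucene docId, NOT the position in the results
        Hit(int doc) { this.doc = doc; }
    }

    public static void main(String[] args)
    {
        // Results sorted by score: the best hit happens to be internal doc 35433.
        Hit[] hits = { new Hit(35433), new Hit(20), new Hit(7) };
        int j = 0;
        System.out.println(hits[j].doc); // 35433 - the id explain() expects
        System.out.println(j);           // 0     - just the rank, wrong id
    }
}
```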
But I still have a question regarding the fieldNorm -
When I have 60000 documents like these in the index and run this query, the
document I am looking for comes up in third place instead of first.
So I searched for the word TTD, for which I build a boolean query
(for finlin: 6621468 * 6, 5265266 * 12):
String word = "TTD";
String worlds = "6621468";
DoubleMap wordsWorldsFreqMap = new DoubleMap();
wordsWorldsFreqMap.insert("TTD", 6621468, 3.0);
BooleanQuery bq = new BooleanQuery();
String[] splitWorlds = worlds.split(" ");
// loop over the worlds and, for each world, take the boost from the map
for(int i = 0; i < splitWorlds.length; i++)
{
    double freq = wordsWorldsFreqMap.getMap().get(word).get(Long.parseLong(splitWorlds[i]));
    BoostingTermQuery tq = new BoostingTermQuery(new Term(fieldName, splitWorlds[i]));
    tq.setBoost((float) freq);
    bq.add(tq, BooleanClause.Occur.SHOULD);
}
IndexSearcher searcher = new IndexSearcher("WordIndex");
searcher.setSimilarity(new WordsSimilarity());
TopDocCollector collector = new TopDocCollector(30);
(The index was created using the code I sent in previous emails. It gave me the
desired results when I put a small number of documents (words) in the index.)
The output is the following:
*finlin, score: 19.366615*
19.366615 = (MATCH) fieldWeight(worlds:6621468^3.0 in 35433), product of:
4.2426405 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
6.0 = scorePayload(...)
7.3036084 = idf(worlds: 6621468=110)
0.625 = fieldNorm(field=worlds, doc=35433)
*TTD, score: 15.493294*
15.493293 = (MATCH) fieldWeight(worlds:6621468^3.0 in 20), product of:
2.1213202 = (MATCH) btq, product of:
0.70710677 = tf(phraseFreq=0.5)
3.0 = scorePayload(...)
7.3036084 = idf(worlds: 6621468=110)
1.0 = fieldNorm(field=worlds, doc=20)
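By the way, I checked the arithmetic of the explain output with a small Lucene-free snippet (the numbers below are copied from the output above), and the products do line up as fieldWeight = (tf * scorePayload) * idf * fieldNorm:

```java
// Sanity check of the explain output above:
// fieldWeight = btq * idf * fieldNorm, where btq = tf * scorePayload.
class ScoreCheck
{
    public static float fieldWeight(float tf, float payload, float idf, float fieldNorm)
    {
        float btq = tf * payload;     // the "btq, product of" line
        return btq * idf * fieldNorm; // the "fieldWeight ... product of" line
    }

    public static void main(String[] args)
    {
        // finlin, doc 35433: tf=0.70710677, payload=6.0, idf=7.3036084, norm=0.625
        System.out.println(fieldWeight(0.70710677f, 6.0f, 7.3036084f, 0.625f));
        // TTD, doc 20: same tf and idf, payload=3.0, norm=1.0
        System.out.println(fieldWeight(0.70710677f, 3.0f, 7.3036084f, 1.0f));
    }
}
```

The first product comes out at about 19.3666 and the second at about 15.4933, matching the two scores above up to float rounding.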
Can anyone explain the highlighted parts of the score?
I read all the explanations in the API docs and a lot of threads about
scoring, but didn't really understand these factors.
Why is it doc 35433 for finlin but doc 20 for TTD?
I didn't set any fieldNorm or document norm myself.
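If I understand the javadocs correctly (please correct me if I'm wrong), when you never set norms yourself, fieldNorm comes from Similarity.lengthNorm, which by default is 1/sqrt(number of terms in the field), and it is then stored in a single byte, so values get rounded down to steps like 1.0, 0.875, 0.75, 0.625. A Lucene-free sketch of that default formula (the one-byte rounding part is my reading of the docs, not something this snippet computes):

```java
// Default Lucene lengthNorm: 1 / sqrt(number of terms in the field).
// Assumption on my side: WordsSimilarity does not override lengthNorm,
// so the DefaultSimilarity formula applies to the "worlds" field.
class NormSketch
{
    public static float lengthNorm(int numTerms)
    {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args)
    {
        System.out.println(lengthNorm(1)); // 1.0 - survives the one-byte encoding
        System.out.println(lengthNorm(2)); // 0.70710677 - the byte encoding
                                           // truncates this down to 0.625
    }
}
```

So a fieldNorm of 0.625 would suggest the worlds field of doc 35433 has two terms while doc 20 has one, which might be the freq-expansion from getWorldIds at work.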
Thanks a lot,
Liat
2009/5/7 Grant Ingersoll <[email protected]>
> Hi Liat,
>
> Can you post the code you are using to generate the info below?
>
> -Grant
>
>
> On May 3, 2009, at 11:43 PM, liat oren wrote:
>
> I looked into the output again, and saw that the explain method explains a
>> different result than the document I thought it did.
>>
>> Within the loop of the results, I replaced
>> int docId = hits[j].doc;
>> Document curDoc = searcher.doc(docId);
>> with
>> Document curDoc = searcher.doc(j);
>> So I got the right explain for the document.
>>
>> The strange things I got are:
>> 1. The explain is much shorter, as you can see below.
>> 2. The score of finlin (1.6479614) is different from the one in the explain
>> (0.34333253).
>> 3. I think it is because of the fieldNorm. Why is it different from the one
>> of TTD?
>>
>> finlin, score: 1.6479614
>> 0.3433253 = (MATCH) fieldWeight(worlds:666666 in 0), product of:
>> 0.70710677 = (MATCH) btq, product of:
>> 0.70710677 = tf(phraseFreq=0.5)
>> 1.0 = scorePayload(...)
>> 0.7768564 = idf(worlds: 666666=4)
>> 0.625 = fieldNorm(field=worlds, doc=0)
>>
>> TTD, score: 1.6479614
>> 1.6479613 = (MATCH) fieldWeight(worlds:666666 in 1), product of:
>> 2.1213202 = (MATCH) btq, product of:
>> 0.70710677 = tf(phraseFreq=0.5)
>> 3.0 = scorePayload(...)
>> 0.7768564 = idf(worlds: 666666=4)
>> 1.0 = fieldNorm(field=worlds, doc=1)
>>
>> Thanks again,
>> Liat
>>
>>
>>
>> 2009/5/3 liat oren <[email protected]>
>>
>> Hi,
>>>
>>> I am trying to debug a boosting query.
>>> Is there a way to see the term boosts in the documents? I can see them in
>>> spans
>>> in BoostingTermQuery, yet from there I can't tell which document I am in.
>>> If I want to copy some of the documents into an index that stores the
>>> boosting
>>> - how can it be done?
>>>
>>> The problem I am facing is that I get unexpected results - if for word
>>> "a",
>>> I have the worlds "1111" (boost 3) and "2222", and for word "b" I have
>>> the
>>> world "1111", then when I search for "1111" (boost 5), word "a" gets
>>> better results.
>>>
>>> When I debugged it, I saw that the boosting is always three, but since in
>>> the index I have a lot of documents, I tried to do the same on a smaller
>>> index.
>>>
>>> I put only two words as you can see in the code below (I put all the
>>> methods and classes needed to run this code).
>>>
>>> The problem I saw here is the scorePayload in the explain output - it
>>> took
>>> a different value from the one I indexed.
>>> You can see below the output - for TTD, 1.0 = scorePayload(...)
>>> and for finlin, 3.0 = scorePayload(...)
>>> while the boosting I used was the opposite - for TTD I used 3 and for
>>> finlin I used 1.
>>>
>>> The scorePayload should be the factor I put when I indexed, right?
>>>
>>> Thanks a lot,
>>> Liat
>>>
>>> TTD, score: 1.2611988
>>>
>>> 0.26274973 = (MATCH) weight(worlds:666666 in 0), product of:
>>> 0.99999994 = queryWeight(worlds:666666), product of:
>>> 0.5945349 = idf(worlds: 666666=2)
>>> 1.681987 = queryNorm
>>> 0.26274976 = (MATCH) fieldWeight(worlds:666666 in 0), product of:
>>> 0.70710677 = (MATCH) btq, product of:
>>> 0.70710677 = tf(phraseFreq=0.5)
>>> 1.0 = scorePayload(...)
>>> 0.5945349 = idf(worlds: 666666=2)
>>> 0.625 = fieldNorm(field=worlds, doc=0)
>>> ********************************************************
>>> finlin, score: 0.26274976
>>>
>>> 1.2611988 = (MATCH) weight(worlds:666666 in 1), product of:
>>> 0.99999994 = queryWeight(worlds:666666), product of:
>>> 0.5945349 = idf(worlds: 666666=2)
>>> 1.681987 = queryNorm
>>> 1.2611989 = (MATCH) fieldWeight(worlds:666666 in 1), product of:
>>> 2.1213202 = (MATCH) btq, product of:
>>> 0.70710677 = tf(phraseFreq=0.5)
>>> 3.0 = scorePayload(...)
>>> 0.5945349 = idf(worlds: 666666=2)
>>> 1.0 = fieldNorm(field=worlds, doc=1)
>>>
>>> *The code*
>>> **
>>> public class Test
>>> {
>>> public Test()
>>> {
>>> }
>>> public static void main(String[] args) throws IOException, Exception
>>> {
>>> Test st = new Test();
>>> st.index(); //
>>> st.testRealIndex();
>>> }
>>> public void index() throws IOException
>>> {
>>> DoubleMap wordMap = new DoubleMap();
>>> wordMap.insert("TTD", 666666, 3);
>>> wordMap.insert("finlin", 666666, 1);
>>> wordMap.insert("finlin", 222222, 2);
>>> index(wordMap, "wordIndexTry", "", "0");
>>> }
>>> public synchronized void index(DoubleMap doubleMap, String dirPath,
>>> String
>>> originalPath, String includeFreq) throws IOException
>>> {
>>> File f = new File(dirPath);
>>> IndexWriter writer = null;
>>> PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>>> if(f.exists())
>>> {
>>> writer = new IndexWriter(dirPath, panalyzer, false);
>>> }
>>> else
>>> {
>>> writer = new IndexWriter(dirPath, panalyzer, true);
>>> }
>>> Iterator it = doubleMap.getMap().entrySet().iterator();
>>> int count = 0;
>>> int size = doubleMap.getMap().size();
>>> while(it.hasNext())
>>> {
>>> count++;
>>> Map.Entry entry = (Map.Entry) it.next();
>>> String word = entry.getKey().toString();
>>> Word w = new Word();
>>> w.word = word;
>>> Date date = new Date();
>>> System.out.println(date.toString() + " : Updating word " + word + " ( "
>>> + count + " out of " + size + ") " + " FROM " + originalPath);
>>> Map<Long, Double> innerMap = (Map<Long, Double>) entry.getValue();
>>> Map<String, Integer> scoresMap = processMap(writer, panalyzer, innerMap,
>>> entry, w, dirPath, includeFreq);
>>> index(writer, panalyzer, innerMap, scoresMap, w, dirPath, includeFreq);
>>> }
>>> System.out.println("Optimizing " + dirPath + " ...");
>>> writer.optimize();
>>> writer.close();
>>> }
>>> public synchronized Map<String, Integer> processMap(IndexWriter writer,
>>> PayloadAnalyzer panalyzer, Map<Long, Double> innerMap, Map.Entry entry,
>>> Word
>>> w, String dirPath, String includeFreq) throws IOException
>>> {
>>> Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>>> Iterator worldsIter = innerMap.entrySet().iterator();
>>> String worlds = "";
>>> synchronized(worldsIter)
>>> {
>>> while(worldsIter.hasNext())
>>> {
>>> Map.Entry worldsEntry = (Map.Entry) worldsIter.next();
>>> String world = worldsEntry.getKey().toString();
>>> int freq = (int) Double.parseDouble(worldsEntry.getValue().toString());
>>> scoresMap.put(world, freq);
>>> worlds += world + " ";
>>> FileUtil.writeToFile("Output\\WordWorldsFreq.txt", w.word +
>>> Constants.TAB_SEP + world + Constants.TAB_SEP + freq);
>>> }
>>> }
>>> panalyzer.setMapScores(scoresMap);
>>> //MapUtil.copyStringIntMap(scoresMap));
>>> return scoresMap;
>>> }
>>> public synchronized void index(IndexWriter writer, PayloadAnalyzer
>>> panalyzer, Map<Long, Double> innerMap, Map<String, Integer> scoresMap,
>>> Word
>>> w, String dirPath, String includeFreq) throws IOException
>>> {
>>> System.out.println("indexing");
>>> w.worldsMap = innerMap;
>>> WordIndex wi = new WordIndex(w);
>>> wi.createDocument(includeFreq);
>>> writer.addDocument(wi.getDocument());
>>> }
>>> public void testRealIndex() throws IOException
>>> {
>>> String word = "TTD";
>>> String worlds = "666666";
>>> DoubleMap wordsWorldsFreqMap = new DoubleMap();
>>> wordsWorldsFreqMap.insert("TTD", 666666, 1.0);
>>> BoostingBooleanQueryParser bbqp = new BoostingBooleanQueryParser();
>>> BooleanQuery bq = bbqp.parse(word, worlds, wordsWorldsFreqMap,
>>> "worlds");
>>> IndexSearcher searcher = new IndexSearcher("wordIndexTry");
>>> //D:\\PaiDatabase\\Indexes\\WordIndex");
>>> searcher.setSimilarity(new WordsSimilarity());
>>> TopDocCollector collector = new TopDocCollector(30);
>>> searcher.search(bq, collector);
>>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>>> for(int j = 0; j < Math.min(hits.length, 10); j++)
>>> {
>>> int docId = hits[j].doc;
>>> Document curDoc = searcher.doc(docId);
>>> System.out.println(curDoc.getField("word").stringValue() + ", score: " +
>>> hits[j].score);
>>> Explanation explanation = searcher.explain(bq, j);
>>> System.out.println(explanation.toString());
>>> String sym = curDoc.getField("word").stringValue();
>>> }
>>> }
>>> public abstract class Index
>>> {
>>> protected Document doc = new Document();
>>> public Index()
>>> {
>>> }
>>> public Document getDocument()
>>> {
>>> return doc;
>>> }
>>> public void setDocument(Document d)
>>> {
>>> this.doc = d;
>>> }
>>> }
>>> public class WordIndex extends Index
>>> {
>>> protected Word w;
>>> public String FIELD_WORD = "word";
>>> public String FIELD_WORLDS = "worlds";
>>> public WordIndex(Word w)
>>> {
>>> this.w = w;
>>> }
>>> public void createDocument(String includeFreq) throws
>>> java.io.FileNotFoundException
>>> {
>>> // make a new, empty document
>>> doc = new Document();
>>> doc.add(new Field(FIELD_WORD, w.word, Field.Store.YES,
>>> Field.Index.NOT_ANALYZED));
>>> doc.add(new Field(FIELD_WORLDS,
>>> String.valueOf(w.getWorldIds(includeFreq)), Field.Store.YES,
>>> Field.Index.ANALYZED, Field.TermVector.YES));
>>> }
>>> public Document getDoc(String word, String indexPath) throws IOException
>>> {
>>> IndexSearcher mapSearcher = new IndexSearcher(indexPath);
>>> TermQuery mapQuery = new TermQuery(new Term(FIELD_WORD, word));
>>> Hits mapHits = mapSearcher.search(mapQuery);
>>> if(mapHits.length() != 0)
>>> {
>>> Document doc = mapHits.doc(0);
>>> return doc;
>>> }
>>> return null;
>>> }
>>> }
>>> public class Word
>>> {
>>> public String word;
>>> public Map<Long, Double> worldsMap = new HashMap<Long, Double>();
>>> public Word()
>>> {
>>> }
>>> public String getWorldIds(String includeFreq)
>>> {
>>> String worlds = "";
>>> Iterator iter = worldsMap.entrySet().iterator();
>>> while(iter.hasNext())
>>> {
>>> Map.Entry entry = (Map.Entry) iter.next();
>>> if(includeFreq.equals("1"))
>>> {
>>> int freq = (int) Double.parseDouble(entry.getValue().toString());
>>> for(int i = 0; i < freq; i++)
>>> {
>>> worlds += entry.getKey().toString() + " ";
>>> }
>>> }
>>> else
>>> {
>>> worlds += entry.getKey().toString() + " ";
>>> }
>>> }
>>> return worlds;
>>> }
>>> }
>>> public class DoubleMap
>>> {
>>> private Map<String, Map<Long, Double>> map;
>>> public Map<String, String> worldsListMap = new HashMap<String,
>>> String>();
>>> public List<String> entriesList = new ArrayList<String>();
>>> public DoubleMap()
>>> {
>>> map = new HashMap<String, Map<Long, Double>>();
>>> }
>>> public void insert(String word, long worldId, double beta)
>>> {
>>> if(map.get(word) != null)
>>> {
>>> Map<Long, Double> innerMap = map.get(word);
>>> if(innerMap.get(worldId) != null)
>>> {
>>> return;
>>> }
>>> innerMap.put(worldId, beta);
>>> map.put(word, innerMap);
>>> }
>>> else
>>> {
>>> Map<Long, Double> innerMap = new HashMap<Long, Double>();
>>> innerMap.put(worldId, beta);
>>> map.put(word, innerMap);
>>> }
>>> }
>>> public void insert(String word, long worldId, double beta, int size)
>>> {
>>> if(map.get(word) != null)
>>> {
>>> Map<Long, Double> innerMap = map.get(word);
>>> if(innerMap.get(worldId) != null)
>>> {
>>> return;
>>> }
>>> if(innerMap.size() == size)
>>> {
>>> Iterator iter = innerMap.entrySet().iterator();
>>> int count = 0;
>>> while(iter.hasNext())
>>> {
>>> Map.Entry entry = (Map.Entry) iter.next();
>>> count++;
>>> }
>>> System.out.println(count);
>>> long minWorldId = getMinItem(innerMap);
>>> innerMap.remove(minWorldId);
>>> }
>>> innerMap.put(worldId, beta);
>>> map.put(word, innerMap);
>>> }
>>> else
>>> {
>>> Map<Long, Double> innerMap = new HashMap<Long, Double>();
>>> innerMap.put(worldId, beta);
>>> map.put(word, innerMap);
>>> }
>>> }
>>> // returns the smallest worldId in the map
>>> private long getMinItem(Map<Long, Double> innerMap)
>>> {
>>> Iterator it = innerMap.entrySet().iterator();
>>> long minWorldId = -1;
>>> while(it.hasNext())
>>> {
>>> Map.Entry entry = (Map.Entry) it.next();
>>> long worldId = Long.parseLong(entry.getKey().toString());
>>> if(minWorldId == -1 || worldId < minWorldId)
>>> {
>>> minWorldId = worldId;
>>> }
>>> }
>>> return minWorldId;
>>> }
>>> public Map<String, Map<Long, Double>> getMap()
>>> {
>>> return map;
>>> }
>>> }
>>> public class BoostingBooleanQueryParser
>>> {
>>> public BoostingBooleanQueryParser()
>>> {
>>> }
>>> public BooleanQuery parse(String word, String worlds, DoubleMap
>>> wordsWorldsFreqMap, String fieldName) throws IOException
>>> {
>>> BooleanQuery bq = new BooleanQuery();
>>> String[] splitWorlds = worlds.split(" ");
>>> for(int i = 0; i < splitWorlds.length; i++)
>>> {
>>> double freq =
>>>
>>> wordsWorldsFreqMap.getMap().get(word).get(Long.parseLong(splitWorlds[i]));
>>> BoostingTermQuery tq = new BoostingTermQuery(new Term(fieldName,
>>> splitWorlds[i]));
>>> tq.setBoost((float) freq);
>>> bq.add(tq, BooleanClause.Occur.SHOULD);
>>> }
>>> return bq;
>>> }
>>> }
>>> public class PayloadAnalyzer extends Analyzer
>>> {
>>> private PayloadTokenStream payToken = null;
>>> private int score;
>>> private Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>>> public synchronized void setScore(int s)
>>> {
>>> score = s;
>>> }
>>> public synchronized void setMapScores(Map<String, Integer> scoresMap)
>>> {
>>> this.scoresMap = scoresMap;
>>> }
>>> public final TokenStream tokenStream(String field, Reader reader)
>>> {
>>> payToken = new PayloadTokenStream(new WhitespaceTokenizer(reader));
>>> //new LowerCaseTokenizer(reader));
>>> payToken.setScore(score);
>>> payToken.setMapScores(scoresMap);
>>> return payToken;
>>> }
>>> }
>>> public class PayloadTokenStream extends TokenStream
>>> {
>>> private Tokenizer tok = null;
>>> private int score;
>>> private Map<String, Integer> scoresMap = new HashMap<String, Integer>();
>>> public PayloadTokenStream(Tokenizer tokenizer)
>>> {
>>> tok = tokenizer;
>>> }
>>> public void setScore(int s)
>>> {
>>> score = s;
>>> }
>>> public synchronized void setMapScores(Map<String, Integer> scoresMap)
>>> {
>>> this.scoresMap = scoresMap;
>>> }
>>> public Token next(Token t) throws IOException
>>> {
>>> t = tok.next(t);
>>> if(t != null)
>>> {
>>> String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
>>> // check for null before calling equals, and make sure the map actually
>>> // has an entry for this term, otherwise unboxing throws an NPE
>>> if(word != null && !word.equals("") && scoresMap.containsKey(word))
>>> {
>>> int score = scoresMap.get(word);
>>> if(score > 127)
>>> {
>>> score = 127; // a single payload byte can hold at most 127 here
>>> }
>>> t.setPayload(new Payload(new byte[] { (byte) score }));
>>> }
>>> }
>>> return t;
>>> }
>>> public void reset(Reader input) throws IOException
>>> {
>>> tok.reset(input);
>>> }
>>> public void close() throws IOException
>>> {
>>> tok.close();
>>> }
>>> }
>>> }
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>