Yes, this is more or less what I had in mind. However, this approach
requires some *prior knowledge* of the vocabulary of the document (or
the collection) to produce that score before the document even gets
analyzed, doesn't it? That is the paradox I have been thinking about. If
you have that knowledge, fine. Also, for applications that only need a
small term window to generate a score (such as a term-in-context score),
this can be implemented very easily.
It is possible to inject the document-dependent boost/score generation
*logic* (an interface would do) into the Tokenizer/TokenStream. However, I
am afraid this may carry an indexing-time penalty. If your window size is
the whole document, you will be doing the same job twice (counting the
number of times a term occurs in doc X, computing index-time weights,
etc.). IndexWriter already does all of this somewhere down deep.
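Just to make the injection idea concrete, here is a rough sketch (the names below are made up for illustration; plain Java, no Lucene involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical callback the Tokenizer/TokenStream would consult per term.
interface ScoreProvider {
    int scoreFor(String term);
}

public class ScoreProviderSketch {
    // Stand-in for the analysis loop: split on whitespace and ask the
    // injected provider for each term's document-dependent score.
    static Map<String, Integer> scoreTerms(String text, ScoreProvider provider) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        for (String term : text.split("\\s+")) {
            result.put(term, provider.scoreFor(term));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy provider: the score is just the term length.
        ScoreProvider byLength = new ScoreProvider() {
            public int scoreFor(String term) { return term.length(); }
        };
        System.out.println(scoreTerms("word1 longerword", byLength));
        // prints {word1=5, longerword=10}
    }
}
```

The point is only that the scoring rule is swappable behind the interface; the stream itself stays generic.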
Simply put, I want to attach scores to documents/terms, but I cannot
generate those scores before I have observed the document/terms. If I do
it in the analyzer, I end up replicating work that IndexWriter is already
doing.
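To make the duplication concrete: with a document-sized window, the analysis side would have to recompute per-document term frequencies that IndexWriter derives anyway while inverting the document. Roughly (plain Java, no Lucene):

```java
import java.util.HashMap;
import java.util.Map;

public class TermFreqSketch {
    // Count how many times each term occurs in one document -- exactly the
    // kind of statistic the indexer already gathers for its own weights.
    static Map<String, Integer> termFrequencies(String doc) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : doc.toLowerCase().split("\\s+")) {
            Integer n = tf.get(term);
            tf.put(term, n == null ? 1 : n + 1);
        }
        return tf;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("word1 word2 word1"));
    }
}
```

Doing this in a TokenStream on top of indexing means every document is effectively tokenized and counted twice.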
If I remember correctly, there is also some intention to add
document-payload functionality, and I have the same concerns about it. So
I think we need a clear view on the topic. Where is the payload work
heading? How can we generate a score without duplicating work that
IndexWriter is already doing? I believe Michael Busch is working on
document payloads for release 3.0. I would appreciate it if someone could
enlighten us on how that would work and how it could be utilised,
particularly during the analysis phase.
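One small practical point in the meantime: whatever form the payload work takes, the score still has to be serialised into bytes. A sketch of packing an int score into four bytes and back (plain Java, Lucene-agnostic; it also avoids the single-byte limit in the code below):

```java
public class PayloadCodec {
    // Encode an int score big-endian into a 4-byte payload.
    static byte[] encode(int score) {
        return new byte[] {
            (byte) (score >>> 24), (byte) (score >>> 16),
            (byte) (score >>> 8),  (byte) score
        };
    }

    // Decode the 4-byte payload back into an int.
    static int decode(byte[] payload) {
        return ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
             | ((payload[2] & 0xFF) << 8)  |  (payload[3] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(PayloadCodec.decode(PayloadCodec.encode(34))); // prints 34
    }
}
```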
Cheers,
Murat
> Thanks, Murat.
> It was very useful. I also tried to override IndexWriter and
> DocumentsWriter instead, but it didn't work well; DocumentsWriter
> can't be overridden.
>
> So I didn't find a better way to make the changes.
>
> What I need is for every term to have a different value in different
> documents. So, just as you set the boost at the document level, I would
> like to set the boost for individual terms within individual documents.
>
> To that end, I made some changes to the code you sent (the changed
> lines are marked with asterisks):
>
> Below you can find an example of its use.
>
> **********
> private class PayloadAnalyzer extends Analyzer
> {
>     private PayloadTokenStream payToken = null;
>     private int score;
>     *private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
>
>     public synchronized void setScore(int s)
>     {
>         score = s;
>     }
>
>     *public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>         this.scoresMap = scoresMap;
>     }*
>
>     public final TokenStream tokenStream(String field, Reader reader)
>     {
>         payToken = new PayloadTokenStream(
>                 new WhitespaceTokenizer(reader)); // was: new LowerCaseTokenizer(reader)
>         payToken.setScore(score);
>         payToken.setMapScores(scoresMap);
>         return payToken;
>     }
> }
>
> private class PayloadTokenStream extends TokenStream
> {
>     private Tokenizer tok = null;
>     private int score;
>     *private Map<String, Integer> scoresMap = new HashMap<String, Integer>();*
>
>     public PayloadTokenStream(Tokenizer tokenizer)
>     {
>         tok = tokenizer;
>     }
>
>     public void setScore(int s)
>     {
>         score = s;
>     }
>
>     *public synchronized void setMapScores(Map<String, Integer> scoresMap)
>     {
>         this.scoresMap = scoresMap;
>     }*
>
>     public Token next(Token t) throws IOException
>     {
>         t = tok.next(t);
>         if (t != null)
>         {
>             // Look up the per-document score for this term and store it
>             // as a single-byte payload.
>             *String word = String.copyValueOf(t.termBuffer(), 0, t.termLength());
>             Integer termScore = scoresMap.get(word);
>             // Guard against terms missing from the map (this would NPE
>             // otherwise); cast instead of going through a String.
>             byte payload = (termScore != null) ? termScore.byteValue() : 0;
>             t.setPayload(new Payload(new byte[] { payload }));*
>         }
>         return t;
>     }
>
>     public void reset(Reader input) throws IOException
>     {
>         tok.reset(input);
>     }
>
>     public void close() throws IOException
>     {
>         tok.close();
>     }
> }
> **********************************
> *Example of the use of payloads:*
>
> PayloadAnalyzer panalyzer = new PayloadAnalyzer();
> File index = new File("TestSearchIndex");
> IndexWriter iwriter = new IndexWriter(index, panalyzer,
>         IndexWriter.MaxFieldLength.UNLIMITED);
>
> Document d = new Document();
> d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>         Field.Index.TOKENIZED, Field.TermVector.YES));
> d.add(new Field("id", "1^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
>         Field.TermVector.NO));
> Map<String, Integer> mapScores = new HashMap<String, Integer>();
> mapScores.put("word1", 3);
> mapScores.put("word2", 1);
> mapScores.put("word3", 1);
> panalyzer.setMapScores(mapScores);
> iwriter.addDocument(d, panalyzer);
>
> d = new Document();
> d.add(new Field("text", "word1 word2 word3", Field.Store.YES,
>         Field.Index.TOKENIZED, Field.TermVector.YES));
> d.add(new Field("id", "2^3", Field.Store.YES, Field.Index.UN_TOKENIZED,
>         Field.TermVector.NO));
> // We set the scores for the terms of the document that will be analyzed.
> /* I was worried about this part - document-dependent score
>    which may be utilized */
> mapScores = new HashMap<String, Integer>();
> mapScores.put("word1", 1);
> mapScores.put("word2", 3);
> mapScores.put("word3", 1);
> panalyzer.setMapScores(mapScores);
> iwriter.addDocument(d, panalyzer);
> /*-----------------*/
>
> // iwriter.commit();
> iwriter.optimize();
> iwriter.close();
>
> BooleanQuery bq = new BooleanQuery();
> BoostingTermQuery tq = new BoostingTermQuery(new Term("text", "word1"));
> tq.setBoost(1.0f);
> bq.add(tq, BooleanClause.Occur.MUST);
> tq = new BoostingTermQuery(new Term("text", "word2"));
> tq.setBoost(3.0f);
> bq.add(tq, BooleanClause.Occur.SHOULD);
> tq = new BoostingTermQuery(new Term("text", "word3"));
> tq.setBoost(1.0f);
> bq.add(tq, BooleanClause.Occur.SHOULD);
>
> IndexSearcher searcher1 = new IndexSearcher("TestSearchIndex");
> searcher1.setSimilarity(new WordsSimilarity());
> TopDocs topDocs = searcher1.search(bq, null, 3);
> Hits hits1 = searcher1.search(bq);
> for (int j = 0; j < hits1.length(); j++)
> {
>     // explain() expects a document id, not the hit index
>     Explanation explanation = searcher1.explain(bq, hits1.id(j));
>     System.out.println("**** " + hits1.score(j) + " " +
>             hits1.doc(j).getField("id").stringValue() + " *****");
>     System.out.println(explanation.toString());
>     System.out.println("********************************************************");
>     System.out.println("Score " + topDocs.scoreDocs[j].score + " doc " +
>             searcher1.doc(topDocs.scoreDocs[j].doc).getField("id").stringValue());
>     System.out.println("********************************************************");
> }
>
> If you try the same query with different boosts, you will get a
> different order for the documents.
>
> Does it look ok?
>
> Thanks again!
> Liat
> 2009/4/25 Murat Yakici <[email protected]>
>
>>
>>
>> Here is what I am doing, not so magical... There are two classes: an
>> analyzer and a TokenStream into which I can inject my
>> document-dependent data to be stored as payloads.
>>
>>
>> private PayloadAnalyzer panalyzer = new PayloadAnalyzer();
>>
>> private class PayloadAnalyzer extends Analyzer {
>>
>> private PayloadTokenStream payToken = null;
>> private int score;
>>
>> public synchronized void setScore(int s) {
>> score=s;
>> }
>>
>> public final TokenStream tokenStream(String field, Reader reader) {
>> payToken = new PayloadTokenStream(new
>> LowerCaseTokenizer(reader));
>> payToken.setScore(score);
>> return payToken;
>> }
>> }
>>
>> private class PayloadTokenStream extends TokenStream {
>>
>> private Tokenizer tok = null;
>> private int score;
>>
>> public PayloadTokenStream(Tokenizer tokenizer) {
>> tok = tokenizer;
>> }
>>
>> public void setScore(int s) {
>> score = s;
>> }
>>
>> public Token next(Token t) throws IOException {
>> t = tok.next(t);
>> if (t != null) {
>> //t.setTermBuffer("can change");
>> //Do something with the data
>> byte[] bytes = ("score:"+ score).getBytes();
>> t.setPayload(new Payload(bytes));
>> }
>> return t;
>> }
>>
>> public void reset(Reader input) throws IOException {
>> tok.reset(input);
>> }
>>
>> public void close() throws IOException {
>> tok.close();
>> }
>> }
>>
>>
>> public void doIndex() {
>> try {
>> File index = new File("./TestPayloadIndex");
>> IndexWriter iwriter = new IndexWriter(index,
>> panalyzer,
>> IndexWriter.MaxFieldLength.UNLIMITED);
>>
>> Document d = new Document();
>> d.add(new Field("content",
>> "Everyone, someone, myTerm, yourTerm", Field.Store.YES,
>> Field.Index.ANALYZED, Field.TermVector.YES));
>> //We set the score for the term of the document that will be
>> analyzed.
>> /*I was worried about this part - document dependent score
>> which may be utilized*/
>> panalyzer.setScore(5);
>> iwriter.addDocument(d, panalyzer);
>> /*-----------------*/
>> ...
>> iwriter.commit();
>> iwriter.optimize();
>> iwriter.close();
>>
>> //Now read the index
>> IndexReader ireader = IndexReader.open(index);
>> TermPositions tpos = ireader.termPositions(
>>         new Term("content", "myterm")); // note LowerCaseTokenizer
>> while (tpos.next()) {
>>     int pos;
>>     for (int i = 0; i < tpos.freq(); i++) {
>>         pos = tpos.nextPosition();
>>         if (tpos.isPayloadAvailable()) {
>>             byte[] data = new byte[tpos.getPayloadLength()];
>>             tpos.getPayload(data, 0);
>>             // Utilise payloads
>>         }
>>     }
>> }
>>
>> tpos.close();
>> } catch (CorruptIndexException ex) {
>> //
>> } catch (LockObtainFailedException ex) {
>> //
>> } catch (IOException ex) {
>> //
>> }
>> }
>>
>> I wish it was designed better... Please let me know if you guys have a
>> better idea.
>>
>> Cheers,
>> Murat
>>
>> > Dear Murat,
>> >
>> > I saw your question and wondered how you implemented these changes.
>> > The requirements below are the same ones I am trying to code now.
>> > Did you modify the source code itself, or did you only use Lucene's
>> > jar and override code?
>> >
>> > I would very much appreciate it if you could give me a short
>> > explanation of how it was done.
>> >
>> > Thanks a lot,
>> > Liat
>> >
>> > 2009/4/21 Murat Yakici <[email protected]>
>> >
>> >> Hi,
>> >> I started playing with the experimental payload functionality. I have
>> >> written an analyzer which adds a payload (some sort of a score/boost)
>> >> for each term occurrence. The payload/score for each term depends on
>> >> the document that the term comes from (I guess this is the typical
>> >> use case). So, say, term t1 may have a payload of 5 in doc1 and 34 in
>> >> doc5. The parameter for calculating the payload changes after each
>> >> indexWriter.addDocument(..) call in a while loop. I am assuming that
>> >> the indexWriter.addDocument(..) methods are thread-safe. Can I
>> >> confirm this?
>> >>
>> >> Cheers,
>> >>
>> >> --
>> >> Murat Yakici
>> >> Department of Computer & Information Sciences
>> >> University of Strathclyde
>> >> Glasgow, UK
>> >> -------------------------------------------
>> >> The University of Strathclyde is a charitable body, registered in
>> >> Scotland,
>> >> with registration number SC015263.
>> >>
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>> >>
>> >
>>
>>
>>
>
Murat