[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783010#action_12783010
 ] 

Thomas D'Silva commented on LUCENE-1910:
----------------------------------------

Mark,

I refactored the code so that the tag and document probabilities are computed 
during the index creation phase and used to find the most important document 
terms for each tag term. These terms (ranked by information gain) are stored as 
meta information in the index when it is created. I added a class, 
TagIndexWriter, which extends IndexWriter and creates an index that can be used 
to run MoreLikeThisUsingTags queries. 
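
For readers unfamiliar with the ranking metric, here is a minimal sketch of how 
the information gain of a document term with respect to a tag could be computed 
from document counts. It is not the code in the patch: the class name 
InfoGainSketch, the method names, and the choice of base-2 logarithms are all 
assumptions made for illustration.

```java
/** Hypothetical helper (not part of the patch): information gain from document counts. */
public final class InfoGainSketch {
    private InfoGainSketch() {}

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Binary entropy, in bits, of an event with probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a certain outcome carries no uncertainty
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    /**
     * IG(tag; term) = H(tag) - [P(term) * H(tag | term) + P(no term) * H(tag | no term)].
     *
     * @param numDocs      total number of documents
     * @param docsWithTag  documents carrying the tag
     * @param docsWithTerm documents containing the term
     * @param docsWithBoth documents carrying the tag and containing the term
     */
    public static double informationGain(long numDocs, long docsWithTag,
                                         long docsWithTerm, long docsWithBoth) {
        long docsWithoutTerm = numDocs - docsWithTerm;
        double pTag = (double) docsWithTag / numDocs;
        double pTerm = (double) docsWithTerm / numDocs;
        double pTagGivenTerm = docsWithTerm == 0
                ? 0.0 : (double) docsWithBoth / docsWithTerm;
        double pTagGivenNoTerm = docsWithoutTerm == 0
                ? 0.0 : (double) (docsWithTag - docsWithBoth) / docsWithoutTerm;
        double conditionalEntropy = pTerm * entropy(pTagGivenTerm)
                + (1.0 - pTerm) * entropy(pTagGivenNoTerm);
        return entropy(pTag) - conditionalEntropy;
    }
}
```

Under these assumptions, a term that appears in exactly the tagged documents 
gets the maximum score (the full entropy of the tag), while a term that is 
spread independently of the tag scores zero.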

I recreated a test index with one million documents and assigned tags 
(tag_0, ..., tag_4) to 10%, 20%, and so on of the documents. 

The time taken to generate a query on an index created using TagIndexWriter:

tag name   number of documents   time (ms)
tag_0      10134                 22
tag_1      19996                 29
tag_2      30010                  6
tag_3      39907                  6
tag_4      50148                  9

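Since the document terms corresponding to a tag term are computed during the 
indexing phase, the time taken to generate a MoreLikeThisUsingTags query is 
roughly constant: it stays between 6 and 29 ms in the measurements above and 
does not grow with the number of documents carrying the tag. 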
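
To illustrate why the query time no longer depends on how many documents carry 
a tag, here is a minimal sketch of the precompute-then-look-up idea using a 
plain in-memory map. It does not show how TagIndexWriter stores the terms in the 
index; the class TagTermTable and its methods are hypothetical names introduced 
only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the per-tag meta information kept with the index. */
public final class TagTermTable {
    private final Map<String, List<String>> topTermsByTag = new HashMap<>();

    /** Index time: rank the terms by score and keep the k best for this tag. */
    public void store(String tag, Map<String, Double> scoreByTerm, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoreByTerm.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        topTermsByTag.put(tag, top);
    }

    /** Query time: a single map lookup, independent of how many documents have the tag. */
    public List<String> termsFor(String tag) {
        return topTermsByTag.getOrDefault(tag, Collections.emptyList());
    }
}
```

The expensive ranking happens once per tag in store(); termsFor() only reads the 
stored list, which is the property the timings above reflect.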

Thanks,
Thomas

> Extension to MoreLikeThis to use tag information
> ------------------------------------------------
>
>                 Key: LUCENE-1910
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1910
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Thomas D'Silva
>            Priority: Minor
>
> I would like to contribute a class based on the MoreLikeThis class in
> contrib/queries that generates a query based on the tags associated
> with a document. The class assumes that documents are tagged with a
> set of tags (which are stored in the index in a separate Field). The
> class determines the top document terms associated with a given tag
> using the information gain metric.
> While generating a MoreLikeThis query for a document, the tags
> associated with the document are used to determine the terms in the
> query. This class is useful for finding documents similar to a document
> that does not have many relevant terms but was tagged.

