Hi Zoran,

In a nutshell, 'more like this' creates a large boolean disjunctive query of 
the 'max_query_terms' most interesting terms found in the text given in 
'like_text'. The interesting terms are picked according to their tf-idf 
scores in the whole corpus. This selection can be tuned with the 
'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The 
number of boolean clauses that must match is controlled by 
'percent_terms_to_match'. If only one field is specified in 'fields', the 
analyzer used to pick the terms from 'like_text' is the one associated 
with that field, unless overridden with 'analyzer'. So, as an example, the 
default is to create a boolean query of 25 interesting terms where only 
30% of the should clauses must match.
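Spelled out, those defaults correspond to something like the following query (a sketch in the ES 1.x query DSL; the field name and like_text are made up, and the numeric values shown are the documented 1.x defaults):

```json
{
  "query": {
    "more_like_this": {
      "fields": ["file"],
      "like_text": "the content to find similar documents for",
      "max_query_terms": 25,
      "percent_terms_to_match": 0.3,
      "min_term_freq": 2,
      "min_doc_freq": 5
    }
  }
}
```

Raising 'min_doc_freq' (or lowering 'max_doc_freq') narrows which terms count as interesting, which is usually where tuning starts.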

On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:
>
> Hi Alex,
>
>
> If you are looking for exact duplicates then hashing the file content, and 
> doing a search for that hash would do the job.
> This trick won't work for me, as these are not exact duplicates. For 
> example, I have 10 students working on the same 100-page-long Word 
> document. Each of these students could change only one sentence and upload 
> the document. The hash will be different, but it's 99.99% the same document. 
> I have another service that uses mlt_like_text to recommend relevant 
> documents, and my problem is that if this document has the best score, then 
> all duplicates will be among the top hits, and instead of recommending 
> several of the most relevant documents, I will recommend 10 instances of 
> the same document. 
>

Could you please define "relevant" in your setting? In a corpus of very 
similar documents, is your goal to find the ones which are oddly different? 
Have you looked into ES significant terms?
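For reference, a significant terms aggregation surfaces terms that are unusually frequent in a result set compared to the whole corpus, which can highlight the "oddly different" documents (a sketch; the field name and query text are made up):

```json
{
  "query": { "match": { "file": "mathematical logic" } },
  "aggregations": {
    "unusual_terms": {
      "significant_terms": { "field": "file" }
    }
  }
}
```

Note that significant_terms requires Elasticsearch 1.1 or later.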
 

> If you are looking for near duplicates, then I would recommend extracting 
> whatever text you have in your html, pdf, doc, indexing that and running 
> more like this with like_text set to that content.
> I tried that as well, and the results are very disappointing, though I'm 
> not sure that would be a good idea, keeping in mind that long textual 
> documents could be used. For testing purposes, I made a simple test with 
> 10 web pages. Maybe I'm making some mistake there. What I did was index 
> 10 web pages and store each in a document as an attachment. The content is 
> stored as byte[]. Then I take the same 10 pages, extract the content using 
> Jsoup, and try to find similar web pages. Here is the code that I used to 
> find web pages similar to the provided one:
> System.out.println("Duplicates for link:" + link);
> System.out.println("************************************************");
> String indexName = ESIndexNames.INDEX_DOCUMENTS;
> String indexType = ESIndexTypes.DOCUMENT;
> String mapping = copyToStringFromClasspath(
>         "/org/prosolo/services/indexing/document-mapping.json");
> client.admin().indices().putMapping(putMappingRequest(indexName)
>         .type(indexType).source(mapping)).actionGet();
> URL url = new URL(link);
> org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
> String html = doc.html(); // doc.text();
> // create the query
> QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
>         .likeText(html).minTermFreq(0).minDocFreq(0);
> SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
>         .setQuery(qb).addFields("url", "title", "contentType")
>         .setFrom(0).setSize(5).execute().actionGet();
> if (sr != null) {
>     SearchHits searchHits = sr.getHits();
>     Iterator<SearchHit> hitsIter = searchHits.iterator();
>     while (hitsIter.hasNext()) {
>         SearchHit searchHit = hitsIter.next();
>         System.out.println("Duplicate:" + searchHit.getId()
>                 + " title:" + searchHit.getFields().get("url").getValue()
>                 + " score:" + searchHit.getScore());
>     }
> }
>
>
> And the results of executing this for each of the 10 URLs are:
>  
> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
> ************************************************
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855
>
> Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
> ************************************************
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046
>
> Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
> ************************************************
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
> Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285
>
> Duplicates for link:http://en.wikipedia.org/wiki/Star
> ************************************************
> Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
> Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
> Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919
>
> Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
> ************************************************
> Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
> Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
> Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:<span style="colo
>

Here you should probably strip the HTML tags and index only the extracted 
text, in its own field. 
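One way to do that on the Elasticsearch side is an analyzer with the built-in html_strip char filter, applied to a dedicated text field (a sketch of an ES 1.x index setup; the type, field, and analyzer names are made up):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "content_text": { "type": "string", "analyzer": "html_text" }
      }
    }
  }
}
```

With this, you could index the raw HTML into "content_text" and point more_like_this at that field, instead of feeding doc.html() into like_text against a field analyzed differently.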

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c30400c5-ce33-4cb7-9335-759b3923ae14%40googlegroups.com.