MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates

Zoran Jeremic Sun, 04 May 2014 20:50:28 -0700

Hi guys,

I have a document type that stores the content of an HTML file, PDF, DOC, or 
other textual document in one of its fields as a byte array, using the 
attachment plugin. The mapping is as follows:


{ "document":{
        "properties":{
             "title":{"type":"string","store":true },
             "description":{"type":"string","store":"yes"},
             "contentType":{"type":"string","store":"yes"},
             "url":{"store":"yes", "type":"string"},
             "visibility": { "store":"yes", "type":"string"},
             "ownerId": {"type": "long", "store":"yes" },
             "relatedToType": { "type": "string", "store":"yes" },
             "relatedToId": {"type": "long", "store":"yes" },
             "file":{
                   "path": "full","type":"attachment",
                   "fields":{
                       "author": { "type": "string" },
                       "title": { "store": true,"type": "string" },
                       "keywords": { "type": "string" },
                       "file": { "store": true, "term_vector": "with_positions_offsets","type": "string" },
                       "name": { "type": "string" },
                       "content_length": { "type": "integer" },
                       "date": { "format": "dateOptionalTime", "type": "date" },
                       "content_type": { "type": "string" }
                   }
             }
        }
}}

And the code I'm using to store the document is:

VisibilityType.PUBLIC

These files seem to be stored fine and I can search their content. However, 
I need to identify whether there are duplicates of web pages or files stored 
in ES, so that I don't return the same document to the user more than once 
in search or recommendation results. My expectation was that I could run 
MoreLikeThis after a document is indexed to find duplicates of that document 
and mark it as a duplicate accordingly. However, the results look weird to 
me, or I don't understand very well how MoreLikeThis works.

For example, I indexed web page http://en.wikipedia.org/wiki/Linguistics 3 
times, and all 3 documents in ES have exactly the same binary content under 
file. Then for the following query:
http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
where the ID is the id of one of these documents, I got these results:
http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
...
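
For anyone who wants to reproduce the request, the URL above can be 
assembled programmatically. This is only a minimal sketch in Python: the 
host, index, type, ID, and parameters are copied from the URL above, and the 
helper name build_mlt_url is hypothetical, not part of any client library.

```python
from urllib.parse import urlencode


def build_mlt_url(host, index, doc_type, doc_id, mlt_fields, min_doc_freq):
    """Assemble a MoreLikeThis URL of the same shape as the request above.

    build_mlt_url is a hypothetical helper used only for illustration.
    """
    query = urlencode({
        "mlt_fields": ",".join(mlt_fields),  # fields to compare on
        "min_doc_freq": min_doc_freq,        # keep terms seen in at least this many docs
    })
    return f"{host}/{index}/{doc_type}/{doc_id}/_mlt?{query}"


url = build_mlt_url(
    "http://localhost:9200",
    "documents",
    "document",
    "WpkcK-ZjSMi_l6iRq0Vuhg",
    ["file"],
    1,
)
print(url)
```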

For some other examples, the scores for identical documents are much lower, 
and sometimes (though not that often) the duplicates are not in the first 
positions. I would expect a score of 1.0 or higher for documents that are 
exactly the same, but that is not the case, and I can't figure out how I 
could identify duplicates in the Elasticsearch index.
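
To make the "exactly the same" case concrete: byte-identical content can be 
detected without any relevance score by comparing digests of the raw bytes. 
The sketch below is a different technique from MoreLikeThis, not something 
the attachment plugin does. It assumes the raw bytes are available on the 
client side, and the function names are hypothetical. It only catches 
byte-for-byte copies, so two fetches of the same page that differ in even 
one byte would not match.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw content bytes."""
    return hashlib.sha256(data).hexdigest()


def find_exact_duplicates(docs):
    """Group document ids whose content bytes are identical.

    docs maps a document id to its raw bytes (hypothetical input shape).
    Returns a list of id groups, one per set of byte-identical documents.
    """
    groups = {}
    for doc_id, data in docs.items():
        groups.setdefault(content_fingerprint(data), []).append(doc_id)
    # Keep only fingerprints shared by more than one document.
    return [ids for ids in groups.values() if len(ids) > 1]


sample = {
    "doc1": b"<html>Linguistics</html>",
    "doc2": b"<html>Linguistics</html>",
    "doc3": b"<html>Computational linguistics</html>",
}
print(find_exact_duplicates(sample))
```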

I would appreciate it if somebody could explain whether this is expected 
behaviour or whether I am not using it properly.

Thanks,
Zoran

