Hi guys,
I have a document that stores the content of an HTML file, PDF, DOC or
other textual document in one of its fields as a byte array, using the
attachment plugin. The mapping is as follows:
{
  "document": {
    "properties": {
      "title": { "type": "string", "store": true },
      "description": { "type": "string", "store": "yes" },
      "contentType": { "type": "string", "store": "yes" },
      "url": { "store": "yes", "type": "string" },
      "visibility": { "store": "yes", "type": "string" },
      "ownerId": { "type": "long", "store": "yes" },
      "relatedToType": { "type": "string", "store": "yes" },
      "relatedToId": { "type": "long", "store": "yes" },
      "file": {
        "path": "full",
        "type": "attachment",
        "fields": {
          "author": { "type": "string" },
          "title": { "store": true, "type": "string" },
          "keywords": { "type": "string" },
          "file": { "store": true, "term_vector": "with_positions_offsets", "type": "string" },
          "name": { "type": "string" },
          "content_length": { "type": "integer" },
          "date": { "format": "dateOptionalTime", "type": "date" },
          "content_type": { "type": "string" }
        }
      }
    }
  }
}
And the code I'm using to store the document is:
VisibilityType.PUBLIC
These files seem to be stored fine and I can search their content. However,
I need to identify whether there are duplicates of web pages or files stored
in ES, so that I don't return the same document to the user more than once
in search or recommendation results. My expectation was that I could run
MoreLikeThis after a document is indexed to find its duplicates and mark
them accordingly. However, the results look weird to me, or I don't
understand very well how MoreLikeThis works.
For example, I indexed the web page http://en.wikipedia.org/wiki/Linguistics
3 times, and all 3 documents in ES have exactly the same binary content
under file. Then for the following query:
http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
where the ID is the id of one of these documents, I got these results:
http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
...
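For anyone who wants to reproduce the request, here is a minimal Python
sketch that assembles the same URL with the standard library only. The
helper name build_mlt_url and its parameters are made up for illustration;
the host, index, type, ID and query parameters are the ones from the query
above. It only builds the URL and does not contact Elasticsearch.

```python
from urllib.parse import quote, urlencode


def build_mlt_url(host, index, doc_type, doc_id, mlt_fields, min_doc_freq=1):
    """Build the _mlt URL for one indexed document (hypothetical helper)."""
    # Escape each path segment separately so an unusual ID cannot break the path.
    path = "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    # mlt_fields is sent as a comma-separated list of field names.
    params = urlencode({
        "mlt_fields": ",".join(mlt_fields),
        "min_doc_freq": min_doc_freq,
    })
    return f"{host}/{path}/_mlt?{params}"


url = build_mlt_url(
    "http://localhost:9200",
    "documents",
    "document",
    "WpkcK-ZjSMi_l6iRq0Vuhg",
    ["file"],
)
print(url)
```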
For some other examples, the scores for identical documents are much lower,
and sometimes (though not that often) the duplicates are not even in the
first positions. I would expect a score of 1.0 or higher for documents that
are exactly the same, but that is not the case, and I can't figure out how
to identify duplicates in the Elasticsearch index.
I would appreciate it if somebody could explain whether this is expected
behaviour or whether I'm not using it properly.
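To make "exactly the same" concrete: what I expected is roughly equivalent
to comparing a digest of the raw file bytes. Below is a minimal sketch of
that idea in plain Python, outside Elasticsearch. It is not what
MoreLikeThis does, and the names content_digest and group_duplicates, the
sample IDs and the sample content are all hypothetical. It assumes the file
content is available as a base64 string.

```python
import base64
import hashlib
from collections import defaultdict


def content_digest(file_b64):
    """Return the SHA-256 hex digest of the decoded file bytes."""
    raw = base64.b64decode(file_b64)
    return hashlib.sha256(raw).hexdigest()


def group_duplicates(docs):
    """Group document IDs whose file bytes are identical.

    docs is an iterable of (doc_id, base64_content) pairs (assumed shape).
    Only groups with more than one ID are returned.
    """
    by_digest = defaultdict(list)
    for doc_id, file_b64 in docs:
        by_digest[content_digest(file_b64)].append(doc_id)
    return [ids for ids in by_digest.values() if len(ids) > 1]


# Made-up sample data: the same page stored three times, plus a different one.
page = base64.b64encode(b"<html>Linguistics</html>").decode("ascii")
other = base64.b64encode(b"<html>Computational linguistics</html>").decode("ascii")
docs = [("a", page), ("b", page), ("c", page), ("d", other)]

duplicates = group_duplicates(docs)
print(duplicates)
```

This only catches byte-for-byte copies; a page that changed slightly between
fetches would get a different digest.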
Thanks,
Zoran