Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates

Alex Ksikes Mon, 05 May 2014 01:24:14 -0700

Hi Zoran,

Using the attachment type, you can run a text search over the attached 
document's metadata, but not over its actual content, as it is base64 
encoded. So I would adjust the mlt_fields accordingly, and possibly extract 
the relevant portions of text manually. Also set percent_terms_to_match = 0, 
to ensure that all boolean clauses match. Let me know how this works out for 
you.
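
A minimal sketch of how such an adjusted request could be put together, in Python with the standard library only. The helper name build_mlt_url is hypothetical; the host, index, type and id are copied from the query quoted below; title and description are just two metadata fields from the mapping, picked for illustration. The parameter values are starting points to experiment with, not verified settings.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, index, doc_type, doc_id, mlt_fields, **params):
    """Assemble a _mlt request URL for one document (hypothetical helper)."""
    # mlt_fields is passed as one comma-separated query parameter.
    query = {"mlt_fields": ",".join(mlt_fields)}
    query.update(params)
    # safe="," keeps the comma in the field list unescaped and readable.
    return "{}/{}/{}/{}/_mlt?{}".format(
        base_url.rstrip("/"), index, doc_type, doc_id,
        urlencode(query, safe=","),
    )


url = build_mlt_url(
    "http://localhost:9200", "documents", "document",
    "WpkcK-ZjSMi_l6iRq0Vuhg",
    ["title", "description"],  # assumption: metadata fields to compare on
    min_doc_freq=1,
    percent_terms_to_match=0,
)
print(url)
```

The printed URL can then be requested with curl or a browser, in the same way as the original query.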


Cheers,

Alex

On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>
> Hi guys,
>
> I have a document that stores the content of an HTML file, PDF, DOC or 
> other textual document in one of its fields as a byte array, using the 
> attachment plugin. The mapping is as follows:
>
> { "document":{
>         "properties":{
>              "title":{"type":"string","store":true },
>              "description":{"type":"string","store":"yes"},
>              "contentType":{"type":"string","store":"yes"},
>              "url":{"store":"yes", "type":"string"},
>              "visibility": { "store":"yes", "type":"string"},
>              "ownerId": {"type": "long", "store":"yes" },
>              "relatedToType": { "type": "string", "store":"yes" },
>              "relatedToId": {"type": "long", "store":"yes" },
>              "file":{
>                    "path": "full","type":"attachment",
>                    "fields":{
>                         "author": { "type": "string" },
>                         "title": { "store": true,"type": "string" },
>                         "keywords": { "type": "string" },
>                         "file": { "store": true, "term_vector": "with_positions_offsets","type": "string" },
>                         "name": { "type": "string" },
>                         "content_length": { "type": "integer" },
>                         "date": { "format": "dateOptionalTime", "type": "date" },
>                         "content_type": { "type": "string" }
>                    }
>              }
>         }
> }}
> And the code I'm using to store the document is:
>
> VisibilityType.PUBLIC
>
> These files seem to be stored fine and I can search their content. 
> However, I need to identify whether there are duplicates of web pages or 
> files stored in ES, so that I don't return the same documents to the user 
> as search or recommendation results. My expectation was that I could use 
> MoreLikeThis after a document was indexed to identify whether there are 
> duplicates of that document and mark it as a duplicate accordingly. 
> However, the results look weird to me, or I don't understand very well how 
> MoreLikeThis works.
>
> For example, I indexed the web page 
> http://en.wikipedia.org/wiki/Linguistics 3 times, and all 3 documents in 
> ES have exactly the same binary content under file. Then for the following 
> query:
>
> http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
> where the ID is the id of one of these documents, I got these results:
> http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
> http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
> http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
> ...
>
> For some other examples, the scores for identical documents are much 
> lower, and sometimes (though not that often) the duplicates don't appear 
> in the first positions. I would expect a score of 1.0 or higher for 
> documents that are exactly the same, but that's not the case, and I can't 
> figure out how I could identify duplicates in the Elasticsearch index.
>
> I would appreciate it if somebody could explain whether this is expected 
> behaviour or whether I didn't use it properly.
>
> Thanks,
> Zoran
>
>
