Re: MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates

Alex Ksikes Tue, 06 May 2014 02:18:30 -0700

Hi Zoran,

If you are looking for exact duplicates, then hashing the file content and 
searching for that hash would do the job. If you are looking for near 
duplicates, then I would recommend extracting whatever text you have in 
your HTML, PDF or DOC files, indexing that, and running more like this with 
like_text set to that content. Additionally, you can perform an mlt search 
on more fields, including the meta-data fields extracted with the attachment 
plugin. Hope this helps.
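
A rough sketch of both suggestions, in Python only for brevity. The field 
names file_hash and extracted_text are hypothetical (they are not in the 
mapping from this thread and would have to be added and populated at index 
time), and the more_like_this parameter names are the ones used in this 
thread's era of the API, so treat this as an illustration under those 
assumptions rather than a tested recipe. It only builds the search bodies; 
sending them to the cluster is left out.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw attachment bytes."""
    return hashlib.sha256(data).hexdigest()


def exact_duplicate_query(digest: str, hash_field: str = "file_hash") -> dict:
    """Build a search body that matches documents storing the same hash.

    Assumption: `hash_field` is a hypothetical, non-analyzed string field
    that holds content_hash(...) of the attachment for every document.
    """
    return {"query": {"term": {hash_field: digest}}}


def near_duplicate_query(text: str, fields, percent_terms_to_match: float = 0.3) -> dict:
    """Build a more_like_this search body with like_text set to `text`.

    Assumption: `fields` names plain-text fields (for example a hypothetical
    `extracted_text`) that were filled with text extracted before indexing.
    The 0.3 here is only an illustrative value.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": text,
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Fraction of the selected terms that a hit must contain.
                "percent_terms_to_match": percent_terms_to_match,
            }
        }
    }


if __name__ == "__main__":
    page = b"<html><body>Linguistics is the scientific study of language.</body></html>"
    print(exact_duplicate_query(content_hash(page)))
    print(near_duplicate_query("Linguistics is the scientific study of language.",
                               ["extracted_text"]))
```

Two uploads with byte-identical content get the same digest, so the term 
query finds them regardless of title or owner. The more_like_this body is 
for the "almost the same" case, where hashes differ.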


Alex

On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:
>
> Hi Alex,
>
> Thank you for your explanation. It makes sense now. However, I'm not sure 
> I understood your proposal. 
>
> "So I would adjust the mlt_fields accordingly, and possibly extract the 
> relevant portions of texts manually."
> What do you mean by adjusting mlt_fields? The only shared field that is 
> guaranteed to be the same is file. Different users could add different 
> titles to documents, but attach the same or almost the same documents. If 
> I compare documents based on the other fields, they will not necessarily 
> match, even though the attached files are exactly the same.
> I'm also not sure what you meant by extracting the relevant portions of 
> text manually. How would I do that, and what would I do with the result?
>
> Thanks,
> Zoran
>  
>
> On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
>>
>> Hi Zoran,
>>
>> Using the attachment type, you can text search over the attached document 
>> meta-data, but not its actual content, as it is base 64 encoded. So I would 
>> adjust the mlt_fields accordingly, and possibly extract the relevant 
>> portions of texts manually. Also set percent_terms_to_match = 0, so that 
>> a document can match even when only some of the selected terms occur in 
>> it. Let me know how this works out for you.
>>
>> Cheers,
>>
>> Alex
>>
>> On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>>>
>>> Hi guys,
>>>
>>> I have a document that stores the content of an HTML file, PDF, DOC or 
>>> other textual document in one of its fields as a byte array, using the 
>>> attachment plugin. The mapping is as follows:
>>>
>>> { "document": {
>>>     "properties": {
>>>         "title": { "type": "string", "store": true },
>>>         "description": { "type": "string", "store": "yes" },
>>>         "contentType": { "type": "string", "store": "yes" },
>>>         "url": { "type": "string", "store": "yes" },
>>>         "visibility": { "type": "string", "store": "yes" },
>>>         "ownerId": { "type": "long", "store": "yes" },
>>>         "relatedToType": { "type": "string", "store": "yes" },
>>>         "relatedToId": { "type": "long", "store": "yes" },
>>>         "file": {
>>>             "path": "full", "type": "attachment",
>>>             "fields": {
>>>                 "author": { "type": "string" },
>>>                 "title": { "store": true, "type": "string" },
>>>                 "keywords": { "type": "string" },
>>>                 "file": { "store": true, "type": "string",
>>>                           "term_vector": "with_positions_offsets" },
>>>                 "name": { "type": "string" },
>>>                 "content_length": { "type": "integer" },
>>>                 "date": { "format": "dateOptionalTime", "type": "date" },
>>>                 "content_type": { "type": "string" }
>>>             }
>>>         }
>>>     }
>>> }}
>>>
>>> And the code I'm using to store the document is:
>>>
>>> VisibilityType.PUBLIC
>>>
>>> These files seem to be stored fine and I can search their content. 
>>> However, I need to identify whether there are duplicates of web pages or 
>>> files stored in ES, so that I don't return the same documents to the user 
>>> as search or recommendation results. My expectation was that I could use 
>>> MoreLikeThis after a document was indexed to identify whether there are 
>>> duplicates of that document and mark it as a duplicate accordingly. 
>>> However, the results look weird to me, or I don't understand very well 
>>> how MoreLikeThis works.
>>>
>>> For example, I indexed the web page 
>>> http://en.wikipedia.org/wiki/Linguistics 3 times, and all 3 documents in 
>>> ES have exactly the same binary content under file. Then for the 
>>> following query:
>>>
>>> http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
>>> where the ID is the id of one of these documents, I got these results:
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
>>> http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
>>> ...
>>>
>>> For some other examples, scores for identical documents are much lower, 
>>> and sometimes (though not that often) the duplicates do not appear in the 
>>> first positions. I would expect a score of 1.0 or higher for documents 
>>> that are exactly the same, but that is not the case, and I can't figure 
>>> out how I could identify duplicates in the Elasticsearch index.
>>>
>>> I would appreciate it if somebody could explain whether this is expected 
>>> behaviour or whether I didn't use it properly.
>>>
>>> Thanks,
>>> Zoran
>>>
>>>
