Hi Zoran,
If you are looking for exact duplicates, then hashing the file content and
searching for that hash would do the job. If you are looking for near
duplicates, then I would recommend extracting whatever text you have in
your HTML, PDF, or DOC files, indexing that, and running more like this with
like_text set to that content. Additionally, you can run an mlt search on
more fields, including the metadata fields extracted by the attachment
plugin. Hope this helps.
Alex
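
P.S. In case it is useful, here is a minimal Python sketch of both options.
It only builds the request bodies for the _search endpoint and does not call
Elasticsearch. The field names content_hash and extracted_text are
assumptions (they are not in your mapping): the first would be a
not_analyzed string holding a hash of the raw file bytes, the second a plain
string field holding the text you extracted yourself. SHA-256 is just one
reasonable choice of hash, and the parameter names follow the 1.x
more_like_this query as I understand it.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def exact_duplicate_query(data: bytes) -> dict:
    """Term query on a hypothetical not_analyzed `content_hash` field."""
    return {"query": {"term": {"content_hash": content_hash(data)}}}


def near_duplicate_query(text: str, fields=("extracted_text",)) -> dict:
    """more_like_this query with like_text set to the extracted plain text.

    `extracted_text` is a hypothetical field name used for illustration.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like_text": text,
                # Low thresholds so that rare terms are not dropped on a small index.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }
```

Identical bytes always give the same hash, so the term query finds exact
copies regardless of title or owner, while the more_like_this query only
ranks candidates and still needs a score cutoff that you would have to tune.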
On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:
>
> Hi Alex,
>
> Thank you for your explanation. It makes sense now. However, I'm not sure
> I understood your proposal.
>
> "So I would adjust the mlt_fields accordingly, and possibly extract the
> relevant portions of texts manually."
> What do you mean by adjusting mlt_fields? The only shared field that is
> guaranteed to be the same is file. Different users could add different titles
> to documents but attach the same or almost the same files. If I compare
> documents based on the other fields, they will not necessarily match,
> even though the attached files are exactly the same.
> I'm also not sure what you meant by extracting the relevant portions of
> text manually. How would I do that, and what would I do with it?
>
> Thanks,
> Zoran
>
>
> On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
>>
>> Hi Zoran,
>>
>> Using the attachment type, you can text search over the attached document
>> meta-data, but not its actual content, as it is base64-encoded. So I would
>> adjust the mlt_fields accordingly, and possibly extract the relevant
>> portions of texts manually. Also set percent_terms_to_match = 0, to ensure
>> that all boolean clauses match. Let me know how this works out for you.
>>
>> Cheers,
>>
>> Alex
>>
>> On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
>>>
>>> Hi guys,
>>>
>>> I have a document that stores the content of an HTML file, PDF, DOC, or other
>>> textual document in one of its fields as a byte array, using the attachment
>>> plugin. The mapping is as follows:
>>>
>>> {
>>>   "document": {
>>>     "properties": {
>>>       "title": { "type": "string", "store": true },
>>>       "description": { "type": "string", "store": "yes" },
>>>       "contentType": { "type": "string", "store": "yes" },
>>>       "url": { "type": "string", "store": "yes" },
>>>       "visibility": { "type": "string", "store": "yes" },
>>>       "ownerId": { "type": "long", "store": "yes" },
>>>       "relatedToType": { "type": "string", "store": "yes" },
>>>       "relatedToId": { "type": "long", "store": "yes" },
>>>       "file": {
>>>         "path": "full",
>>>         "type": "attachment",
>>>         "fields": {
>>>           "author": { "type": "string" },
>>>           "title": { "type": "string", "store": true },
>>>           "keywords": { "type": "string" },
>>>           "file": { "type": "string", "store": true, "term_vector": "with_positions_offsets" },
>>>           "name": { "type": "string" },
>>>           "content_length": { "type": "integer" },
>>>           "date": { "type": "date", "format": "dateOptionalTime" },
>>>           "content_type": { "type": "string" }
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> And the code I'm using to store the document is:
>>>
>>> VisibilityType.PUBLIC
>>>
>>> These files seem to be stored fine, and I can search their content. However, I
>>> need to identify whether there are duplicates of web pages or files stored in
>>> ES, so that I don't return the same documents to the user as search or
>>> recommendation results. My expectation was that I could use MoreLikeThis
>>> after a document was indexed to identify whether there are duplicates of that
>>> document and mark it as a duplicate accordingly. However, the results look
>>> weird to me, or I don't understand very well how MoreLikeThis works.
>>>
>>> For example, I indexed the web page http://en.wikipedia.org/wiki/Linguistics 3
>>> times, and all 3 documents in ES have exactly the same binary content
>>> under file. Then for the following query:
>>>
>>> http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
>>> where the ID is the id of one of these documents, I got these results:
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
>>> http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
>>> http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
>>> ...
>>>
>>> For some other examples, scores for identical documents are much lower,
>>> and sometimes (though not that often) the duplicates don't appear in the
>>> first positions. I would expect a score of 1.0 or higher for documents
>>> that are exactly the same, but that's not the case, and I can't figure out
>>> how to identify whether there are duplicates in the Elasticsearch index.
>>>
>>> I would appreciate it if somebody could explain whether this is expected
>>> behaviour or whether I didn't use it properly.
>>>
>>> Thanks,
>>> Zoran
>>>
>>>