One concern with the mapper attachment plugin is that you have to send the 
full document (100 KB) even if, in the end, only a single character is 
extracted from it.
Also, _source is stored by default. That means your Base64-encoded field will 
be stored as is in Elasticsearch.

You can disable _source, or you can remove part of the source using 
includes/excludes: 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-source-field.html#include-exclude

Also, the _all field, which is enabled by default, indexes the content a 
second time. You may want to disable it.
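
For example, a mapping along these lines would keep the Base64 content out of 
the stored _source and turn off _all. This is only a sketch: the type and 
field names (IndexDocument, esAttachment) are taken from your mapping below, 
the excludes pattern is my assumption about where the Base64 string lives in 
your documents, and the syntax should be checked against the documentation 
page above for your Elasticsearch version.

```json
{
  "IndexDocument": {
    "_source": {
      "excludes": ["esAttachment.content"]
    },
    "_all": {
      "enabled": false
    },
    "properties": {
      "esAttachment": {
        "type": "attachment"
      }
    }
  }
}
```

Note that anything excluded from _source is no longer returned when you fetch 
the document, so keep the original file somewhere else if you need it back.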

1/ You don't have to set _content_type. It will be automatically set by the 
plugin. If you force it, you need to make sure it corresponds to the actual 
content.
2/ Do you mean file extension? No, we don't care about filename or extension…

I hope this helps

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr




On 23 June 2014 at 22:46:39, Deepikaa Subramaniam ([email protected]) 
wrote:

I am using the Elasticsearch mapper attachment plugin to index the contents 
of PDF, XLS and PPT files. My mapping is shown further below.

Indexing of the documents seems to be working fine and I am getting the 
expected results. However, when I look at the actual index size, it increases 
linearly with the file size. In other words, if I index a 100 KB PDF, the 
index size increases by ~100 KB. Ideally, the mapper should extract only the 
text data and index that. However, it doesn't seem to do so. I have the 
following two questions:

1/ Is it required to specify "content_type" when indexing the contents of 
"non-text" files?
2/ What is the right way of doing content indexing? Doesn't the mapper take 
care of file types? Based on the documentation, it looks like it does. 
However, it doesn't seem to be the case in practice.

I am using Elasticsearch NEST for C#. Here is the mapping:


    [ElasticType(
        Name = "IndexDocument",
        SearchAnalyzer = "standard",
        IndexAnalyzer = "standard",
        DateDetection = true,
        NumericDetection = true
    )]
    public class Document
    {
        public string id { get; set; }
        [ElasticProperty(Type = Nest.FieldType.attachment, Store = false,
            TermVector = Nest.TermVectorOption.with_positions_offsets)]
        public ESAttachment esAttachment { get; set; }
    }

    public class ESAttachment
    {
        public string _content_type { get; set; }
        public string _name { get; set; }
        public string content { get; set; }
    }

Here is the code for indexing:


        esClient.MapFromAttributes<Document>();

        var item = new Document();
        item.esAttachment = new ESAttachment();
        item.esAttachment._content_type = "application/pdf";
        item.esAttachment.content =
            Convert.ToBase64String(System.IO.File.ReadAllBytes(file));
        item.esAttachment._name = "test-pdf";

        List<Document> bulkDoc = new List<Document>();
        bulkDoc.Add(item);

        var des = new BulkDescriptor();
        foreach (var doc in bulkDoc)
        {
            des.Index<Document>(j => j.Object(doc).Index("indexname"));
        }

        var status = esClient.BulkAsync(des);
--
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/02b8b822-ed47-4da5-901b-07b020179614%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.