I am using the ElasticSearch mapper plugin to index the contents of pdf, xls, 
and ppt files. My mapping and indexing code are included below.

Indexing the documents seems to work fine and I am getting the expected 
results. However, when I look at the actual index size, it increases 
linearly with the file size. In other words, if I index a 100KB pdf, the 
index size increases by ~100KB. Ideally, the mapper should extract only the 
text data and index that. However, it doesn't seem to do so. I have the 
following two questions:

   1. Is it required to specify "content_type" for indexing the contents of 
   "non-text" files?
   2. What is the right way of doing content indexing? Doesn't the mapper take 
   care of file types? Based on the documentation, it looks like it does. 
   However, that doesn't seem to be the case in practice.

I am using ElasticSearch NEST for C#. Here is the mapping:

    [ElasticType(
        Name = "IndexDocument",
        SearchAnalyzer = "standard",
        IndexAnalyzer = "standard",
        DateDetection = true,
        NumericDetection = true
    )]
    public class Document
    {
        public string id { get; set; }
        [ElasticProperty(Type = Nest.FieldType.attachment, Store = false,
            TermVector = Nest.TermVectorOption.with_positions_offsets)]
        public ESAttachment esAttachment { get; set; }
    }

    public class ESAttachment
    {
        public string _content_type { get; set; }
        public string _name { get; set; }
        public string content { get; set; }
    }

Here is the code for indexing:

        esClient.MapFromAttributes<Document>();

        var item = new Document();
        item.esAttachment = new ESAttachment();
        item.esAttachment._content_type = "application/pdf";
        item.esAttachment.content =
            Convert.ToBase64String(System.IO.File.ReadAllBytes(file));
        item.esAttachment._name = "test-pdf";

        List<Document> bulkDoc = new List<Document>();
        bulkDoc.Add(item);

        var des = new BulkDescriptor();
        foreach (var doc in bulkDoc)
        {
            des.Index<Document>(j => j.Object(doc).Index("indexname"));
        }

        var status = esClient.BulkAsync(des);
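
For reference on the payload size: the attachment content above is sent as a base64 string, which is longer than the raw file. The following is a minimal sketch, separate from my indexing code, that compares the two sizes for a 100KB buffer (the same size as in my example). The class and method names are made up for illustration, and a zero-filled buffer stands in for a real pdf.

```csharp
using System;

public static class Base64SizeCheck
{
    // Number of characters Convert.ToBase64String produces for rawBytes
    // input bytes: every group of up to 3 bytes becomes 4 characters.
    public static long EncodedLength(long rawBytes)
    {
        return 4 * ((rawBytes + 2) / 3);
    }

    public static void Main()
    {
        // Hypothetical stand-in for File.ReadAllBytes(file) on a 100KB pdf.
        byte[] raw = new byte[100 * 1024];
        string encoded = Convert.ToBase64String(raw);

        Console.WriteLine("raw bytes:       " + raw.Length);
        Console.WriteLine("base64 chars:    " + encoded.Length);
        Console.WriteLine("predicted chars: " + EncodedLength(raw.Length));
    }
}
```

This only measures what is sent in the request body. It does not show what ElasticSearch actually keeps on disk for the document.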
