Re: Stripping html for indexing only?

IronMike Thu, 07 Aug 2014 09:38:01 -0700

Thanks. I tried the simple example below, and it doesn't seem to strip the 
HTML. What's missing?


//yml
index :
    analysis :
        analyzer :
            messageAnalyzer :
                type : custom
                tokenizer : standard
                filter : standard
                char_filter : [my_html]
        char_filter :
            my_html :
                type : html_strip
                read_ahead : 1024


//Create Index
PUT /twitter 
{
  "mappings": {
    "message" : {
      "properties" : {
        "message" : {
          "type" :    "string",
          "analyzer": "messageAnalyzer"
        },
        "date" : {
          "type" :   "date"
        },
        "name" : {
          "type" :   "string"
        }
      }
    }
  }
}


//Index a document
PUT /twitter/tweet/1
{
    "name" : "mike",
    "date" : "2009-11-15T14:12:12",
    "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
}


//Search for "ElasticSearch" yields html still
"fields": {
               "message": [
                  "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
               ]
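
For reference, the behaviour in the result above can be sketched outside 
Elasticsearch. The snippet below is only a rough stand-in, assuming a 
simplified tag stripper and a naive word tokenizer (the names TagStripper and 
index_terms are hypothetical, and this is not the real html_strip char filter 
or standard tokenizer). It illustrates that stripping would affect only the 
terms that get indexed, while the stored document that a search hands back 
keeps its markup.

```python
from html.parser import HTMLParser
import re


class TagStripper(HTMLParser):
    """Hypothetical stand-in for a char filter: keeps text, drops tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)


def index_terms(text):
    """Return the lowercase word terms that would go into the index."""
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()
    # Simplification: every removed tag is treated as a space.
    plain = " ".join(stripper.parts)
    return re.findall(r"\w+", plain.lower())


# The stored document is kept exactly as it was sent.
doc = {
    "name": "mike",
    "message": "<html>trying out <b>Elasticsearch</b>, This is an html test</html>",
}

terms = index_terms(doc["message"])
print(terms)           # the terms contain no tag names such as "b"
print(doc["message"])  # the original markup is untouched
```

If this picture is right, HTML in the returned "message" value does not by 
itself show that the char filter failed; what matters is which terms were 
indexed. Separately, the mapping above is declared for type "message" while 
the document is indexed under type "tweet", which may be worth checking.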





On Wednesday, August 6, 2014 2:59:53 PM UTC-4, Ivan Brusic wrote:
>
> 1. Correct.
> 2. Also correct. The analysis chain only affects how the terms are indexed 
> and placed in the inverted index. The original document remains as is.
> 3. Not sure since I have never done highlighting. Highlighting might not 
> depend on the source since the term positions/offsets are used, but 
> hopefully someone will correct me.
>
> -- 
> Ivan
>
>
> On Wed, Aug 6, 2014 at 11:45 AM, IronMike <[email protected]> wrote:
>
>> I searched this topic but some of the answers were still vague to me.
>>
>> My goal is to index HTML docs with the HTML stripped for indexing. At the 
>> same time, I would like _source to keep the original HTML document for 
>> display purposes.
>>
>> //My doc format:
>> {
>>   content: <html> Hello this is an html <b>content</b> ....</html>
>>   rank:1
>>   date:2014-8-8
>>   title: Some title
>>   ....
>> }
>>
>> The questions that I am still not very clear on:
>>
>> 1 - If I understand correctly, I can push an HTML doc as-is to the index, 
>> and it will strip the HTML provided I use the char filter referenced here?
>>     
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
>>
>> 2 - Will the stripping not affect the _source? In other words, will _source 
>> still have the original HTML?
>>
>> 3 - Does highlighting come from the _source? If so, highlighting will 
>> contain HTML, meaning I will have to strip any HTML tags after the search 
>> comes back?
>>
>>
>> Thanks
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/6be77d25-f7fe-4a35-a247-932f93f07150%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

