Thanks. I tried the simple example below, and it doesn't seem to strip the HTML.
What's missing?
//yml
index :
    analysis :
        analyzer :
            messageAnalyzer :
                type : custom
                tokenizer : standard
                filter : standard
                char_filter : [my_html]
        char_filter :
            my_html :
                type : html_strip
                read_ahead : 1024
//Create Index
PUT /twitter
{
    "mappings": {
        "message" : {
            "properties" : {
                "message" : {
                    "type" : "string",
                    "analyzer": "messageAnalyzer"
                },
                "date" : {
                    "type" : "date"
                },
                "name" : {
                    "type" : "string"
                }
            }
        }
    }
}
//Index a document
PUT /twitter/tweet/1
{
    "name" : "mike",
    "date" : "2009-11-15T14:12:12",
    "message" : "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
}
//Searching for "Elasticsearch" still returns the html
"fields": {
    "message": [
        "<html>trying out <b>Elasticsearch</b>, This is an html test</html>"
    ]
}
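
For anyone following along, here is a minimal sketch in plain Python (not Elasticsearch code) of the behavior described in the quoted reply below: the char filter changes only the terms that go into the inverted index, while the stored document comes back exactly as it was submitted. The names strip_html, analyze, and ToyIndex are made up for illustration, and the regex tokenizer only loosely imitates the standard tokenizer.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags (toy stand-in for html_strip)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return the text content of an HTML string with the tags removed."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


def analyze(text):
    """Hypothetical analysis chain: strip HTML, lowercase, split into words."""
    return re.findall(r"\w+", strip_html(text).lower())


class ToyIndex:
    """Keeps the original source and a separate inverted index of terms."""

    def __init__(self):
        self.sources = {}   # doc id -> original document, untouched
        self.inverted = {}  # term -> set of doc ids

    def index(self, doc_id, source, field):
        self.sources[doc_id] = source
        for term in analyze(source[field]):
            self.inverted.setdefault(term, set()).add(doc_id)

    def search(self, term):
        ids = sorted(self.inverted.get(term.lower(), ()))
        return [self.sources[i] for i in ids]


if __name__ == "__main__":
    idx = ToyIndex()
    doc = {"message": "<html>trying out <b>Elasticsearch</b>, "
                      "This is an html test</html>"}
    idx.index(1, doc, "message")
    print(analyze(doc["message"]))      # indexed terms, no tags
    print(idx.search("Elasticsearch"))  # hit still carries the original html
    print(idx.search("b"))              # the <b> tag name was never indexed
```

In this toy model a search for a tag name such as "b" finds nothing, which is the effect of stripping, even though every hit still displays the original markup.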
On Wednesday, August 6, 2014 2:59:53 PM UTC-4, Ivan Brusic wrote:
>
> 1. Correct.
> 2. Also correct. The analysis chain only affects how the terms are indexed
> and placed in the inverted index. The original document remains as is.
> 3. Not sure since I have never done highlighting. Highlighting might not
> depend on the source since the term positions/offsets are used, but
> hopefully someone will correct me.
>
> --
> Ivan
>
>
> On Wed, Aug 6, 2014 at 11:45 AM, IronMike <[email protected]> wrote:
>
>> I searched this topic but some of the answers were still vague to me.
>>
>> My goal is to index html docs with the html stripped for indexing. At
>> the same time, I would like _source to keep the original html document
>> for display purposes.
>>
>> //My doc format:
>> {
>> content: <html> Hello this is an html <b>content</b> ....</html>
>> rank:1
>> date:2014-8-8
>> title: Some title
>> ....
>> }
>>
>> The questions that I am still not very clear on:
>>
>> 1 - If I understand correctly, I can push the html doc as it is to the
>> index, and the html will be stripped, provided I use the char filter
>> referenced here?
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html
>>
>> 2 - Will the stripping leave _source unaffected? In other words, will
>> _source still have the original html?
>>
>> 3 - Does highlighting come from _source? If so, the highlights will
>> contain html, meaning I will have to strip any html tags after the
>> search comes back?
>>
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/6be77d25-f7fe-4a35-a247-932f93f07150%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>