Hi. What I need to achieve is better indexing of HTML documents.
I started with an analyzer that strips the HTML markup and works with the text only, but almost half of my searches will be against the HTML tags themselves (and, beyond that, against specific HTML attributes). For example, I have an index with a content field that stores the HTML page content, and a search might look like *name="generator" http-equiv="Wordpress 3.1"* or like *<script src="jquery.js">*.

So I wonder if there is a way to create a tokenizer that uses only the HTML tags and splits them into pieces (splitting on spaces is OK), so that we get something like 'html', 'name="generator"', 'src="jquery.js"'. All I was able to achieve so far is tokenizing each tag as a single token (with all of its attributes in it). Obviously this won't work for searching on individual attributes.

I will be glad to hear any suggestions.
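The desired token stream can be made concrete with a minimal Python sketch. It runs outside Elasticsearch and uses the standard library's html.parser; the names TagTokenizer and tag_tokens are made up for illustration. It is not an Elasticsearch tokenizer, only a description of the output such a tokenizer would ideally produce.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Hypothetical helper: emits one token per tag name and one per attribute.

    Text between tags is ignored, since only the markup is of interest here.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # The tag name becomes its own token, e.g. 'script'.
        self.tokens.append(tag)
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <script async>.
                self.tokens.append(name)
            else:
                # Attribute with a value, e.g. src="jquery.js".
                self.tokens.append(f'{name}="{value}"')


def tag_tokens(html):
    """Return the list of tag and attribute tokens found in an HTML string."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    page = '<html><meta name="generator"><script src="jquery.js"></script></html>'
    print(tag_tokens(page))
```

Note that html.parser lowercases tag and attribute names and drops the text content, so a token list like this only covers the markup side of the search; the stripped text would still need its own analyzer.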
