Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hello,

> Hello,
> 
> I've installed and configured MnoGoSearch as a powerful full text search
> engine for CMS websites a few days ago. But right now I am a little bit
> confused about the configuration of document sections.
> 
> I would like to index the headlines (<h1>, <h2>, <h3>) in special fields
> so that I can weight them more in comparison to the body text.
> 
> There is one example given in indexer.conf:
> Section h1  26  128  "<h1>(.*)</h1>" $1
> 
> This works fine because normally there is only one <h1> on a webpage. But
> when I try to index all <h2> headlines using the regular expression
> "<h2>(.*)</h2>" $1, the whole content between the first <h2> and the last
> <h2> gets indexed. What I would like to get is only the text between the
> <h2>...</h2> tags.
> 
> Could somebody please tell me if there is a solution for that problem?

There are two problems here:
1. Nested tags: <h2>...<xxx>...</xxx>...</h2>

Unfortunately, there is no general solution for this,
because the underlying regexp library does not support
so-called "non-greedy quantifiers". We definitely need
to switch to the PCRE library eventually to make this possible.

But there is a workaround that I think should work for <h2> and <h3>.
The idea is that <h2> and <h3> usually do not have nested tags,
so the regexp can scan everything until the next '<' character:

Section h2  27  128  "<h2>([^<]*)</h2>" $1
Section h3  28  128  "<h3>([^<]*)</h3>" $1

It will work for: <h2>text text</h2>

It will not work for: <h2>text <xxx>text</xxx> text</h2>
where xxx is some other tag. 
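
To illustrate the difference between the two patterns outside of indexer,
here is a small Python sketch. It is only an illustration: Python's "re"
module is not the regexp library that indexer uses, and the sample HTML
string and variable names are made up for this example. The two patterns
themselves are the ones discussed above.

```python
import re

# Made-up sample page with two plain <h2> headlines.
html = "<h2>First</h2><p>body</p><h2>Second</h2>"

# Greedy ".*" runs from the first <h2> up to the LAST </h2>.
greedy = re.search(r"<h2>(.*)</h2>", html).group(1)

# "[^<]*" cannot cross a '<', so it stops at the first closing tag.
bounded = re.search(r"<h2>([^<]*)</h2>", html).group(1)

# A nested tag inside <h2> defeats the workaround: no match at all.
nested = re.search(r"<h2>([^<]*)</h2>", "<h2>text <xxx>text</xxx> text</h2>")

print(repr(greedy))
print(repr(bounded))
print(nested)
```

The greedy pattern captures everything between the first <h2> and the last
</h2>, the bounded pattern captures only the first headline, and the nested
case finds nothing.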

Do you know of any tags that can appear inside <h2></h2> or <h3></h3>?


2. Multiple <h2> or <h3> tags.
User-defined sections do not support multiple entries:
they catch only the first match. Adding support for multiple
matches (e.g. by concatenating them) will need some coding.
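
Just to show what "concatenating multiple matches" would mean, here is a
rough Python sketch. It is not indexer code and not an existing feature;
the function name concat_headings and the sample input are hypothetical,
and it assumes headlines without nested tags, as in the workaround above.

```python
import re

def concat_headings(html, tag):
    """Join the text of every <tag>...</tag> that contains no nested tags."""
    name = re.escape(tag)
    pattern = re.compile(r"<%s>([^<]*)</%s>" % (name, name))
    # findall() returns the captured group of each non-overlapping match.
    return " ".join(text.strip() for text in pattern.findall(html))

# Hypothetical sample input with two <h2> headlines.
sample = "<h2>One</h2><p>x</p><h2>Two</h2>"
print(concat_headings(sample, "h2"))
```

Something along these lines could also be done as a preprocessing step
before the documents reach indexer, if that fits your setup.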


> 
> Thanks a lot for your help
> Felix


Reply: <http://www.mnogosearch.org/board/message.php?id=21591>

_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
