Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hello,

> Hello,
>
> I've installed and configured MnoGoSearch as a powerful full text search engine for
> CMS websites a few days ago. But right now I am a little bit confused about the
> configuration of document sections.
>
> I would like to index the headlines (<h1>, <h2>, <h3>) in special fields so that I
> can weight them more in comparison to the body text.
>
> There is one example given in indexer.conf:
> Section h1 26 128 "<h1>(.*)</h1>" $1
>
> This works fine because normally there is only one <h1> on a webpage. But when I try
> to index all <h2> headlines using the regular expression "<h2>(.*)</h2>" $1, the
> whole content between the first <h2> and the last <h2> gets indexed. What I would
> like to get is only the text between the <h2>...</h2> tags.
>
> Could somebody please tell me if there is a solution for that problem?

There are two problems here:

1. Nested tags: <h2>...<xxx>...</xxx>...</h2>

Unfortunately, there is no general solution for this, because
the underlying regexp library does not support so-called
"non-greedy quantifiers". We definitely need to switch to the
PCRE library eventually to make this possible.

But there is a workaround that I think should work for <h2> and <h3>.
The idea is that <h2> and <h3> usually do not have nested tags,
so the regexp can scan everything up to the next '<' character:

Section h2 27 128 "<h2>([^<]*)</h2>" $1
Section h3 28 128 "<h3>([^<]*)</h3>" $1

It will work for:

<h2>text text</h2>

It will not work for:

<h2>text <xxx>text</xxx> text</h2>

where xxx is some other tag.

Do you know of any tags that can appear inside <h2></h2> or <h3></h3>?

2. Multiple <h2> or <h3> tags.

User-defined sections do not support multiple entries.
They catch only the first match. Adding support for multiple
matches (e.g. to concatenate them) will need some coding.

>
> Thanks a lot for your help
> Felix

Reply: <http://www.mnogosearch.org/board/message.php?id=21591>
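
For readers who want to see the difference between the two patterns discussed in the reply, here is a minimal sketch. It uses Python's `re` module purely as a stand-in; it is not the regexp library mnoGoSearch uses, and the sample HTML strings and the `first_match` helper are made up for illustration. Like the user-defined sections described above, the helper returns only the first match.

```python
import re

# Hypothetical sample page with two <h2> headlines (not from the original post).
html = "<h2>First</h2><p>body text</p><h2>Second</h2>"

# Pattern from the question: ".*" is greedy and runs to the last </h2>.
GREEDY = re.compile(r"<h2>(.*)</h2>")

# Pattern from the workaround: "[^<]*" stops at the next '<' character.
NEGATED = re.compile(r"<h2>([^<]*)</h2>")


def first_match(pattern, text):
    """Return group 1 of the first match only, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


# Greedy pattern swallows everything between the first <h2> and the last </h2>.
greedy_result = first_match(GREEDY, html)

# Negated character class captures just the first headline.
negated_result = first_match(NEGATED, html)

# The workaround does not match a headline that contains a nested tag.
nested_result = first_match(NEGATED, "<h2>text <b>bold</b> text</h2>")

print(greedy_result)
print(negated_result)
print(nested_result)
```

With this input the greedy pattern captures the text of both headlines plus everything in between, the negated-class pattern captures only the first headline, and the headline with a nested tag is not matched at all, which mirrors the limitation noted in the reply.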