Hi All,
I am having an issue writing rules for an application that has some
interesting requirements. Consider that we are categorizing, say, news feeds
into specific categories based on specific rule matches (a news article
contains "NBA", "NFL", "NHL", etc., so categorize it as a sports article;
or it contains "Bon Jovi", so categorize the document under music, etc.).
We have entity extractors that pull out all of the words (tokens) along with
metadata such as position in the document (for proximity-type matching) and
assert all of these as Jess facts. Our rules are set up like the following:
<code>
(defrule match-token-bon-jovi
  ;; First token: starts with "B"/"b", followed by any run of word characters
  (Token (content ?content1&/^[Bb][\\w&&[^ ]]*$/)
         (documentPosition ?pos1))
  ;; Second token: starts with "J"/"j" and immediately follows the first
  (Token (content ?content2&/^[Jj][\\w&&[^ ]]*$/)
         (documentPosition =(+ ?pos1 1)))
  =>
  (bind ?completeMatch (str-cat ?content1 " " ?content2))
  (if (stemMatch ?completeMatch "Bon Jovi") then
    (addCategory "Music")))
</code>
Now stemMatch is a function we created that stems the tokens (possibly
removing characters such as a trailing 's' or 'ed' from the end of a word,
which is why the regex only anchors on the first letter of each token rather
than matching the whole string), and addCategory simply records a category
that we later retrieve. Structuring the rules this way is really slow:
running a rule set of, say, 100 rules against a fact set of about 50,000
facts can take several minutes.

At this point the only way I can see to speed it up is to remove the
stemMatch call from the RHS and change the regex on the LHS to match what
the stemMatch function was accomplishing. Is there any other way to speed
up the process that I'm missing? Any suggestions would be greatly
appreciated. Thanks for your help in advance. Take care.
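
P.S. Here is a sketch of what I mean by moving the stemming into the LHS.
The (s|ed)? suffix alternation is only an illustration of the kind of
variation our stemMatch function tolerates, not its full behavior:
<code>
(defrule match-token-bon-jovi-lhs-only
  ;; Match the stemmable forms of "Bon" directly in the pattern, so the
  ;; Rete network filters tokens instead of a function call on the RHS
  (Token (content ?content1&/^[Bb]on(s|ed)?$/)
         (documentPosition ?pos1))
  ;; "Jovi" (plus an optional suffix) must immediately follow
  (Token (content ?content2&/^[Jj]ovi(s|ed)?$/)
         (documentPosition =(+ ?pos1 1)))
  =>
  (addCategory "Music"))
</code>
With the filtering done in the pattern, the RHS no longer fires for every
candidate token pair, only for actual matches.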