Hi All,
   I am having an issue writing rules for an application that has some 
interesting requirements.  Consider that we are categorizing, say, news feeds 
into specific categories based on specific rule matches (a news article 
contains the token "NBA", "NFL", or "NHL", etc., so categorize it as a sports 
article; or the article contains "Bon Jovi", so categorize the document under 
music, etc.).  We have entity extractors that pull out all of the words 
(tokens) along with metadata associated with each token, such as its position 
in the document (for proximity-type matching), and assert all of these as 
Jess facts.  Our rules are set up like the following:

<code>
(defrule match-token-bon-jovi
  (Token (content ?content1&/^[Bb][\\w&&[^ ]]*$/)
         (documentPosition ?pos1))
  (Token (content ?content2&/^[Jj][\\w&&[^ ]]*$/)
         (documentPosition =(+ ?pos1 1)))
  =>
  (bind ?completeMatch (str-cat ?content1 " " ?content2))
  (if (stemMatch ?completeMatch "Bon Jovi") then
    (addCategory "Music")))
</code>

Now stemMatch is a function that we created that stems the tokens (possibly 
removing trailing characters such as "s" or "ed" from the end of a word, 
which is why the regex only matches on the first letter of each token rather 
than the whole string), and addCategory simply records a category that we 
retrieve later.  Structuring the rules this way is really slow: a rule set of 
about 100 rules can take several minutes to run against a fact set of about 
50,000 facts.  At this point the only way that I can see to speed it up is to 
remove the stemMatch call on the RHS and change the regex on the LHS to match 
what the stemMatch function was accomplishing.  Is there any other way to 
speed up the process that I'm missing?  Any suggestions would be greatly 
appreciated.  Thanks for your help in advance.  Take care.
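For reference, here is roughly what I mean by moving the stemming into the 
LHS.  This is just a sketch: the "s?" suffix is a placeholder for whatever 
token variants stemMatch would actually accept, and addCategory is our own 
helper as above:

<code>
(defrule match-token-bon-jovi-lhs-only
  ;; Match the full token text (including plausible inflected forms)
  ;; directly in the pattern, so no stemMatch call is needed on the RHS.
  ;; The "s?" here stands in for the variants stemMatch handles.
  (Token (content /^[Bb]on$/) (documentPosition ?pos1))
  (Token (content /^[Jj]ovis?$/) (documentPosition =(+ ?pos1 1)))
  =>
  (addCategory "Music"))
</code>

The idea is that activations for non-matching token pairs are filtered out in 
the Rete network itself instead of reaching the agenda and being rejected on 
the RHS.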

