John: Sorry for the delay. I wanted to run a few more tests before I responded.
I guess let's begin with a couple of questions for you:

1. What are your needs/goals w.r.t. URL normalization?
2. What is your setup currently?

(Feel free to email me directly if you prefer, but the more we can
share with others, the more everyone benefits from our collective
experience.)

The RulesEngine CAN make sense for URLNormalizer, but it makes the
process more complicated without giving much benefit to 80% of the
people (a classic example of the 80/20 rule). Even though we are in the
20%, I still feel a faster approach is to write something by hand --
the stuff is already in XML; add some grouping logic and some
decision-making ability, and it's done.

-- Footprint: Of the 3 engines I looked at, none were under 500k, and
start-up time is about 10 seconds (after which they can fire off
thousands of rules a second).

-- Benchmark: 110 regular expressions in a file (the last 10 were the
ones that would satisfy, causing 100 rules to be skipped). I put the
same into a "commercial" rules engine (no names, lest they come after
me asking for this year's maintenance fee) -- the runtime was pretty
much equal for a segment of 20,000 URLs. If R is rules and N is total
URLs, this was a worst-case runtime of O(RN) -- now look below. Keep in
mind these are forward-chaining engines. They work best when rules are
static and the facts (or events) are triggered in real time. Given that
URLs are KNOWN beforehand, and there are only 3-4 classes of checks one
needs to do on them, this type of rules engine is the wrong tool for
the job. What's required here is an "Outlook Mail Rules" type of engine
(or build a simple hard-coded one). In our case (Filangy), given that
most of the information is known beforehand, we break out our segments
and rules, which more or less gives us a time complexity of O(N log(R)).

-- Other Uses: Can Nutch benefit from using a (RETE) rules engine? YES!
I think it can greatly improve the QUALITY of results.
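To make the O(N log(R)) point concrete, here is a minimal sketch in
plain Java of the break-out-segments-and-rules idea. This is NOT
Filangy's actual implementation -- the class and method names are
invented for illustration. It shows how keying rule groups by host lets
each URL run only its own group's regexes instead of all R rules (a
sorted index over the groups is what would give the log(R) term; a hash
map makes the group lookup effectively constant).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: group regex rules by host so each URL only runs
// the rules in its own group instead of all R rules.
public class GroupedUrlRules {
  private final Map<String, List<Pattern>> rulesByHost = new HashMap<>();

  public void addRule(String host, String regex) {
    rulesByHost.computeIfAbsent(host, h -> new ArrayList<>())
               .add(Pattern.compile(regex));
  }

  // True if any rule in the URL's host group matches; URLs whose host
  // has no group skip every rule.
  public boolean matches(String host, String url) {
    List<Pattern> rules = rulesByHost.get(host);
    if (rules == null) return false;
    for (Pattern p : rules) {
      if (p.matcher(url).find()) return true;
    }
    return false;
  }
}
```

The win over a flat rule list is simply that a URL for a host with no
group never touches a single regex.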
And I mean GREATLY improve the QUALITY -- it can help reduce the
garbage that comes in, you can give custom boosts based on certain
conditions, get faster indexing, and fire plugins faster. With one of
these engines you should be able to go through 1000+ "if then else"
type statements in under a second. How is this helpful? Now one can
easily analyze content while parsing to accept/reject/modify or set a
custom boost factor, which can override the log(inlink) if certain
conditions are met. (We recently added functionality to our product
that does this on a limited basis, whereby the Nutch boost is combined
with the user's Personal Score.)

Given that I have good experience with rules engines, I'd be more than
happy to put together a document that outlines how using a rules engine
can benefit Nutch, and the trigger points or types of rules that may be
required. At that point I think some of the more (Nutch) seasoned
developers can do a reality check of what is or is not possible given
the current architecture.

Regards,
CC

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of John X
Sent: Wednesday, February 16, 2005 1:58 AM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [Nutch-dev] make URLFilter as plugin

Hi, Chirag,

Thanks for your detailed report. Do you think a rules engine would be
good for UrlNormalizer? Can Nutch possibly benefit from a rules engine
in other ways?

John

On Mon, Feb 14, 2005 at 04:08:19PM -0500, Chirag Chaman wrote:
> John:
>
> We did some research and ran some tests on our system to better
> understand the needs.
>
> NOTE: The findings below are based on keeping 90% OF NUTCH INSTALLS
> in mind -- simple, out of the box, small/medium size index.
>
> In short, using a rules engine for URL filtering is a waste of
> resources. The rules engine should only be used for filtering
> pages/URLs based on content -- for example, remove a page because it
> matched adult content, is less than 2K, has a low text/tag ratio,
> etc., etc.
Do you have any benchmark numbers? How about using a rules engine for
UrlNormalizer?

> Here are the reasons why it's bad for URL filtering:
>
> - First, let's get JESS out of the picture: it's not open source, and
> its use in anything even close to commercial requires a commercial
> license ($$$).
>
> - The startup time involved with loading the rules engine is high,
> and neither are they light-weight (500-800k memory footprint).

Any light-weight implementation with a smaller memory footprint?

> - Most of the rules for URL filtering are regular expressions (REs),
> and REs execute pretty fast in Java. Thus, we did not see a
> substantial increase in performance unless we went to 100+ rules.
> Even then, with the startup time for the engine, plain simple stuff
> won out. NOTE: speed can be improved by keeping the engine loaded in
> memory at all times (instead of being loaded each time the fetch
> process is run). Given that WE do frequent indexing with segments of
> approx. 1000 pages, we'd need it in memory at all times. This may not
> be the case if segments are 100,000 or so. Either way, introducing a
> rules engine would require a change to the way the plug-in is called,
> as you need to create the RETE network upon startup/first call.
>
> - The rules engine is best when using a lot of 'if..then..else'
> statements and the "facts" are unknown until runtime (and thus why
> it's great for content filtering and bad for URL filtering). With
> URLs we know what they are before we even begin. With content we get
> all the details during parsing and need to make a decision at that
> point.
>
> - Even for the regex-filter, unless we are talking about 100+
> filters, the startup time and the requirement to change the code make
> other, simpler options more viable. For example, the XML-based
> ACTION/GROUP option I described earlier.
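For contrast, the "plain simple stuff" being compared against -- a
hand-rolled, first-match accept/deny regex filter with precompiled
patterns -- is only a few lines of Java. A hypothetical sketch (the
names are illustrative, not the actual Nutch regex-urlfilter code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: first matching rule wins, ACCEPT keeps the URL,
// DENY (or no match at all) rejects it.
public class SimpleRegexFilter {
  private static class Rule {
    final boolean accept;
    final Pattern pattern;
    Rule(boolean accept, String regex) {
      this.accept = accept;
      this.pattern = Pattern.compile(regex); // compile once, up front
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  public void accept(String regex) { rules.add(new Rule(true, regex)); }
  public void deny(String regex)   { rules.add(new Rule(false, regex)); }

  // Returns the URL if accepted, null if denied or unmatched.
  public String filter(String url) {
    for (Rule r : rules) {
      if (r.pattern.matcher(url).find()) {
        return r.accept ? url : null;
      }
    }
    return null;
  }
}
```

Because the patterns are compiled once and checked in order, there is
no startup cost beyond compiling the regexes, which is the point being
made above.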
> So, how does one attack the problem (assuming you're looking for a
> larger deployment)?
>
> We found that the bottleneck to a faster crawl and index is due to
> the following:
> 1. WebDB size
> 2. Recrawling blocked URLs (not remembering domain status across
> crawls)
>
> Point 1 should be intuitive -- the larger the DB, the more time it
> takes to sort. The second point relates to the fact that the fetcher
> does not remember the status of a domain across crawls -- if you are
> blocked from a particular domain, future fetch lists should not even
> contain URLs from that domain/directory. Another issue is when a
> domain is down -- this should also be stored for a period of time
> (say 12 hours).
>
> Also, to reduce the size of the WebDB and only store "fetchable" URLs
> in it, I think we should only add those links to the DB that would
> otherwise pass the filters specified by the user (i.e., run the
> filters before adding links to the DB).
>
> To achieve the above, we're creating a simple external database,
> which runs like a service and keeps the status of domains. The DB
> will serve 2 functions: a. act more or less as a cache for robots.txt
> files and down domains; b. provide users with a way to block
> domains/directories.
>
> The goal is to catch and remove non-crawlable URLs before they make
> it to the fetchlist, or better yet, before they get added to the
> WebDB. A simple Java API will allow a check to be made for a URL
> (think of this like a DNS server).
>
> I would appreciate your (and anyone else's) input on any other needs
> this should incorporate. This will be created using hsql or Berkeley
> DB (unless there is a better option; both of these are GPL) as the
> underlying database, for simplicity and development speed.
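The Java API for the domain-status service proposed in the quoted
message above could look roughly like the following. This is a
speculative sketch with invented names -- an in-memory stand-in for the
hsql/Berkeley DB backing store, using the 12-hour TTL idea for down
domains:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Speculative sketch of a domain-status cache: remembers blocked and
// down domains for a TTL so future fetchlists can skip them. A real
// version would persist to hsql or Berkeley DB.
public class DomainStatusCache {
  public enum Status { OK, BLOCKED, DOWN }

  private static final class Entry {
    final Status status;
    final long expiresAt;
    Entry(Status status, long expiresAt) {
      this.status = status;
      this.expiresAt = expiresAt;
    }
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();

  // Remember a domain's status for ttlMillis (e.g. 12 hours for a
  // down domain, per the suggestion above).
  public void mark(String domain, Status status, long ttlMillis) {
    cache.put(domain,
        new Entry(status, System.currentTimeMillis() + ttlMillis));
  }

  // Works like a DNS lookup: anything unknown or expired is OK to fetch.
  public Status check(String domain) {
    Entry e = cache.get(domain);
    if (e == null || e.expiresAt <= System.currentTimeMillis()) {
      return Status.OK;
    }
    return e.status;
  }
}
```

A fetchlist generator would call check() per host and drop URLs whose
domain is BLOCKED or DOWN before they ever reach the fetcher.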
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of John X
> Sent: Tuesday, February 08, 2005 6:02 PM
> To: Chirag Chaman
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [Nutch-dev] make URLFilter as plugin
>
> On Tue, Feb 08, 2005 at 09:41:28AM -0500, Chirag Chaman wrote:
> > John:
> >
> > We tested with QuickRules (YasuTech).
> > The only non-commercial one I've used is Jess -- though it may have
> > license issues.
> >
> > I know there is a big move to get an open source XML rules engine
> > made, especially since the RFC is now stabilized, so there should
> > be some strong products coming out (hopefully soon).
> >
> > I think for now, something simple that incorporates GROUP and STOP
> > should be sufficient for 80% of the needs (80/20 rule), as it will
> > be flexible and fast (and you can skip over unnecessary rules).
> >
> > If you need any help -- please let me know. (I'm not the best coder
> > around, but I can definitely have one of my engineers follow your
> > lead.)
>
> The current interface URLFilter.java may be too simple.
> If you or your engineers can make a suggestion/evaluation for typical
> Nutch needs, that would be great. The best would be some sample code
> with Jess.
> This is only about URL filtering.
>
> Thanks,
>
> John
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real
> users. Discover which products truly live up to the hype. Start
> reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers

__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!
