Just a guess, but do you have an index on the table that you are using to store the URLs that still need to be parsed? This table is going to get huge! And if you do not delete each URL from the list once you have parsed it, it will grow even faster.

If you do not have an index on that table, every check to see whether a new URL is already in it is a table scan, and that is going to take longer and longer every time you process another URL. This temp table of URLs to process will keep getting larger and will rarely go down in size, because you add about 5+ new URLs for every one that you process.

But then again, we can't know anything for sure without seeing some code. So far we have not seen any, so everything is speculation and guessing. I would be interested in seeing the code that handles the processing of the URLs once you cull them from a web page.
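To make the indexing point concrete, here is a minimal sketch of a URL queue with a unique index. It is only an illustration under assumptions: it uses Python with an in-memory SQLite database rather than the PHP and MySQL setup in this thread, and the table name `url_queue`, the index name, and the helper functions are all made up for the example, since no code has been posted yet.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical schema: one row per URL still waiting to be parsed.
    conn.execute(
        "CREATE TABLE url_queue (id INTEGER PRIMARY KEY, url TEXT NOT NULL)"
    )
    # The unique index lets the "is this URL already queued?" check use an
    # index lookup instead of scanning the whole table, and it rejects
    # duplicate URLs at the database level.
    conn.execute("CREATE UNIQUE INDEX idx_url_queue_url ON url_queue (url)")


def enqueue(conn, url):
    """Queue url unless it is already there; return True if it was added."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO url_queue (url) VALUES (?)", (url,)
    )
    return cur.rowcount == 1


def pop_next(conn):
    """Return the oldest queued URL and delete it, or None if the queue is empty."""
    row = conn.execute(
        "SELECT id, url FROM url_queue ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Deleting the processed row keeps the queue from growing without bound.
    # A real crawler would also need to remember visited URLs somewhere,
    # otherwise a deleted URL can be queued again the next time it is seen.
    conn.execute("DELETE FROM url_queue WHERE id = ?", (row[0],))
    return row[1]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_queue(conn)
    for link in ["http://intranet.example/a", "http://intranet.example/a"]:
        print(link, "queued" if enqueue(conn, link) else "already queued")
```

In MySQL the equivalent idea would be a unique index on the URL column, so that the duplicate check for each stripped link stays fast as the table grows.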
Jim Hunter

-------Original Message-------

From: Nicholas Fitzgerald
Date: Wednesday, March 12, 2003 10:15:52 AM
To: [EMAIL PROTECTED]
Subject: Re: [PHP-DB] Real Killer App!

Rich Gray wrote:

>>I'm having a heck of a time trying to write a little web crawler for my
>>intranet. I've got everything functionally working it seems like, but
>>there is a very strange problem I can't nail down. If I put in an entry
>>and start the crawler it goes great through the first loop. It gets the
>>url, gets the page info, puts it in the database, and then parses all of
>>the links out and puts them raw into the database. On the second loop it
>>picks up all the new stuff and does the same thing. By the time the
>>second loop is completed I'll have just over 300 items in the database.
>>On the third loop is where the problem starts. Once it gets into the
>>third loop, it starts to slow down a lot. Then, after a while, if I'm
>>running from the command line, it'll just go to a command prompt. If I'm
>>running in a browser, it returns a "document contains no data" error.
>>This is with php 4.3.1 on a win2000 server. Haven't tried it on a linux
>>box yet, but I'd rather run it on the windows server since it's bigger
>>and has plenty of cpu, memory, and raid space. It's almost like the
>>thing is getting confused when it starts to get more than 300 entries in
>>the database. Any ideas out there as to what would cause this kind of
>>problem?
>>
>>Nick
>>
>
>Can you post some code? Are your script timeouts set appropriately? Does
>memory/CPU usage increase dramatically or are there any other symptoms of
>where it is choking...? What DB is it updating? What does the database tell
>you is happening when it starts choking? What do debug messages tell you wrt
>finding the bottleneck? Does it happen always no matter what start point is
>used? Are you using recursive functions?
>
>Sorry lots of questions but no answers... :)
>
>Cheers
>Rich
>

Recognizing that this script would take a long time to run, I'm using set_time_limit(0) in it so a timeout doesn't become an issue. The server has 1.5 gig of memory and is a dual processor 1GHz PIII. I have never seen it get over 15% cpu usage, even while this is going on, and it never gets anywhere near full memory usage. The tax on the system itself is actually negligible. There are no symptoms that I can find to indicate where the chokepoint might be. It seems to be when the DB reaches a certain size, but 300 or so records should be a piece of cake for it.

As far as the debug backtrace, there really isn't anything there that stands out. It's not an issue with a variable; something is going wrong in the execution either of php or of a sql query. I'm not finding any errors in the mysql error log, or anywhere else.

Basically the prog is in two parts. First, it goes and gets the current contents of the DB, one record at a time, and checks it. If it meets the criteria it is then indexed or reindexed. If it is indexed, then it goes to the second part. This is where it strips any links from the page and puts them in the DB for indexing, if they're not already there. When it dies, this is where it dies. I'll get the "UPDATING: <title><url>" message that comes up when it does an update, but at that point, where it is going into strip links, it dies right there.

Nick