 ID:               22820
 Updated by:       [EMAIL PROTECTED]
 Reported By:      nick at axelis dot com
-Status:           Open
+Status:           Feedback
 Bug Type:         Reproducible crash
 Operating System: Windows 2000 sp3
 PHP Version:      4.3.1
 New Comment:

About the Apache2 sapi, you need Apache 2.0.44 installed.

About the cli problem, please provide a _SHORT_ example script which
we can use to test this. And I mean a script that is max. 15-20 lines
long and runs as-is.


Previous Comments:
------------------------------------------------------------------------

[2003-03-23 19:35:43] nick at axelis dot com

Ok. I got the latest snapshot and applied it. The results were not what
I would expect. With the new snapshot I can't use the sapi mod for
apache 2; apache won't load with it. I've now got it configured to use
the CGI, and that works. The problem, however, still remains; there is
no change.

------------------------------------------------------------------------

[2003-03-22 04:37:31] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php4-STABLE-latest.tar.gz

For Windows:

  http://snaps.php.net/win32/php4-win32-STABLE-latest.zip

------------------------------------------------------------------------

[2003-03-21 23:35:09] nick at axelis dot com

I've tried running this in a browser and end up with a "document
contains no data" error. The script is intended to run from the command
prompt. I'm running it in two environments. One is Red Hat 8.0, PHP
4.2.2, Apache 2.0.40. The other is win2k sp3, PHP 4.3.1, Apache 2.0.44.

On the linux box it runs like a champ. It's fast, it's furious. On
windows it starts out fine, but then at a certain point it just starts
hammering the hard drive and leaves me at a command prompt. It doesn't
seem to happen at a specific place in the script. It seems more like a
memory allocation problem. It does not return any errors. I've found
nothing in any of the system logs, apache log, php error log, nothing.
I did once get an error that said: "erealloc(), failed to allocate 11
bytes." This only happened once, though; all of the other times it just
dies.

The script is a search engine spider. If I run it on a site with 20 or
30 pages to index it works great.
If I hit a site that's bigger, it dies, but in a different place
depending on the site. I've tested on at least 10 different sites with
over 200 pages. The timing is consistent within a particular site; it
always dies at the same place. I've done enough testing to ensure that
the sites themselves are not the problem.

Here's the script below:

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);
echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Start the big loop
do {
    // Open the database and start looking at URLs
    $sql = mysql_query("SELECT * FROM search WHERE flag=0");
    while($rslt = mysql_fetch_array($sql)){
        $flag = $rslt["flag"];
        $url = $rslt["url"];
        $crc = $rslt["checksum"];
        $date = $rslt["date"];

        // Don't make them wait
        echo "\n\nWorking . . .\n";

        // Don't go there if you don't have to
        if($flag == 1){
            continue;
        }

        // Set the user agent to be sent
        ini_set('user_agent',$spiderhost);

        // Open URL for parsing
        $open = @fopen("$url", "r");
        if($open){
            $read = fread($open, 100000);
            fclose($open);
        }
        else{
            $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            continue;
        }

        // Set date and checksum info
        $today = date("Y-m-d");
        $checksum = crc32($read);
        $chkyr = strftime(date("Y"));
        $chkmo = strftime(date("m"));
        $chkdy = strftime(date("d"));
        $chkdy = $chkdy - $spiderday;
        $daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));

        // Get meta tags and use get_meta_tags to check if the file is actually there
        $meta = @get_meta_tags($url);
        if(!$meta){
            $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            continue;
        }
        $robots = $meta["robots"];
        $keywords = $meta["keywords"];
        $description = $meta["description"];

        // Check robots meta tags
        $metarobots = "noindex";
        if(checkmetarobots($metarobots)){
            echo "Indexing disallowed by robots meta tag: $url\n";
            continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
            echo "Indexing disallowed by robots meta tag: $url\n";
            continue;
        }

        // Get the page title
        $temp = spliti("title>",$read,3);
        $title = substr($temp[1],0,-2);

        // Get the page body
        $body = str_replace("'","`",trim(strip_tags($read)));

        // Make an announcement
        echo "Now Processing: $url\n";

        // "Put the stuff in the search database\n";
        if($crc != $checksum){
            echo "Updating for CRC: $title\n$url\n";
            $renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url<br>mysql_error()\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
            }
        }
        elseif($date <= $daycheck){
            echo "Updating for date: $title\n$url\n";
            $renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url<br>mysql_error()\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
            }
        }
        else{
            $renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url" . mysql_error() . "\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            }
            continue;
        }

        // Check robots meta tags
        $metarobots = "nofollow";
        if(checkmetarobots($metarobots)){
            echo "Following disallowed by robots meta tag: $url\n";
            continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
            echo "Following disallowed by robots meta tag: $url\n";
            continue;
        }

        // "Parse the main URL\n";
        $top = parse_url($url);
        $tschm = $top["scheme"];
        $thost = $top["host"];
        $tpath = $top["path"];
        $tqury = $top["query"];
        $tfrag = $top["fragment"];
        $currentdomain = $tschm . "://" . $thost;

        // Parse all the links on the page
        $rtemp = stristr($read,"href");
        $temp = stristr($rtemp,">");
        while($rtemp){
            //"Parse the href out of the string\n";
            $rtemp = stristr($temp,"href");
            $lpos = strlen($rtemp) - strlen($temp);
            $temp = stristr($rtemp,">");
            $lend = strlen($rtemp) - strlen($temp);
            $alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6, ($lend)))));
            $blink = stristr($alink," ");
            $alen = strlen($alink) - strlen($blink);
            $link = substr($alink, 0, $alen);

            // Kill any trailing slashes
            if(substr($link,(strlen($link)-1)) == "/"){
                $link = substr($link,0,(strlen($link)-1));
            }
            if(checkforgarbage()){
                continue;
            }

            // Parse the current link
            $bot = @parse_url($link);
            if(!$bot){
                continue;
            }
            $bschm = $bot["scheme"];
            $bhost = $bot["host"];
            $bpath = $bot["path"];
            $bqury = $bot["query"];
            $bfrag = $bot["fragment"];

            // Execute robots exclusion standard via robots.txt
            if(checkrobotstxt()){
                echo "Disallowed by robots.txt: $link\n";
                continue;
            }

            // Kill off any fragment based URLs
            if(strlen($bfrag) > 0){
                continue;
            }

            // Get rid of outside links
            if($bhost != "" && $bhost != $thost){
                continue;
            }

            // Kill off any dot dots ../../
            $ddotcheck = substr_count($bpath,"../");
            if($ddotcheck != ""){
                $lpos = strrpos($bpath,"..");
                $bpath = substr($bpath,$lpos);
            }

            // Comparative analysis
            if($bpath != "" && substr($bpath,0,1) != "/"){
                if(strrpos($tpath,".") === false){
                    $bpath = $tpath . "/" . $bpath;
                }
                if(strrpos($tpath,".")){
                    $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
                    $bpath = $ttmp . $bpath;
                    if(substr($bpath,0,1) != "/"){
                        $bpath = "/" . $bpath;
                    }
                }
            }

            // Check to see if the scheme and domain are in the url
            if($bhost == ""){
                $link = $tschm . "://" . $thost . $bpath;
            }

            // Kill any trailing slashes
            if(substr($link,(strlen($link)-1)) == "/"){
                $link = substr($link,0,(strlen($link)-1));
            }

            // If there is a query string put it back on
            if($bqury != ""){
                $link = $link . "?" . $bqury;
            }

            // Don't be overly recursive
            if($link == $currentdomain){
                continue;
            }

            // If it's a useless link, kill it
            if($link == ""){
                continue;
            }
            if(!checkandupdatetoindexer()){
                continue;
            }
        }

        // Take the new URLs and put them in the search database, or finish if there are no more
        $movem = mysql_query("SELECT url FROM indexer");
        while($mvrslt = mysql_fetch_array($movem)){
            $murl = $mvrslt["url"];
            $putem = mysql_query("INSERT INTO search SET url='$murl'");
        }
        $kill = mysql_query("DELETE FROM indexer");
    }
    $preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count FROM search WHERE checksum='0'"));
    $loopcount = $preloop[0];
} while($loopcount > 0);

$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");
echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";

////// Spider Functions //////

function checkandupdatetoindexer(){
    global $link;
    // "Put the new URL in the search database\n";
    $chk = @mysql_query("SELECT url FROM search");
    while($curec = mysql_fetch_array($chk)){
        $curchk = $curec["url"];
        if($curchk == $link){
            return FALSE;
        }
    }
    echo "Adding: $link\n";
    $putup = mysql_query("INSERT INTO indexer SET url='$link'");
    return TRUE;
}

function checkforgarbage(){
    global $link;
    // "Get rid of any garbage and most binary files in the link\n";
    if(substr_count(strtolower($link),"&?") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"@") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"javascript") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mailto") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"jpg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"gif") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"pdf") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"pnf") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mpg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mpeg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"avi") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mp3") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"wav") != 0){
        return TRUE;
    }
    return FALSE;
}

function checkmetarobots(){
    global $robots, $metarobots;
    if(substr_count($robots,$metarobots) > 0){
        return TRUE;
    }
    return FALSE;
}

function checkrobotstxt(){
    global $currentdomain, $bpath, $spiderhost;
    $getbot = $currentdomain . "/robots.txt";
    $robotay = @file($getbot);
    if(!$robotay){
        return FALSE;
    }
    $robotaycount = count($rebotay);
    $roop = 0;
    while($roop <= $robotaycount){
        $curele = $robotay[$roop];
        if($curele == ""){
            continue;
        }
        $thecolon = strpos($curele,":");
        if(substr($curele,0,$thecolon) == "User-agent:"){
            $robgent = trim(substr($curele,$thecolon+1));
            if($robgent == "*" || $robgent == $spiderhost){
                $dospider = 1;
            }
            else{
                $dospider = 0;
            }
        }
        if(substr($curele,0,$thecolon) == "Disallow:"){
            $robdis = trim(substr($curele,$thecolon+1));
            echo "$robdis\n";
            $roblen = strlen($robdis);
            if(substr($bpath,0,$roblen) == $robdis && $dospider == 1){
                return TRUE;
            }
        }
        ++$roop;
    }
    return FALSE;
}
?>

------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=22820&edit=1
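A side note on checkrobotstxt() in the script above, possibly unrelated to the crash itself: as written it counts $rebotay rather than $robotay, so the loop appears to look at only the first line of robots.txt. It also compares substr($curele,0,$thecolon), which stops before the colon, against "User-agent:" and "Disallow:", so neither branch can match, and the continue on an empty element skips ++$roop. Below is a minimal sketch of the same check as a pure function over an array of lines, so it can be run without a network or database. The name robots_disallows() and its parameters are hypothetical, and the matching is deliberately as simple as in the original (plain prefix match, one User-agent line per group), not a full robots.txt parser.

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// $path is disallowed for $agent according to $lines, an array of
// robots.txt lines such as the one file() returns.
function robots_disallows($lines, $path, $agent)
{
    $applies = false; // does the most recent User-agent line apply to us?
    foreach ($lines as $line) {
        $line = trim($line);
        $colon = strpos($line, ':');
        // Skip blank lines, comments, and anything that is not "field: value".
        if ($line === '' || $line[0] === '#' || $colon === false) {
            continue;
        }
        $field = strtolower(trim(substr($line, 0, $colon)));
        $value = trim(substr($line, $colon + 1));
        if ($field === 'user-agent') {
            $applies = ($value === '*' || $value === $agent);
        } elseif ($field === 'disallow' && $applies && $value !== '') {
            // Simple prefix match, as in the original script.
            if (strncmp($path, $value, strlen($value)) === 0) {
                return true;
            }
        }
    }
    return false;
}

// Usage with hard-coded lines instead of a fetched robots.txt:
$lines = array("User-agent: *\n", "Disallow: /private\n");
var_dump(robots_disallows($lines, '/private/page.html', 'spider')); // bool(true)
var_dump(robots_disallows($lines, '/public/index.html', 'spider')); // bool(false)
```

Because foreach walks the array directly, there is no separate counter to forget to increment, and an empty "Disallow:" value is treated as "nothing disallowed" rather than matching every path.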