From: nick at axelis dot com
Operating system: Windows 2000 sp3
PHP version: 4.3.1
PHP Bug Type: Reproducible crash
Bug description: script kicks out to command prompt.
I've tried running this in a browser and ended up with a "document contains no data" error. The script is intended to run from the command prompt. I'm running it in two environments:

1. Red Hat 8.0, PHP 4.2.2, Apache 2.0.40
2. Windows 2000 SP3, PHP 4.3.1, Apache 2.0.44

On the Linux box it runs like a champ. It's fast, it's furious. On Windows it starts out fine, but then at a certain point it just starts hammering the hard drive and leaves me at a command prompt. It doesn't seem to happen at a specific place in the script. It seems more like a memory allocation problem. It does not return any errors. I've found nothing in any of the system logs, the Apache log, or the PHP error log. I did once get an error that said: "erealloc(), failed to allocate 11 bytes." That happened only once, though; all of the other times it just dies.

The script is a search engine spider. If I run it on a site with 20 or 30 pages to index it works great. If I hit a site that's bigger, it dies, but in a different place depending on the site. I've tested on at least 10 different sites with over 200 pages. The timing is consistent within a particular site: it always dies at the same place. I've done enough testing to ensure that the sites themselves are not the problem.

Here's the script:

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);
echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Start the big loop
do {
    // Open the database and start looking at URLs
    $sql = mysql_query("SELECT * FROM search WHERE flag=0");
    while($rslt = mysql_fetch_array($sql)){
        $flag = $rslt["flag"];
        $url = $rslt["url"];
        $crc = $rslt["checksum"];
        $date = $rslt["date"];

        // Don't make them wait
        echo "\n\nWorking . . .\n";

        // Don't go there if you don't have to
        if($flag == 1){
            continue;
        }

        // Set the user agent to be sent
        ini_set('user_agent',$spiderhost);

        // Open URL for parsing
        $open = @fopen("$url", "r");
        if($open){
            $read = fread($open, 100000);
            fclose($open);
        }
        else{
            $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            continue;
        }

        // Set date and checksum info
        $today = date("Y-m-d");
        $checksum = crc32($read);
        $chkyr = strftime(date("Y"));
        $chkmo = strftime(date("m"));
        $chkdy = strftime(date("d"));
        $chkdy = $chkdy - $spiderday;
        $daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));

        // Get meta tags and use get_meta_tags to check if the file is actually there
        $meta = @get_meta_tags($url);
        if(!$meta){
            $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            continue;
        }
        $robots = $meta["robots"];
        $keywords = $meta["keywords"];
        $description = $meta["description"];

        // Check robots meta tags
        $metarobots = "noindex";
        if(checkmetarobots($metarobots)){
            echo "Indexing disallowed by robots meta tag: $url\n";
            continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
            echo "Indexing disallowed by robots meta tag: $url\n";
            continue;
        }

        // Get the page title
        $temp = spliti("title>",$read,3);
        $title = substr($temp[1],0,-2);

        // Get the page body
        $body = str_replace("'","`",trim(strip_tags($read)));

        // Make an announcement
        echo "Now Processing: $url\n";

        // "Put the stuff in the search database\n";
        if($crc != $checksum){
            echo "Updating for CRC: $title\n$url\n";
            $renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url<br>mysql_error()\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
            }
        }
        elseif($date <= $daycheck){
            echo "Updating for date: $title\n$url\n";
            $renew = @mysql_query("UPDATE search SET url='$url', title='$title', metak='$keywords', metad='$description', mrobot='$robots', checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url<br>mysql_error()\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
            }
        }
        else{
            $renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
            if(!$renew){
                echo "NOT UPDATED: $url" . mysql_error() . "\n";
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
            }
            continue;
        }

        // Check robots meta tags
        $metarobots = "nofollow";
        if(checkmetarobots($metarobots)){
            echo "Following disallowed by robots meta tag: $url\n";
            continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
            echo "Following disallowed by robots meta tag: $url\n";
            continue;
        }

        // "Parse the main URL\n";
        $top = parse_url($url);
        $tschm = $top["scheme"];
        $thost = $top["host"];
        $tpath = $top["path"];
        $tqury = $top["query"];
        $tfrag = $top["fragment"];
        $currentdomain = $tschm . "://" . $thost;

        // Parse all the links on the page
        $rtemp = stristr($read,"href");
        $temp = stristr($rtemp,">");
        while($rtemp){
            //"Parse the href out of the string\n";
            $rtemp = stristr($temp,"href");
            $lpos = strlen($rtemp) - strlen($temp);
            $temp = stristr($rtemp,">");
            $lend = strlen($rtemp) - strlen($temp);
            $alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6, ($lend)))));
            $blink = stristr($alink," ");
            $alen = strlen($alink) - strlen($blink);
            $link = substr($alink, 0, $alen);

            // Kill any trailing slashes
            if(substr($link,(strlen($link)-1)) == "/"){
                $link = substr($link,0,(strlen($link)-1));
            }
            if(checkforgarbage()){
                continue;
            }

            // Parse the current link
            $bot = @parse_url($link);
            if(!$bot){
                continue;
            }
            $bschm = $bot["scheme"];
            $bhost = $bot["host"];
            $bpath = $bot["path"];
            $bqury = $bot["query"];
            $bfrag = $bot["fragment"];

            // Execute robots exclusion standard via robots.txt
            if(checkrobotstxt()){
                echo "Disallowed by robots.txt: $link\n";
                continue;
            }

            // Kill off any fragment based URLs
            if(strlen($bfrag) > 0){
                continue;
            }

            // Get rid of outside links
            if($bhost != "" && $bhost != $thost){
                continue;
            }

            // Kill off any dot dots ../../
            $ddotcheck = substr_count($bpath,"../");
            if($ddotcheck != ""){
                $lpos = strrpos($bpath,"..");
                $bpath = substr($bpath,$lpos);
            }

            // Comparative analysis
            if($bpath != "" && substr($bpath,0,1) != "/"){
                if(strrpos($tpath,".") === false){
                    $bpath = $tpath . "/" . $bpath;
                }
                if(strrpos($tpath,".")){
                    $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
                    $bpath = $ttmp . $bpath;
                    if(substr($bpath,0,1) != "/"){
                        $bpath = "/" . $bpath;
                    }
                }
            }

            // Check to see if the scheme and domain are in the url
            if($bhost == ""){
                $link = $tschm . "://" . $thost . $bpath;
            }

            // Kill any trailing slashes
            if(substr($link,(strlen($link)-1)) == "/"){
                $link = substr($link,0,(strlen($link)-1));
            }

            // If there is a query string put it back on
            if($bqury != ""){
                $link = $link . "?" . $bqury;
            }

            // Don't be overly recursive
            if($link == $currentdomain){
                continue;
            }

            // If it's a useless link, kill it
            if($link == ""){
                continue;
            }
            if(!checkandupdatetoindexer()){
                continue;
            }
        }

        // Take the new URLs and put them in the search database, or finish if there are no more
        $movem = mysql_query("SELECT url FROM indexer");
        while($mvrslt = mysql_fetch_array($movem)){
            $murl = $mvrslt["url"];
            $putem = mysql_query("INSERT INTO search SET url='$murl'");
        }
        $kill = mysql_query("DELETE FROM indexer");
    }
    $preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count FROM search WHERE checksum='0'"));
    $loopcount = $preloop[0];
} while($loopcount > 0);

$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");
echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";

////// Spider Functions //////

function checkandupdatetoindexer(){
    global $link;
    // "Put the new URL in the search database\n";
    $chk = @mysql_query("SELECT url FROM search");
    while($curec = mysql_fetch_array($chk)){
        $curchk = $curec["url"];
        if($curchk == $link){
            return FALSE;
        }
    }
    echo "Adding: $link\n";
    $putup = mysql_query("INSERT INTO indexer SET url='$link'");
    return TRUE;
}

function checkforgarbage(){
    global $link;
    // "Get rid of any garbage and most binary files in the link\n";
    if(substr_count(strtolower($link),"&?") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"@") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"javascript") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mailto") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"jpg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"gif") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"pdf") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"pnf") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mpg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mpeg") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"avi") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"mp3") != 0){
        return TRUE;
    }
    if(substr_count(strtolower($link),"wav") != 0){
        return TRUE;
    }
    return FALSE;
}

function checkmetarobots(){
    global $robots, $metarobots;
    if(substr_count($robots,$metarobots) > 0){
        return TRUE;
    }
    return FALSE;
}

function checkrobotstxt(){
    global $currentdomain, $bpath, $spiderhost;
    $getbot = $currentdomain . "/robots.txt";
    $robotay = @file($getbot);
    if(!$robotay){
        return FALSE;
    }
    $robotaycount = count($rebotay);
    $roop = 0;
    while($roop <= $robotaycount){
        $curele = $robotay[$roop];
        if($curele == ""){
            continue;
        }
        $thecolon = strpos($curele,":");
        if(substr($curele,0,$thecolon) == "User-agent:"){
            $robgent = trim(substr($curele,$thecolon+1));
            if($robgent == "*" || $robgent == $spiderhost){
                $dospider = 1;
            }
            else{
                $dospider = 0;
            }
        }
        if(substr($curele,0,$thecolon) == "Disallow:"){
            $robdis = trim(substr($curele,$thecolon+1));
            echo "$robdis\n";
            $roblen = strlen($robdis);
            if(substr($bpath,0,$roblen) == $robdis && $dospider == 1){
                return TRUE;
            }
        }
        ++$roop;
    }
    return FALSE;
}
?>

--
Edit bug report at http://bugs.php.net/?id=22820&edit=1
--
Try a CVS snapshot: http://bugs.php.net/fix.php?id=22820&r=trysnapshot
Fixed in CVS: http://bugs.php.net/fix.php?id=22820&r=fixedcvs
Fixed in release: http://bugs.php.net/fix.php?id=22820&r=alreadyfixed
Need backtrace: http://bugs.php.net/fix.php?id=22820&r=needtrace
Try newer version: http://bugs.php.net/fix.php?id=22820&r=oldversion
Not developer issue: http://bugs.php.net/fix.php?id=22820&r=support
Expected behavior: http://bugs.php.net/fix.php?id=22820&r=notwrong
Not enough info: http://bugs.php.net/fix.php?id=22820&r=notenoughinfo
Submitted twice: http://bugs.php.net/fix.php?id=22820&r=submittedtwice
register_globals: http://bugs.php.net/fix.php?id=22820&r=globals
PHP 3 support discontinued: http://bugs.php.net/fix.php?id=22820&r=php3
Daylight Savings: http://bugs.php.net/fix.php?id=22820&r=dst
IIS Stability: http://bugs.php.net/fix.php?id=22820&r=isapi
Install GNU Sed: http://bugs.php.net/fix.php?id=22820&r=gnused