From:             nick at axelis dot com
Operating system: Windows 2000 sp3
PHP version:      4.3.1
PHP Bug Type:     Reproducible crash
Bug description:  script kicks out to command prompt.

I've tried running this in a browser and end up with a "document contains
no data" error. The script is intended to run from the command prompt. I'm
running it in two environments. One is Red Hat 8.0, PHP 4.2.2, Apache
2.0.40. The other is Win2k SP3, PHP 4.3.1, Apache 2.0.44. On the Linux box
it runs like a champ. It's fast, it's furious. On Windows it starts out
fine, but then at a certain point it just starts hammering the hard drive
and leaves me at a command prompt. It doesn't seem to happen at a specific
place in the script. It seems more like a memory allocation problem. It
does not return any errors. I've found nothing in any of the system logs,
the Apache log, or the PHP error log. I did once get an error that said:
"erealloc(), failed to allocate 11 bytes." This happened only once,
though; all of the other times it just dies. The script is a search engine
spider. If I run it on a site with 20 or 30 pages to index it works great.
If I hit a site that's bigger, it dies, but in a different place depending
on the site. I've tested on at least 10 different sites with over 200
pages. The timing is consistent within a particular site: it always dies
at the same place. I've done enough testing to ensure that the sites
themselves are not the problem. Here's the script below:

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url, $spiderday;
set_time_limit(0);

echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Start the big loop
do {

// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search WHERE flag=0");
while($rslt = mysql_fetch_array($sql)){
        $flag = $rslt["flag"];
        $url = $rslt["url"];
        $crc = $rslt["checksum"];
        $date = $rslt["date"];

// Don't make them wait
        echo "\n\nWorking . . .\n";

// Don't go there if you don't have to
        if($flag == 1){
                continue;
        }

// Set the user agent to be sent
        ini_set('user_agent',$spiderhost);

// Open URL for parsing
        $open = @fopen("$url", "r");
        if($open){
                $read = fread($open, 100000);
                fclose($open);
        }
        else{
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
        }

// Set date and checksum info
        $today = date("Y-m-d");
        $checksum = crc32($read);
        $chkyr = date("Y");
        $chkmo = date("m");
        $chkdy = date("d");
        $chkdy = $chkdy - $spiderday;
        $daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));

// Get meta tags and use get_meta_tags to check if the file is actually there
        $meta = @get_meta_tags($url);
        if(!$meta){
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
        }
        $robots = $meta["robots"];
        $keywords = $meta["keywords"];
        $description = $meta["description"];

// Check robots meta tags
        $metarobots = "noindex";
        if(checkmetarobots($metarobots)){
                echo "Indexing disallowed by robots meta tag: $url\n";
                continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
                echo "Indexing disallowed by robots meta tag: $url\n";
                continue;
        }


// Get the page title
        $temp = spliti("title>",$read,3);
        $title = substr($temp[1],0,-2);

// Get the page body
        $body = str_replace("'","`",trim(strip_tags($read)));

// Make an announcement
        echo "Now Processing: $url\n";

// "Put the stuff in the search database\n";
        if($crc != $checksum){
                echo "Updating for CRC: $title\n$url\n";
                $renew = @mysql_query("UPDATE search SET url='$url', title='$title',
metak='$keywords', metad='$description', mrobot='$robots',
checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE
url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url " . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                        continue;
                }
        }
        elseif($date <= $daycheck){
                echo "Updating for date: $title\n$url\n";
                $renew = @mysql_query("UPDATE search SET url='$url', title='$title',
metak='$keywords', metad='$description', mrobot='$robots',
checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE
url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url " . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                        continue;
                }

        }
        else{
                $renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url " . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                }
                continue;
        }

// Check robots meta tags
        $metarobots = "nofollow";
        if(checkmetarobots($metarobots)){
                echo "Following disallowed by robots meta tag: $url\n";
                continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
                echo "Following disallowed by robots meta tag: $url\n";
                continue;
        }

// "Parse the main URL\n";
        $top = parse_url($url);
        $tschm = $top["scheme"];
        $thost = $top["host"];
        $tpath = $top["path"];
        $tqury = $top["query"];
        $tfrag = $top["fragment"];

$currentdomain = $tschm . "://" . $thost;

// Parse all the links on the page
        $rtemp = stristr($read,"href"); 
        $temp = stristr($rtemp,">");
        while($rtemp){
        //"Parse the href out of the string\n";
                $rtemp = stristr($temp,"href"); 
                $lpos = strlen($rtemp) - strlen($temp);
                $temp = stristr($rtemp,">");
                $lend = strlen($rtemp) - strlen($temp);
                $alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6,
($lend)))));
                $blink = stristr($alink," ");
                $alen = strlen($alink) - strlen($blink);
                $link = substr($alink, 0, $alen);

        // Kill any trailing slashes
                if(substr($link,(strlen($link)-1)) == "/"){
                        $link = substr($link,0,(strlen($link)-1));
                }

                if(checkforgarbage()){
                        continue;
                }

        // Parse the current link
                $bot = @parse_url($link);
                if(!$bot){
                        continue;
                }
                $bschm = $bot["scheme"];
                $bhost = $bot["host"];
                $bpath = $bot["path"];
                $bqury = $bot["query"];
                $bfrag = $bot["fragment"];

        // Execute robots exclusion standard via robots.txt
                if(checkrobotstxt()){
                        echo "Disallowed by robots.txt: $link\n";
                        continue;
                }

        // Kill off any fragment based URLs
                if(strlen($bfrag) > 0){
                        continue;
                }

        // Get rid of outside links
                if($bhost != "" && $bhost != $thost){
                        continue;
                }

        // Kill off any dot dots ../../ 
                $ddotcheck = substr_count($bpath,"../");
                if($ddotcheck != ""){
                        $lpos = strrpos($bpath,"..");
                        $bpath = substr($bpath,$lpos);
                }

        // Comparative analysis
                if($bpath != "" && substr($bpath,0,1) != "/"){
                        if(strrpos($tpath,".") === false){
                                $bpath = $tpath . "/" . $bpath;
                        }
                        if(strrpos($tpath,".")){
                                $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
                                $bpath = $ttmp . $bpath;
                                if(substr($bpath,0,1) != "/"){
                                        $bpath = "/" . $bpath;
                                }
                        }
                }

        // Check to see if the scheme and domain are in the url
                if($bhost == ""){
                        $link = $tschm . "://" . $thost . $bpath;
                }

        // Kill any trailing slashes
                if(substr($link,(strlen($link)-1)) == "/"){
                        $link = substr($link,0,(strlen($link)-1));
                }

        // If there is a query string put it back on
                if($bqury != ""){
                        $link = $link . "?" . $bqury;
                }

        // Don't be overly recursive
                if($link == $currentdomain){
                        continue;
                }

        // If it's a useless link, kill it
                if($link == ""){
                        continue;
                }

                if(!checkandupdatetoindexer()){
                        continue;
                }
        }

// Take the new URLs and put them in the search database, or finish if there are no more
$movem = mysql_query("SELECT url FROM indexer");
while($mvrslt = mysql_fetch_array($movem)){
        $murl = $mvrslt["url"];
        $putem = mysql_query("INSERT INTO search SET url='$murl'");
}
$kill = mysql_query("DELETE FROM indexer");
}
$preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count
FROM search WHERE checksum='0'"));
$loopcount = $preloop[0];
} while($loopcount > 0);

$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");

echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";


//////  Spider Functions   //////

function checkandupdatetoindexer(){
        global $link;
        // "Put the new URL in the search database\n";
                $chk = @mysql_query("SELECT url FROM search");
                while($curec = mysql_fetch_array($chk)){
                        $curchk = $curec["url"];
                        if($curchk == $link){
                                return FALSE;
                        }
                }
                echo "Adding: $link\n";
                $putup = mysql_query("INSERT INTO indexer SET url='$link'");
                return TRUE;
}

function checkforgarbage(){
                global $link;
                // "Get rid of any garbage and most binary files in the link\n";
                if(substr_count(strtolower($link),"&?") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"@") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"javascript") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mailto") != 0){
                        return TRUE;
                }
                
                if(substr_count(strtolower($link),"jpg") != 0){
                        return TRUE;
                }
                
                if(substr_count(strtolower($link),"gif") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"pdf") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"png") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mpg") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mpeg") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"avi") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mp3") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"wav") != 0){
                        return TRUE;
                }
                
                return FALSE;
}

function checkmetarobots(){
        global $robots, $metarobots;
        if(substr_count($robots,$metarobots) > 0){
                return TRUE;
        }
        return FALSE;
}

function checkrobotstxt(){
        global $currentdomain, $bpath, $spiderhost;

        $getbot = $currentdomain . "/robots.txt";
        $robotay = @file($getbot);
        if(!$robotay){
                return FALSE;
        }
        $robotaycount = count($robotay);
        // Until a matching User-agent line is seen, no Disallow rule applies
        $dospider = 0;
        for($roop = 0; $roop < $robotaycount; ++$roop){
                $curele = trim($robotay[$roop]);
                $thecolon = strpos($curele,":");
                // Skip blank lines and anything that is not a "field: value" line
                if($thecolon === false){
                        continue;
                }
                // substr() stops before the colon, so compare without it
                $field = substr($curele,0,$thecolon);
                if($field == "User-agent"){
                        $robgent = trim(substr($curele,$thecolon+1));
                        if($robgent == "*" || $robgent == $spiderhost){
                                $dospider = 1;
                        }
                        else{
                                $dospider = 0;
                        }
                }
                if($field == "Disallow"){
                        $robdis = trim(substr($curele,$thecolon+1));
                        echo "$robdis\n";
                        $roblen = strlen($robdis);
                        // An empty Disallow value means nothing is disallowed
                        if($roblen > 0 && $dospider == 1 && substr($bpath,0,$roblen) == $robdis){
                                return TRUE;
                        }
                }
        }
        return FALSE;
}


?>
-- 
Edit bug report at http://bugs.php.net/?id=22820&edit=1
