ID:               22820
 User updated by:  nick at axelis dot com
 Reported By:      nick at axelis dot com
-Status:           Feedback
+Status:           Open
 Bug Type:         Reproducible crash
 Operating System: Windows 2000 sp3
 PHP Version:      4.3.1
 New Comment:

Ok. I got the latest snapshot and applied it. The results were not
what I expected. With the new snapshot I can't use the SAPI module for
Apache 2; Apache won't load with it. I've now got it configured to
use the CGI, and that works. The problem, however, still remains; there
is no change.


Previous Comments:
------------------------------------------------------------------------

[2003-03-22 04:37:31] [EMAIL PROTECTED]

Please try using this CVS snapshot:

  http://snaps.php.net/php4-STABLE-latest.tar.gz
 
For Windows:
 
  http://snaps.php.net/win32/php4-win32-STABLE-latest.zip



------------------------------------------------------------------------

[2003-03-21 23:35:09] nick at axelis dot com

I've tried running this in a browser and ended up with a "document
contains no data" error. The script is intended to run from the command
prompt. I'm running it in two environments. One is Red Hat 8.0, PHP
4.2.2, Apache 2.0.40. The other is Win2k SP3, PHP 4.3.1, Apache 2.0.44.
On the Linux box it runs like a champ. It's fast, it's furious. On
Windows it starts out fine, but then at a certain point it just starts
hammering the hard drive and leaves me at a command prompt. It doesn't
seem to happen at a specific place in the script. It seems more like a
memory allocation problem. It does not return any errors. I've found
nothing in any of the system logs, Apache log, PHP error log, nothing.
I did once get an error that said: "erealloc(), failed to allocate 11
bytes." This only happened once though; all of the other times it
just dies. The script is a search engine spider. If I run it on a site
with 20 or 30 pages to index it works great. If I hit a site that's
bigger, it dies, but in a different place depending on the site. I've
tested on at least 10 different sites with over 200 pages. The timing
is consistent within a particular site: it always dies at the same
place. I've done enough testing to ensure that the sites themselves
are not the problem. Here's the script below:

<?php
require('../includes/config.inc');
global $robots, $keywords, $description, $title, $body, $url,
$spiderday;
set_time_limit(0);

echo "##### The Spider is Running, Do Not Close This Console #####\n\n";

// Start the big loop
do {

// Open the database and start looking at URLs
$sql = mysql_query("SELECT * FROM search WHERE flag=0");
while($rslt = mysql_fetch_array($sql)){
        $flag = $rslt["flag"];
        $url = $rslt["url"];
        $crc = $rslt["checksum"];
        $date = $rslt["date"];

// Don't make them wait
        echo "\n\nWorking . . .\n";

// Don't go there if you don't have to
        if($flag == 1){
                continue;
        }

// Set the user agent to be sent
        ini_set('user_agent',$spiderhost);

// Open URL for parsing
        $open = @fopen("$url", "r");
        if($open){
                $read = fread($open, 100000);
                fclose($open);
        }
        else{
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
        }

// Set date and checksum info
        $today = date("Y-m-d");
        $checksum = crc32($read);
        $chkyr = strftime(date("Y"));
        $chkmo = strftime(date("m"));
        $chkdy = strftime(date("d"));
        $chkdy = $chkdy - $spiderday;
        $daycheck = strftime("%Y-%m-%d", mktime(0,0,0,$chkmo,$chkdy,$chkyr));

// Get meta tags and use get_meta_tags to check if the file is actually
// there
        $meta = @get_meta_tags($url);
        if(!$meta){
                $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                continue;
        }
        $robots = $meta["robots"];
        $keywords = $meta["keywords"];
        $description = $meta["description"];

// Check robots meta tags
        $metarobots = "noindex";
        if(checkmetarobots($metarobots)){
                echo "Indexing disallowed by robots meta tag: $url\n";
                continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
                echo "Indexing disallowed by robots meta tag: $url\n";
                continue;
        }


// Get the page title
        $temp = spliti("title>",$read,3);
        $title = substr($temp[1],0,-2);

// Get the page body
        $body = str_replace("'","`",trim(strip_tags($read)));

// Make an announcement
        echo "Now Processing: $url\n";

// "Put the stuff in the search database\n";
        if($crc != $checksum){
                echo "Updating for CRC: $title\n$url\n";
                $renew = @mysql_query("UPDATE search SET url='$url', title='$title',
metak='$keywords', metad='$description', mrobot='$robots',
checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE
url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url<br>" . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                        continue;               
                }
        }
        elseif($date <= $daycheck){
                echo "Updating for date: $title\n$url\n";
                $renew = @mysql_query("UPDATE search SET url='$url', title='$title',
metak='$keywords', metad='$description', mrobot='$robots',
checksum='$checksum', date=CURDATE(), flag=1, body='$body' WHERE
url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url<br>" . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                        continue;               
                }

        }
        else{
                $renew = @mysql_query("UPDATE search SET flag=1 WHERE url='$url'");
                if(!$renew){
                        echo "NOT UPDATED: $url" . mysql_error() . "\n";
                        $kill = mysql_query("DELETE FROM search WHERE url='$url'");
                }
                continue;
        }

// Check robots meta tags
        $metarobots = "nofollow";
        if(checkmetarobots($metarobots)){
                echo "Following disallowed by robots meta tag: $url\n";
                continue;
        }
        $metarobots = "none";
        if(checkmetarobots($metarobots)){
                echo "Following disallowed by robots meta tag: $url\n";
                continue;
        }

// "Parse the main URL\n";
        $top = parse_url($url);
        $tschm = $top["scheme"];
        $thost = $top["host"];
        $tpath = $top["path"];
        $tqury = $top["query"];
        $tfrag = $top["fragment"];

$currentdomain = $tschm . "://" . $thost;

// Parse all the links on the page
        $rtemp = stristr($read,"href"); 
        $temp = stristr($rtemp,">");
        while($rtemp){
        //"Parse the href out of the string\n";
                $rtemp = stristr($temp,"href"); 
                $lpos = strlen($rtemp) - strlen($temp);
                $temp = stristr($rtemp,">");
                $lend = strlen($rtemp) - strlen($temp);
                $alink = str_replace('"'," ",strip_tags(trim(substr($rtemp, 6,
($lend)))));
                $blink = stristr($alink," ");
                $alen = strlen($alink) - strlen($blink);
                $link = substr($alink, 0, $alen);

        // Kill any trailing slashes
                if(substr($link,(strlen($link)-1)) == "/"){
                        $link = substr($link,0,(strlen($link)-1));
                }

                if(checkforgarbage()){
                        continue;
                }

        // Parse the current link
                $bot = @parse_url($link);
                if(!$bot){
                        continue;
                }
                $bschm = $bot["scheme"];
                $bhost = $bot["host"];
                $bpath = $bot["path"];
                $bqury = $bot["query"];
                $bfrag = $bot["fragment"];

        // Execute robots exclusion standard via robots.txt
                if(checkrobotstxt()){
                        echo "Disallowed by robots.txt: $link\n";
                        continue;
                }

        // Kill off any fragment based URLs
                if(strlen($bfrag) > 0){
                        continue;
                }

        // Get rid of outside links
                if($bhost != "" && $bhost != $thost){
                        continue;
                }

        // Kill off any dot dots ../../ 
                $ddotcheck = substr_count($bpath,"../");
                if($ddotcheck != ""){
                        $lpos = strrpos($bpath,"..");
                        $bpath = substr($bpath,$lpos);
                }

        // Comparative analysis
                if($bpath != "" && substr($bpath,0,1) != "/"){
                        if(strrpos($tpath,".") === false){
                                $bpath = $tpath . "/" . $bpath;
                        }
                        if(strrpos($tpath,".")){
                                $ttmp = substr($tpath,0,(strrpos($tpath,"/")+1));
                                $bpath = $ttmp . $bpath;
                                if(substr($bpath,0,1) != "/"){
                                        $bpath = "/" . $bpath;
                                }
                        }
                }

        // Check to see if the scheme and domain are in the url
                if($bhost == ""){
                        $link = $tschm . "://" . $thost . $bpath;
                }

        // Kill any trailing slashes
                if(substr($link,(strlen($link)-1)) == "/"){
                        $link = substr($link,0,(strlen($link)-1));
                }

        // If there is a query string put it back on
                if($bqury != ""){
                        $link = $link . "?" . $bqury;
                }

        // Don't be overly recursive
                if($link == $currentdomain){
                        continue;
                }

        // If it's a useless link, kill it
                if($link == ""){
                        continue;
                }

                if(!checkandupdatetoindexer()){
                        continue;
                }
        }

// Take the new URLs and put them in the search database, or finish if
// there are no more
$movem = mysql_query("SELECT url FROM indexer");
while($mvrslt = mysql_fetch_array($movem)){
        $murl = $mvrslt["url"];
        $putem = mysql_query("INSERT INTO search SET url='$murl'");
}
$kill = mysql_query("DELETE FROM indexer");
}
$preloop = mysql_fetch_row(mysql_query("SELECT COUNT(checksum) AS count
FROM search WHERE checksum='0'"));
$loopcount = $preloop[0];
} while($loopcount > 0);

$done = mysql_query("UPDATE search SET flag=0 WHERE flag=1");

echo "\n\n##### The Spider is Finished, You Can Now Close This Console #####\n";


//////  Spider Functions   //////

function checkandupdatetoindexer(){
        global $link;
        // "Put the new URL in the search database\n";
                $chk = @mysql_query("SELECT url FROM search");
                while($curec = mysql_fetch_array($chk)){
                        $curchk = $curec["url"];
                        if($curchk == $link){
                                return FALSE;
                        }
                }
                echo "Adding: $link\n";
                $putup = mysql_query("INSERT INTO indexer SET url='$link'");
                return TRUE;
}

function checkforgarbage(){
                global $link;
                // "Get rid of any garbage and most binary files in the link\n";
                if(substr_count(strtolower($link),"&?") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"@") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"javascript") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mailto") != 0){
                        return TRUE;
                }
                
                if(substr_count(strtolower($link),"jpg") != 0){
                        return TRUE;
                }
                
                if(substr_count(strtolower($link),"gif") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"pdf") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"png") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mpg") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mpeg") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"avi") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"mp3") != 0){
                        return TRUE;
                }

                if(substr_count(strtolower($link),"wav") != 0){
                        return TRUE;
                }
                
                return FALSE;
}

function checkmetarobots(){
        global $robots, $metarobots;
        if(substr_count($robots,$metarobots) > 0){
                return TRUE;
        }
        return FALSE;
}

function checkrobotstxt(){
        global $currentdomain, $bpath, $spiderhost;

        $getbot = $currentdomain . "/robots.txt";
        $robotay = @file($getbot);
                if(!$robotay){
                        return FALSE;
                }
        $robotaycount = count($robotay);
        $dospider = 0;
        $roop = 0;
        while($roop < $robotaycount){
                $curele = trim($robotay[$roop]);
                // Advance the counter first so that "continue" cannot loop forever
                ++$roop;
                $thecolon = strpos($curele,":");
                if($curele == "" || $thecolon === false){
                        continue;
                }
                // substr() stops before the colon, so compare the bare field name
                $field = strtolower(trim(substr($curele,0,$thecolon)));
                if($field == "user-agent"){
                        $robgent = trim(substr($curele,$thecolon+1));
                        if($robgent == "*" || $robgent == $spiderhost){
                                $dospider = 1;
                        }
                        else{
                                $dospider = 0;
                        }
                }
                if($field == "disallow"){
                        $robdis = trim(substr($curele,$thecolon+1));
                        echo "$robdis\n";
                        $roblen = strlen($robdis);
                        // An empty Disallow value means nothing is disallowed
                        if($roblen > 0 && substr($bpath,0,$roblen) == $robdis && $dospider == 1){
                                return TRUE;
                        }
                }
        }
        return FALSE;
}


?>
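
For anyone trying to narrow this down, the robots.txt check can be
tested on its own, without the network or the database. The following
is only a minimal sketch, not the code from the script above: the
function name robots_disallows() and its parameters are hypothetical,
and it assumes the robots.txt lines have already been read into an
array (for example with file()). It is also simplified in that each
User-agent line starts a new group.

```php
<?php
// Hypothetical helper (not part of the original script): decides whether
// $path is disallowed for $agent, given robots.txt lines already in memory.
function robots_disallows($lines, $path, $agent)
{
    $applies = false; // true while inside a User-agent group that matches us
    foreach ($lines as $line) {
        $line = trim($line);
        $colon = strpos($line, ':');
        if ($line === '' || $line[0] == '#' || $colon === false) {
            continue; // blank line, comment, or not a "field: value" record
        }
        $field = strtolower(trim(substr($line, 0, $colon)));
        $value = trim(substr($line, $colon + 1));
        if ($field == 'user-agent') {
            $applies = ($value == '*' || $value == $agent);
        } elseif ($field == 'disallow' && $applies && $value !== '') {
            // A rule matches when the path starts with the rule text
            if (strncmp($path, $value, strlen($value)) == 0) {
                return true;
            }
        }
    }
    return false;
}

// Made-up sample data, for illustration only
$lines = array("User-agent: *\n", "Disallow: /private\n", "Disallow:\n");
var_dump(robots_disallows($lines, '/private/a.html', 'spider')); // bool(true)
var_dump(robots_disallows($lines, '/public/a.html', 'spider'));  // bool(false)
```

Because the loop is a foreach, a skipped line can never stall the
counter, and the field name is compared without its colon.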

------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=22820&edit=1
