On Apr 30, 2007, at 9:50 PM, Gonzalo Servat wrote:

> On 5/1/07, John David Anderson (_psychic_) <[EMAIL PROTECTED]>  
> wrote:
>
> On Apr 30, 2007, at 9:19 PM, Gonzalo Servat wrote:
>
> I created a search engine using a few classes from the Zend  
> "Framework." They've got a nice port of the guts of Lucene, and it's  
> pretty easy to create your own search component.
>
> My content is almost completely in static view templates, so I  
> created a script that uses wget to pull down the content, and some  
> ZF classes to plug it into the index.
>
> Thanks for your reply John. Would you be able to provide more info  
> on this? I'd be interested to know what logic you used to write the  
> script that wget's the content, and if you have the ZF classes  
> handy, that would rock too :)

Want me to deliver some dinner too?

:)

Here's a censored copy of my crawler script (/app/webroot/crawl.php).  
This is a copy of the app/webroot/index.php file that I modified to  
run as a script. It's really easy to make cron scripts this way: the  
index.php file loads up the cake core, so using it as a template  
works nicely. I plan to run it daily using cron/launchd on the  
production machine.

After that is a copy of my search component  
(/app/controllers/components/search.php). Both files assume that you  
have some Zend libs in a vendors directory (/vendors/zend/Zend and  
/vendors/zend/Zend.php is how I have it set up). I don't need to  
provide those: they're freely available from Zend's website. Just  
make sure you wash your hands after handling.

The normal disclaimers apply: this is a first pass at this code,  
and it hasn't really been tested much. If you have suggestions or  
questions, feel free to send me gifts and/or bribes. I hope it helps  
you rather than deleting the contents of your disk and spreading your  
personal information on the Internet, but you'll have to assume some  
risk in using this code, as I can't really guarantee it yet. :)

Happy baking,

-- John


<?php

$start = microtime(true);

/**
* Do not change
*/
        if (!defined('DS')) {
                 define('DS', DIRECTORY_SEPARATOR);
        }
/**
* These defines should only be edited if you have cake installed in
* a directory layout other than the way it is distributed.
* Each define has a commented line of code that explains what you would change.
*/
        if (!defined('ROOT')) {
                 //define('ROOT', 'FULL PATH TO DIRECTORY WHERE APP DIRECTORY IS LOCATED. DO NOT ADD A TRAILING DIRECTORY SEPARATOR');
                 //You should also use the DS define to separate your directories
                 define('ROOT', dirname(dirname(dirname(__FILE__))));
        }
        if (!defined('APP_DIR')) {
                 //define('APP_DIR', 'DIRECTORY NAME OF APPLICATION');
                 define('APP_DIR', basename(dirname(dirname(__FILE__))));
        }
/**
* This only needs to be changed if the cake installed libs are located
* outside of the distributed directory structure.
*/
        if (!defined('CAKE_CORE_INCLUDE_PATH')) {
                 //define ('CAKE_CORE_INCLUDE_PATH', 'FULL PATH TO DIRECTORY WHERE CAKE CORE IS INSTALLED. DO NOT ADD A TRAILING DIRECTORY SEPARATOR');
                 //You should also use the DS define to separate your directories
                 define('CAKE_CORE_INCLUDE_PATH', ROOT);
        }
///////////////////////////////
//DO NOT EDIT BELOW THIS LINE//
///////////////////////////////
        if (!defined('WEBROOT_DIR')) {
                 define('WEBROOT_DIR', basename(dirname(__FILE__)));
        }
        if (!defined('WWW_ROOT')) {
                 define('WWW_ROOT', dirname(__FILE__) . DS);
        }
        if (!defined('CORE_PATH')) {
                 if (function_exists('ini_set')) {
                          ini_set('include_path', CAKE_CORE_INCLUDE_PATH . PATH_SEPARATOR . ROOT . DS . APP_DIR . DS . PATH_SEPARATOR . ini_get('include_path'));
                          define('APP_PATH', null);
                          define('CORE_PATH', null);
                 } else {
                          define('APP_PATH', ROOT . DS . APP_DIR . DS);
                          define('CORE_PATH', CAKE_CORE_INCLUDE_PATH . DS);
                 }
        }
        if (!include(CORE_PATH . 'cake' . DS . 'bootstrap.php')) {
                trigger_error("Can't find CakePHP core.  Check the value of CAKE_CORE_INCLUDE_PATH in app/webroot/index.php.  It should point to the directory containing your " . DS . "cake core directory and your " . DS . "vendors root directory." , E_USER_ERROR);
        }
        
        /*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- 
=-=-=-=-=-=-=-=-=-=-=-=-*/
        
        //Add Zend libs to include path
        $include = ini_get('include_path');
        $new_include = $include . PATH_SEPARATOR . VENDORS . 'zend';
        ini_set('include_path', $new_include);
        
        //Include Zend_Search Classes
        vendor('zend' . DS . 'Zend' . DS . 'Search' . DS . 'Lucene');
        
        //Only allow this script to run via shell access
        if(!isset($_SERVER['TERM']) && !isset($_SERVER['SHELL']))
        {
                die('Web Access Denied.');
        }
        
        $download_path  = TMP . 'html';
        $wget_log_path  = TMP . 'wget.log';
        $url                    = 'example.com';
        $cmd                    = '/usr/bin/wget';
        $args                   = '-rv --reject=gif,jpg,swf,css,xml --output-file=' . $wget_log_path . ' http://' . $url;
        $command                = $cmd . ' ' . $args;
        $index_path             = TMP . 'index';
        
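        //Refresh wget cache (rmdir() only removes empty directories,
        //so clear the old mirror with rm -rf instead)
        shell_exec('rm -rf ' . escapeshellarg($download_path));
        mkdir($download_path);
        chdir($download_path);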
        
        //Get a fresh mirror
        shell_exec($command);   

        //Run through the wget log to glean URLs
        $wget_results_array = explode("\n", file_get_contents($wget_log_path));
        $urls = array();
        
        foreach($wget_results_array as $line)
        {
                //Find lines that have URLs
                if(preg_match('/^\-\-\d+:\d+:\d+\-\-/', $line) > 0)
                {
                        //Remove the timestamp
                        $parts = preg_split('/^\-\-\d+:\d+:\d+\-\-\s+http:\/\//', $line);
                        
                        //Remove surrounding whitespace and the site base URL
                        $urls[] = str_replace($url, '', trim($parts[1]));
                }
        }
        
        //Re-create the Lucene search index
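        //rmdir() fails on a non-empty directory, so remove the old index recursively
        shell_exec('rm -rf ' . escapeshellarg($index_path));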
        $index = Zend_Search_Lucene::create($index_path);

        //Add each document to the new index
        foreach($urls as $path)
        {
                $link = $path;
                
                //wget saves directory indexes as .html files...
                if(substr($link, -1, 1) == '/')
                {
                        $path = $link . 'index.html';
                }
                
                $doc_content = file_get_contents($download_path . DS . $url . $path);
                //loadHTML() takes the markup itself; loadHTMLFile() expects a file name
                $doc = Zend_Search_Lucene_Document_Html::loadHTML($doc_content);
                $doc->addField(Zend_Search_Lucene_Field::Text('url', $link));
                $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $doc_content));
                
                $index->addDocument($doc);
                
                //echo "Document added. URL: $link CONTENT: " . strlen($doc_content) . " chars\n";
        }
        
        $index->optimize();
        
        $doc_size = $index->count();
        $elapsed = number_format(microtime(true) - $start, 2);

        echo "Crawl complete. Indexed $doc_size documents in $elapsed seconds.\n";
                
?>
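
If you want to try the two string-handling steps of the crawler without running wget or loading Zend, here is a minimal standalone sketch. The helper names (extract_paths, local_file_for) and the sample log are made up for illustration; they are not part of the script above. It assumes wget writes URL lines of the form "--HH:MM:SS--  http://host/path", which is what the regex in the crawler expects, and it uses a plain '/' separator rather than Cake's DS.

```php
<?php
// Hypothetical helper: pull site-relative paths out of a wget log.
// Assumes URL lines look like "--21:50:12--  http://example.com/about/".
function extract_paths($log_text, $host)
{
    $paths = array();
    // Timestamp prefix, then the host, then capture the rest of the URL
    $pattern = '/^--\d+:\d+:\d+--\s+http:\/\/' . preg_quote($host, '/') . '(\S*)/';
    foreach (explode("\n", $log_text) as $line) {
        if (preg_match($pattern, $line, $m)) {
            $paths[] = $m[1];
        }
    }
    return $paths;
}

// Hypothetical helper: map a crawled path to the file wget saved for it.
// wget -r mirrors into a directory named after the host and saves
// directory indexes as index.html.
function local_file_for($download_path, $host, $path)
{
    if (substr($path, -1) == '/') {
        $path .= 'index.html';
    }
    return $download_path . '/' . $host . $path;
}

// Made-up sample log, for illustration only
$sample = "--21:50:12--  http://example.com/\n"
        . "           => `example.com/index.html'\n"
        . "--21:50:13--  http://example.com/about/team.html\n";

foreach (extract_paths($sample, 'example.com') as $path) {
    echo $path . ' -> ' . local_file_for('/tmp/html', 'example.com', $path) . "\n";
}
```

Unlike the str_replace() call in the crawler, this version only strips the host when it appears right after "http://", so a path that happens to contain the host name is left alone.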






<?php

class SearchComponent extends Object
{
        var $controller         = null;
        var $index                      = null;

        function startup(&$controller)
        {
                //Add Zend libs to include path
                $include = ini_get('include_path');
                $new_include = $include . PATH_SEPARATOR . VENDORS . 'zend/';
                ini_set('include_path', $new_include);
                
                //Include Zend_Search Classes
                require_once('Zend' . DS . 'Search' . DS . 'Lucene.php');
        
                //Must point at the same index directory the crawler writes to
                $index_path = TMP . 'index';
                $this->controller = $controller;
                
                //Construct the index object    
                $this->index = Zend_Search_Lucene::open($index_path);
        }
        
        function execute($query)
        {
                //Perform a basic query
                $hits = $this->index->find($query);
                
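                //For each hit, retrieve the originating URL
                //(only fields that were stored at index time can be read back,
                //so 'body' must have been stored for the last lookup to work)
                foreach($hits as $hit)
                {
                        $doc = $hit->getDocument();
                        $hit->url = $doc->getFieldValue('url');
                        $hit->title = $doc->getFieldValue('title');
                        $hit->body = $doc->getFieldValue('body');
                }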
                
                return $hits;
        }
}

?>


