On Apr 30, 2007, at 9:50 PM, Gonzalo Servat wrote:
> On 5/1/07, John David Anderson (_psychic_) <[EMAIL PROTECTED]>
> wrote:
>
> On Apr 30, 2007, at 9:19 PM, Gonzalo Servat wrote:
>
> I created a search engine using a few classes from the Zend
> "Framework." They've got a nice port of the guts of Lucene, and it's
> pretty easy to create your own search component.
>
> My content is almost completely in static view templates, so I
> created a script that uses wget to pull down the content, and some
> ZF classes to plug it into the index.
>
> Thanks for your reply John. Would you be able to provide more info
> on this? I'd be interested to know what logic you used to write the
> script that wget's the content, and if you have the ZF classes
> handy, that would rock too :)
Want me to deliver some dinner too?
:)
Here's a censored copy of my crawler script (/app/webroot/crawl.php).
This is a copy of the app/webroot/index.php file that I modified to
run as a script. It's really easy to make cron scripts this way - the
index.php file loads up the cake core, so using it as a template
works nicely. I plan to run it daily using cron/launchd on the
production machine.
After that is a copy of my search component
(/app/controllers/components/search.php). Both files assume that you
have some Zend libs in a vendors directory (/vendors/zend/Zend and
/vendors/zend/Zend.php is how I have it set up). I don't need to
provide those: they're freely available from Zend's website. Just make
sure you wash your hands after handling.
The normal disclaimers apply: This is a first pass at this code,
and it hasn't really been tested much. If you have suggestions or
questions, feel free to send me gifts and/or bribes. I hope it helps
you rather than deletes the contents of your disk and spreads your
personal information on the Internet, but you'll have to assume some
risks on using this code, as I can't really guarantee it yet. :)
Happy baking,
-- John
<?php
$start = microtime(true);
/**
* Do not change
*/
if (!defined('DS')) {
define('DS', DIRECTORY_SEPARATOR);
}
/**
* These defines should only be edited if you have cake installed in
* a directory layout other than the way it is distributed.
* Each define has a commented line of code that explains what you
* would change.
*/
if (!defined('ROOT')) {
//define('ROOT', 'FULL PATH TO DIRECTORY WHERE APP DIRECTORY IS LOCATED. DO NOT ADD A TRAILING DIRECTORY SEPARATOR');
//You should also use the DS define to separate your directories
define('ROOT', dirname(dirname(dirname(__FILE__))));
}
if (!defined('APP_DIR')) {
//define('APP_DIR', 'DIRECTORY NAME OF APPLICATION');
define('APP_DIR', basename(dirname(dirname(__FILE__))));
}
/**
* This only needs to be changed if the cake installed libs are located
* outside of the distributed directory structure.
*/
if (!defined('CAKE_CORE_INCLUDE_PATH')) {
//define ('CAKE_CORE_INCLUDE_PATH', 'FULL PATH TO DIRECTORY WHERE CAKE CORE IS INSTALLED. DO NOT ADD A TRAILING DIRECTORY SEPARATOR');
//You should also use the DS define to separate your directories
define('CAKE_CORE_INCLUDE_PATH', ROOT);
}
///////////////////////////////
//DO NOT EDIT BELOW THIS LINE//
///////////////////////////////
if (!defined('WEBROOT_DIR')) {
define('WEBROOT_DIR', basename(dirname(__FILE__)));
}
if (!defined('WWW_ROOT')) {
define('WWW_ROOT', dirname(__FILE__) . DS);
}
if (!defined('CORE_PATH')) {
if (function_exists('ini_set')) {
ini_set('include_path', CAKE_CORE_INCLUDE_PATH . PATH_SEPARATOR
    . ROOT . DS . APP_DIR . DS . PATH_SEPARATOR . ini_get('include_path'));
define('APP_PATH', null);
define('CORE_PATH', null);
} else {
define('APP_PATH', ROOT . DS . APP_DIR . DS);
define('CORE_PATH', CAKE_CORE_INCLUDE_PATH . DS);
}
}
if (!include(CORE_PATH . 'cake' . DS . 'bootstrap.php')) {
trigger_error("Can't find CakePHP core. Check the value of "
    . "CAKE_CORE_INCLUDE_PATH in app/webroot/index.php. It should point to "
    . "the directory containing your " . DS . "cake core directory and your "
    . DS . "vendors root directory.", E_USER_ERROR);
}
/*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
=-=-=-=-=-=-=-=-=-=-=-=-*/
//Add Zend libs to include path
$include = ini_get('include_path');
$new_include = $include . PATH_SEPARATOR . VENDORS . 'zend';
ini_set('include_path', $new_include);
//Include Zend_Search Classes
vendor('zend' . DS . 'Zend' . DS . 'Search' . DS . 'Lucene');
//Only allow this script to run via shell access
if(!isset($_SERVER['TERM']) && !isset($_SERVER['SHELL']))
{
die('Web Access Denied.');
}
$download_path = TMP . 'html';
$wget_log_path = TMP . 'wget.log';
$url = 'example.com';
$cmd = '/usr/bin/wget';
$args = '-rv --reject=gif,jpg,swf,css,xml --output-file=' .
$wget_log_path . ' http://' . $url;
$command = $cmd . ' ' . $args;
$index_path = TMP . 'index';
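//Refresh wget cache. rmdir() only removes empty directories, so clear
//out the previous mirror recursively before re-creating it.
shell_exec('rm -rf ' . escapeshellarg($download_path));
mkdir($download_path);
chdir($download_path);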
//Get a fresh mirror
shell_exec($command);
//Run through the wget log to glean URLs
$wget_results_array = explode("\n", file_get_contents($wget_log_path));
$urls = array();
foreach($wget_results_array as $line)
{
//Find lines that have URLs
if(preg_match('/^\-\-\d+:\d+:\d+\-\-/', $line) > 0)
{
//Remove the timestamp
$parts =
preg_split('/^\-\-\d+:\d+:\d+\-\-\s+http:\/\//', $line);
//Remove surrounding whitespace and the site base URL
$urls[] = str_replace($url, '', trim($parts[1]));
}
}
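//Re-create the Lucene search index. rmdir() would fail on the
//non-empty index directory, so remove the old one recursively.
shell_exec('rm -rf ' . escapeshellarg($index_path));
$index = Zend_Search_Lucene::create($index_path);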
//Add each document to the new index
foreach($urls as $path)
{
$link = $path;
//wget saves directory indexes as .html files...
if(substr($link, -1, 1) == '/')
{
$path = $link . 'index.html';
}
$doc_content = file_get_contents($download_path . DS . $url .
$path);
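//loadHTML() takes the markup itself (loadHTMLFile() expects a file name).
//The second argument asks for the body text to be stored in the index,
//so the search component can read it back.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($doc_content, true);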
$doc->addField(Zend_Search_Lucene_Field::Text('url', $link));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
$doc_content));
$index->addDocument($doc);
//echo "Document added. URL: $link CONTENT: " . strlen($doc_content) . " chars\n";
}
$index->optimize();
$doc_size = $index->count();
$elapsed = number_format(microtime(true) - $start, 2);
echo "Crawl complete. Indexed $doc_size documents in $elapsed seconds.\n";
?>
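The log-parsing loop in crawl.php can be pulled out into a function that is easy to try on its own. What follows is only a sketch, assuming the older wget log format in which each fetched URL is announced on a line such as "--21:50:01--  http://example.com/about/". The function name extract_paths_from_wget_log and the sample log are made up for illustration. Unlike the loop above, it anchors the host name in the pattern rather than using str_replace(), and it drops duplicate paths. Newer wget versions that print a date before the time would need a different pattern.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): pull site-relative
 * paths out of a wget log written in the "--HH:MM:SS--  http://host/path" style.
 */
function extract_paths_from_wget_log($log, $host)
{
    $paths = array();
    // Timestamp marker, whitespace, then the URL on the given host
    $pattern = '/^--\d+:\d+:\d+--\s+http:\/\/' . preg_quote($host, '/') . '(\S*)/';
    foreach (explode("\n", $log) as $line) {
        if (preg_match($pattern, $line, $m)) {
            // A bare host with no path is treated as the site root
            $path = ($m[1] === '') ? '/' : $m[1];
            // Use the path as a key so repeated URLs are only kept once
            $paths[$path] = true;
        }
    }
    return array_keys($paths);
}

// Made-up log excerpt for illustration
$sample = "--21:50:01--  http://example.com/\n"
        . "           => `example.com/index.html'\n"
        . "--21:50:02--  http://example.com/about/\n"
        . "--21:50:03--  http://example.com/\n";

print_r(extract_paths_from_wget_log($sample, 'example.com'));
```

The returned paths can be fed to the indexing loop in place of $urls.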
<?php
class SearchComponent extends Object
{
var $controller = null;
var $index = null;
function startup(&$controller)
{
//Add Zend libs to include path
$include = ini_get('include_path');
$new_include = $include . PATH_SEPARATOR . VENDORS . 'zend/';
ini_set('include_path', $new_include);
//Include Zend_Search Classes
require_once('Zend' . DS . 'Search' . DS . 'Lucene.php');
//Must point at the index the crawler builds (TMP . 'index' in crawl.php)
$index_path = TMP . 'index';
$this->controller = $controller;
//Construct the index object
$this->index = Zend_Search_Lucene::open($index_path);
}
function execute($query)
{
//Perform a basic query
$hits = $this->index->find($query);
//For each hit, retrieve the originating URL
foreach($hits as $hit)
{
$doc = $hit->getDocument();
$hit->url = $doc->getFieldValue('url');
$hit->title = $doc->getFieldValue('title');
$hit->body = $doc->getFieldValue('body');
}
return $hits;
}
}
?>
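Both files append the Zend directory to the include path by hand. Here is a small sketch of doing that with PATH_SEPARATOR, so it also works where the separator is not ':', and without adding the same directory twice. The function name add_to_include_path and the /vendors/zend path below are placeholders for illustration, not part of the original component.

```php
<?php
/**
 * Hypothetical helper: append a directory to PHP's include path once,
 * using the platform's PATH_SEPARATOR rather than a hard-coded ':'.
 * Returns the resulting include path.
 */
function add_to_include_path($dir)
{
    // Normalize so 'zend' and 'zend/' count as the same entry
    $dir = rtrim($dir, '/\\');
    $current = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($dir, $current, true)) {
        $current[] = $dir;
        set_include_path(implode(PATH_SEPARATOR, $current));
    }
    return get_include_path();
}

// Placeholder path; in the component this would be VENDORS . 'zend'
add_to_include_path('/vendors/zend/');
add_to_include_path('/vendors/zend');   // already present, so nothing changes
echo get_include_path(), "\n";
```

In the component's startup() this would replace the three ini_get/ini_set lines.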