I was disappointed when the Perl scripts in the contrib
directory were broken by changes in the database format.
Here's a first cut at a HtDig::Database module that attempts
to fill the gap, along with a couple of short scripts that
demonstrate how it can be used.
The way the module deals with URL encoding is still pretty
simple minded -- I haven't attempted to duplicate all the
logic in HtWordCodec.cc, but it should work for some of the
most common cases. It may be that a better long term
solution would be to interface to the code in HtWordCodec.cc
as an XSUB.
I understand that some of the HtDig module namespace has
already been staked out. Comments on the naming of this
module (or anything else) would be welcome.
--
Warren Jones
------------------------------ snip snip ------------------------------
#! /bin/sh
# This is a shell archive. Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file". To overwrite existing
# files, type "sh file -c". You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g.. If this archive is complete, you
# will see the following message at the end:
# "End of shell archive."
#
# Contents: Database.pm listconfig listdocdb
#
# Wrapped by wjones@addy on Tue Jan 18 16:26:45 2000
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'Database.pm' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'Database.pm'\"
else
echo shar: Extracting \"'Database.pm'\" \(9693 characters\)
sed "s/^X//" >'Database.pm' <<'END_OF_FILE'
Xpackage HtDig::Database;
X
X=head1 NAME
X
XHtdig::Database - Perl interface Ht://Dig docdb and config files
X
X=head1 SYNOPSIS
X
X use Htdig::Database;
X
X my $config = Htdig::Database::get_config( $config_file )
X or die "$0: Can't access $config_file\n";
X my $record = Htdig::Database::parse_docdb( $docdb_record );
X print "URL = $record->{URL}\n";
X
X=head1 DESCRIPTION
X
X=head2 Exported functions
X
XThe following functions are provided by Htdig::Database:
X
X get_config
X parse_docdb
X encode_url
X decode_url
X
XBy default, functions are not exported into the callers namespace,
Xand you must invoke them using the full package name, e.g.:
X
X Htdig::Database::getconfig( $config_file );
X
XTo import all available function names, invoke the module with:
X
X use Htdig::Database qw(:all);
X
X=head2 Parsing a config file
X
XC<get_config> parses a config file and returns a hash ref that
Xcontains the configuration attributes. For example:
X
X my $config = Htdig::Database::get_config( $config_file )
X or die "$0: Can't access $config_file\n";
X print "start_url = $config->{start_url}\n";
X
XAll values in the hash are scalars, and any items that are intended
Xto be lists or booleans must be parsed by the calling program.
XC<get_config> returns C<undef> if the config file can't be opened,
Xand carps about various syntax errors.
X
X=head2 Parsing a record from the document database
X
XC<parse_docdb> parses a record from the document database
Xand returns a hash ref. For example:
X
X my %docdb;
X tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
X die "$0: Unable to open $docdb: $!";
X
X while ( my ( $key, $value ) = each %docdb ) {
X next if $key =~ /^nextDocID/;
X my %rec = Htdig::Database:parse_docdb( $value );
X print " URL: $record->{URL}\n";
X print "HOPCOUNT: $record->{HOPCOUNT}\n";
X }
X
XURL's in the database are encoded using two attributes from the
Xconfiguration file: I<common_url_parts> and I<url_part_aliases>.
XC<parse_docdb> does only rudimentary decoding. It can't
Xhandle more than 25 elements in the I<common_url_parts> list,
Xand it currently can't handle I<url_part_aliases> at all.
X
XC<get_config> caches the value of I<common_url_parts> that's
Xused for decoding URL's, and should usually be called before
XC<parse_docdb>.
X
XCompressed data in the HEAD element will be automatically decompressed
Xif the Compress::Zlib module is available. If Compress::Zlib is not
Xinstalled, compressed data will be silently replaced by the string:
X
X "Compressed data: Zlib not available"
X
XIf only a single value is needed from the database record,
Xit can be specified as a second parameter to C<parse_docdb>,
Xwhich then returns the requested value as a scalar. For example:
X
X my $url = Htdig::Database:parse_docdb( $value, 'URL' );
X
X=head2 Encoding a URL
X
X my $encoded_url = Htdig::Database::encode_url( $url );
X
XThis may be useful for computing database keys, since the keys
Xare encoded URL's. C<get_config> should be called before C<encode_url>
Xor C<decode_url> to initialize the value of C<common_url_parts>.
X
X=head2 Decoding a URL
X
X my $url = Htdig::Database::decode_url( $encoded_url );
X
XThis should seldom be necessary, since URL's are normally
Xdecoded by C<parse_docdb>.
X
X=head1 AUTHOR
X
XWarren Jones E<lt>F<[EMAIL PROTECTED]>E<gt>
X
X=head1 BUGS
X
XOnly simple cases of URL encoding are handled correctly.
XNo more than 25 elements are allowed in I<common_url_parts>.
XThe value of I<url_part_aliases> is not used at all.
XSomeday this module may implement the same URL encoding
Xlogic found in F<HtWordCodec.cc>, but a better solution might
Xbe to provide an XSUB interface to the C++ functions.
X
X=cut
X
X# $Id: Database.pm,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/Database.pm,v $
X
Xrequire 5.000;
Xuse Carp;
Xuse Exporter;
Xuse strict;
Xuse vars qw( $VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS );
X
X# Load Compress::Zlib if possible, but it's not
X# an error if compression is not available.
X
Xif ( eval { require Compress::Zlib } ) {
X import Compress::Zlib;
X}
X
X# Constants used in URL encoding (see HtWordCodec.cc):
X
X$VERSION = 0.50;
X@ISA = qw( Exporter );
X@EXPORT = ();
X@EXPORT_OK = qw( get_config parse_docdb encode_url decode_url );
X%EXPORT_TAGS = ( 'all' => [ @EXPORT_OK ] );
X
X# These strings are used as hash keys, and correspond to
X# integer field codes in the docdb data structure.
X# The order is important:
X
Xmy @fields = qw(
X ID TIME ACCESSED STATE SIZE LINKS IMAGESIZE HOPCOUNT
X URL HEAD TITLE DESCRIPTIONS ANCHORS EMAIL NOTIFICATION
X SUBJECT STRING METADSC BACKLINKS SIG
X);
X
X# These are the string fields and the list fields.
X# All remaining fields are integers.
X
Xmy %string_fields = map { $_, 1 } qw(
X URL HEAD TITLE EMAIL NOTIFICATION SUBJECT STRING METADSC
X);
X
Xmy %list_fields = map { $_, 1 } qw( DESCRIPTIONS ANCHORS );
X
X
X# These variables are used by &encode_url and &decode_url:
X
Xuse constant FIRST_INTERNAL_SINGLECHAR => 7;
Xuse constant LAST_INTERNAL_SINGLECHAR => 31;
X
Xmy ( @url_parts, %url_parts, $url_parts );
Xmy @default_url_parts = qw(
X http:// http://www. ftp:// ftp://ftp. /pub/
X .html .gif .jpg .jpeg /index.html /index.htm
X .com/ .com mailto:
X);
Xmy $matchchars = sprintf '[\0%o-\0%o]', FIRST_INTERNAL_SINGLECHAR,
X LAST_INTERNAL_SINGLECHAR;
Xmy $maxparts = LAST_INTERNAL_SINGLECHAR -
X FIRST_INTERNAL_SINGLECHAR + 1;
Xmy $warning = '';
X
Xsub set_url_parts { # Setup for variables used by
X # &encode_url and &decode_url
X my $code = FIRST_INTERNAL_SINGLECHAR;
X %url_parts = map { $_, chr($code++) } @url_parts = @_;
X $url_parts = join '|', map { quotemeta($_) } @_;
X $warning = "Too many common_url_parts: can't handle more than $maxparts.\n"
X if $#url_parts > $maxparts;
X}
X
Xset_url_parts( @default_url_parts ); # Initialize with defaults.
X
Xsub getnum { # Extract integer from doc doc record.
X my ( $flags, $in ) = @_;
X my ( $fmt, $length ) = ( 'I', 4 );
X ( $fmt, $length ) = ( 'C', 1 ) if $flags & 0100;
X ( $fmt, $length ) = ( 'S', 2 ) if $flags & 0200;
X $_[1] = substr($in,$length+1);
X unpack($fmt,substr($in,1));
X}
X
Xsub getstring { # Extract string from doc record.
X my $length = getnum( @_ );
X my $string = substr( $_[1], 0, $length );
X $_[1] = substr( $_[1], $length );
X return $string;
X}
X
Xsub getlist { # Extract list from doc record.
X my ( $flags, $in ) = @_;
X my $count = getnum( $flags, $in );
X my @list = ();
X for ( my $i=0; $i<$count; $i++ ) {
X my $length = 255;
X if ( $flags ) {
X $length = unpack('C',$in);
X $in = substr($in,1);
X }
X if ( $length > 253 ) {
X $length = unpack('I',$in);
X $in = substr($in,4);
X }
X push @list, substr($in,0,$length);
X $in = substr($in,$length);
X }
X $_[1] = $in;
X return \@list;
X}
X
Xsub parse_docdb
X{
X my $record = shift;
X my %record = ();
X while ( length($record) > 0 ) {
X my $code = unpack('C', $record);
X my $flags = $code & 0300;
X $code &= 077;
X if ( $code > $#fields ) {
X carp "Invalid item code: $code";
X last;
X }
X my $field = $fields[$code];
X my $value;
X if ( $list_fields{$field} ) {
X $value = getlist($flags,$record);
X } elsif ( $string_fields{$field} ) {
X $value = getstring($flags,$record);
X $value = decode_url($value) if $field eq 'URL';
X if ( $field eq 'HEAD' && substr($value,0,2) eq "x\234" ) {
X if ( defined &inflateInit ) {
X my ( $i, $zstatus ) = inflateInit();
X ( $value, $zstatus ) = $i->inflate($value);
X }
X else {
X $value = 'Compressed data: Zlib not available';
X }
X }
X
X } else {
X $value = getnum($flags,$record);
X }
X return $value if $_[0] && $_[0] eq $field;
X $record{$field} = $value;
X }
X return $_[0] ? '' : %record;
X}
X
Xsub decode_url {
X local($_) = shift;
X if ( $warning ) {
X carp $warning;
X }
X else {
X s/$matchchars/$url_parts[ord($&)-&FIRST_INTERNAL_SINGLECHAR]/oeg;
X }
X $_;
X}
X
Xsub encode_url {
X local($_) = shift;
X if ( $warning ) {
X carp $warning;
X }
X else {
X s/($url_parts)/$url_parts{$&}/eg;
X }
X $_;
X}
X
Xsub get_config {
X#
X# The first argument is the name of an htdig config file.
X# The second parameter, if present, is a hash ref that is
X# to receive the config values. The second parameter is
X# used only for recursive calls in the case of included files.
X# A hash ref is returned if on success, or undef if the
X# file cannot be opened.
X#
X my $file = shift;
X my $config = shift || {};
X no strict 'refs';
X if ( ! open $file, $file ) {
X return undef;
X }
X while ( <$file> ) {
X $_ .= <$file> while s/\\\n/ / && ! eof($file);
X next if /^\s*(#|$)/;
X ( my $key, $_ ) = /^\s*(\w+)\s*:\s*(.*)/;
X if ( ! $key ) {
X carp "Syntax error in $file (line $.)";
X next;
X }
X s/\${(\w+)}/$config->{$1} || ''/ge; # variable substitution
X s/`([^`]+)`/file_contents($1,$file)/ge; # file substitution
X if ( $key eq 'include' ) {
X $_ = "$1/$_" if ! m|^/| && $file =~ m|(.*)/|;
X get_config( $_, $config );
X }
X else {
X $config->{$key} = $_;
X }
X }
X close $file;
X use strict 'refs';
X my $parts = $config->{common_url_parts};
X set_url_parts( defined($parts) ? split( ' ', $parts ) :
X @default_url_parts );
X $warning .= "URL translation can't handle url_part_aliases.\n"
X if $config->{url_part_aliases};
X return $config;
X}
X
Xsub file_contents {
X#
X# Return the contents of a file, with newlines
X# replaced by a single space.
X#
X if ( ! open FILE, $_[0] ) {
X carp "Can't access included file $_[0] at $_[1] line $.";
X return '';
X }
X my @file_contents = map { chomp, $_ } <FILE>;
X close FILE;
X return "@file_contents";
X}
X
X1;
END_OF_FILE
if test 9693 -ne `wc -c <'Database.pm'`; then
echo shar: \"'Database.pm'\" unpacked with wrong size!
fi
# end of 'Database.pm'
fi
if test -f 'listconfig' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'listconfig'\"
else
echo shar: Extracting \"'listconfig'\" \(631 characters\)
sed "s/^X//" >'listconfig' <<'END_OF_FILE'
X#! /usr/local/bin/perl -w
X#
X# listconfig -- list the contents of an ht://Dig configuration file
X#
X# Usage: listconfig config_file
X#
X# This primarily intended as a test and demo of the get_config
X# function in the HtDig::Database module.
X#
X# $Id: listconfig,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listconfig,v $
X
Xuse HtDig::Database qw(:all);
Xuse strict;
X
Xmy $config_file = shift
X or die "usage: $0 config_file\n";
X
Xmy $config = get_config( $config_file )
X or die "$0: can't access $config_file\n";
X
Xfor ( sort keys %$config ) {
X print "$_:\t$config->{$_}\n";
X}
X
END_OF_FILE
if test 631 -ne `wc -c <'listconfig'`; then
echo shar: \"'listconfig'\" unpacked with wrong size!
fi
chmod +x 'listconfig'
# end of 'listconfig'
fi
if test -f 'listdocdb' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'listdocdb'\"
else
echo shar: Extracting \"'listdocdb'\" \(2207 characters\)
sed "s/^X//" >'listdocdb' <<'END_OF_FILE'
X#! /usr/local/bin/perl -w
X#
X# listdocdb -- List ht://Dig document database
X#
X# Usage: listdocdb [-v [-v]] config_file [ docdb_file ]
X#
X# By default, only the URL's are listed from the database.
X# With a single "-v" flag, other fields are also listed,
X# including the first 60 characters of the HEAD excerpt.
X# With two "-v" flags, the full text of the HEAD excerpt
X# is shown.
X#
X# If the docdb file isn't specified on the command line,
X# the path will be found in the config file, or if it
X# can't be found there, a guess will be made based on
X# the path to the config file.
X#
X# $Id: listdocdb,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listdocdb,v $
X
Xuse DB_File;
Xuse HtDig::Database qw(:all);
Xuse strict;
X
Xmy $verbose = 0;
X
Xwhile ( @ARGV && $ARGV[0] eq '-v' ) {
X $verbose++;
X shift;
X}
X
Xmy ( $config_file, $docdb ) = @ARGV;
X
Xdie "Usage: $0 [-v] config_file [docdb_file]\n" if ! $config_file;
X
Xmy $config = get_config( $config_file ) or
X die "$0: Can't access $config_file\n";
X
Xif ( ! $docdb ) {
X
X # If database file isn't specified on the command line,
X # get if from the config file, or guess based on path
X # to config file.
X
X my $database_base = $config->{database_base};
X if ( ! $database_base ) {
X my $database_dir = $config->{database_dir};
X if ( ! $database_dir ) {
X my ( $config_dir ) = ( $config_file =~ m|(.*)/| );
X $database_dir = $config_dir ? "$config_dir/../db" : "../db";
X }
X $database_base = "$database_dir/db";
X }
X $docdb = "$database_base.docdb";
X}
X
Xmy %docdb;
Xtie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
X die "$0: Unable to open $docdb: $!";
X
Xwhile ( my ( $key, $value ) = each %docdb ) {
X next if $key =~ /^nextDocID/;
X if ( ! $verbose ) {
X print decode_url( $key ), "\n";
X }
X else {
X my %rec = parse_docdb( $value );
X for ( sort keys %rec ) {
X my $field = $rec{$_};
X $field = join( "\n\t\t", @$field ) if ref($field) eq 'ARRAY';
X $field = localtime( $field ) if /^(TIME|ACCESSED)$/;
X $field = substr( $field, 0, 60 ) if /^HEAD$/ && $verbose < 2;
X printf "%13s: %s\n", $_, $field;
X }
X print '='x60, "\n";
X }
X}
X
END_OF_FILE
if test 2207 -ne `wc -c <'listdocdb'`; then
echo shar: \"'listdocdb'\" unpacked with wrong size!
fi
chmod +x 'listdocdb'
# end of 'listdocdb'
fi
echo shar: End of shell archive.
exit 0
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.