I was disappointed when the Perl scripts in the contrib
directory were broken by changes in the database format.
Here's a first cut at an HtDig::Database module that attempts
to fill the gap, along with a couple of short scripts that
demonstrate how it can be used.

The way the module deals with URL encoding is still pretty
simple-minded -- I haven't attempted to duplicate all the
logic in HtWordCodec.cc, but it should work for some of the
most common cases.  It may be that a better long term
solution would be to interface to the code in HtWordCodec.cc
as an XSUB.
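For anyone curious, the scheme I'm approximating looks roughly
like this: each entry in common_url_parts is mapped to a single
control character in the range 7..31, which is where the 25-part
limit comes from.  A standalone sketch (the part list below is
invented for illustration; real values come from the config file):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical part list -- a real one comes from common_url_parts.
my @parts = qw( http:// .html .com/ );
my $code  = 7;                                   # FIRST_INTERNAL_SINGLECHAR
my %enc   = map { $_, chr($code++) } @parts;     # part -> control char
my $match = join '|', map { quotemeta } @parts;

sub encode_url {                 # replace each common part with its code
    my $url = shift;
    $url =~ s/($match)/$enc{$1}/g;
    return $url;
}

sub decode_url {                 # map control chars back to parts
    my $url = shift;
    my %dec = reverse %enc;
    $url =~ s/([\x07-\x1f])/$dec{$1}/g;
    return $url;
}

my $url = 'http://www.example.com/index.html';
print decode_url( encode_url($url) ) eq $url
    ? "round trip ok\n" : "round trip failed\n";
```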

I understand that some of the HtDig module namespace has
already been staked out.  Comments on the naming of this
module (or anything else) would be welcome.

--
Warren Jones

------------------------------ snip snip ------------------------------
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#               "End of shell archive."
#
# Contents:  Database.pm listconfig listdocdb
#
# Wrapped by wjones@addy on Tue Jan 18 16:26:45 2000
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'Database.pm' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'Database.pm'\"
else
echo shar: Extracting \"'Database.pm'\" \(9693 characters\)
sed "s/^X//" >'Database.pm' <<'END_OF_FILE'
Xpackage HtDig::Database;
X
X=head1 NAME
X
XHtDig::Database - Perl interface to ht://Dig docdb and config files
X
X=head1 SYNOPSIS
X
X    use HtDig::Database;
X
X    my $config = HtDig::Database::get_config( $config_file )
X           or die "$0: Can't access $config_file\n";
X    my %record = HtDig::Database::parse_docdb( $docdb_record );
X    print "URL = $record{URL}\n";
X
X=head1 DESCRIPTION
X
X=head2 Exported functions
X
XThe following functions are provided by HtDig::Database:
X
X    get_config
X    parse_docdb
X    encode_url
X    decode_url
X
XBy default, functions are not exported into the caller's namespace,
Xand you must invoke them using the full package name, e.g.:
X
X    HtDig::Database::get_config( $config_file );
X
XTo import all available function names, invoke the module with:
X
X    use HtDig::Database qw(:all);
X
X=head2 Parsing a config file
X
XC<get_config> parses a config file and returns a hash ref that
Xcontains the configuration attributes.  For example:
X
X    my $config = HtDig::Database::get_config( $config_file )
X       or die "$0: Can't access $config_file\n";
X    print "start_url = $config->{start_url}\n";
X
XAll values in the hash are scalars, and any items that are intended
Xto be lists or booleans must be parsed by the calling program.
XC<get_config> returns C<undef> if the config file can't be opened,
Xand carps about various syntax errors.
X
X=head2 Parsing a record from the document database
X
XC<parse_docdb> parses a record from the document database
Xand returns a hash of field values.  For example:
X
X    my %docdb;
X    tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
X       die "$0: Unable to open $docdb: $!";
X
X    while ( my ( $key, $value ) = each %docdb ) {
X       next if $key =~ /^nextDocID/;
X       my %rec = HtDig::Database::parse_docdb( $value );
X       print "     URL: $rec{URL}\n";
X       print "HOPCOUNT: $rec{HOPCOUNT}\n";
X    }
X
XURLs in the database are encoded using two attributes from the
Xconfiguration file: I<common_url_parts> and I<url_part_aliases>.
XC<parse_docdb> does only rudimentary decoding.  It can't
Xhandle more than 25 elements in the I<common_url_parts> list,
Xand it currently can't handle I<url_part_aliases> at all.
X
XC<get_config> caches the value of I<common_url_parts> that's
Xused for decoding URLs, and should usually be called before
XC<parse_docdb>.
X
XCompressed data in the HEAD element will be automatically decompressed
Xif the Compress::Zlib module is available.  If Compress::Zlib is not
Xinstalled, compressed data will be silently replaced by the string:
X
X    "Compressed data: Zlib not available"
X
XIf only a single value is needed from the database record,
Xit can be specified as a second parameter to C<parse_docdb>,
Xwhich then returns the requested value as a scalar.  For example:
X
X    my $url = HtDig::Database::parse_docdb( $value, 'URL' );
X
X=head2 Encoding a URL
X
X    my $encoded_url = HtDig::Database::encode_url( $url );
X
XThis may be useful for computing database keys, since the keys
Xare encoded URLs.  C<get_config> should be called before C<encode_url>
Xor C<decode_url> to initialize the value of C<common_url_parts>.
X
X=head2 Decoding a URL
X
X    my $url = HtDig::Database::decode_url( $encoded_url );
X
XThis should seldom be necessary, since URLs are normally
Xdecoded by C<parse_docdb>.
X
X=head1 AUTHOR
X
XWarren Jones E<lt>F<[EMAIL PROTECTED]>E<gt>
X
X=head1 BUGS
X
XOnly simple cases of URL encoding are handled correctly.
XNo more than 25 elements are allowed in I<common_url_parts>.
XThe value of I<url_part_aliases> is not used at all.
XSomeday this module may implement the same URL encoding
Xlogic found in F<HtWordCodec.cc>, but a better solution might
Xbe to provide an XSUB interface to the C++ functions.
X
X=cut
X
X# $Id: Database.pm,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/Database.pm,v $
X
Xrequire 5.000;
Xuse Carp;
Xuse Exporter;
Xuse strict;
Xuse vars qw( $VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS );
X
X# Load Compress::Zlib if possible, but it's not
X# an error if compression is not available.
X
Xif ( eval { require Compress::Zlib } ) {
X    import Compress::Zlib;
X}
X
X# Constants used in URL encoding (see HtWordCodec.cc):
X
X$VERSION     = 0.50;
X@ISA         = qw( Exporter );
X@EXPORT      = ();
X@EXPORT_OK   = qw( get_config parse_docdb encode_url decode_url );
X%EXPORT_TAGS = ( 'all' => [ @EXPORT_OK ] );
X
X# These strings are used as hash keys, and correspond to
X# integer field codes in the docdb data structure.
X# The order is important:
X
Xmy @fields = qw(
X       ID TIME ACCESSED STATE SIZE LINKS IMAGESIZE HOPCOUNT
X       URL HEAD TITLE DESCRIPTIONS ANCHORS EMAIL NOTIFICATION
X       SUBJECT STRING METADSC BACKLINKS SIG
X);
X
X# These are the string fields and the list fields.
X# All remaining fields are integers.
X
Xmy %string_fields = map { $_, 1 } qw(
X       URL HEAD TITLE EMAIL NOTIFICATION SUBJECT STRING METADSC
X);
X
Xmy %list_fields = map { $_, 1 } qw( DESCRIPTIONS ANCHORS );
X
X
X# These variables are used by &encode_url and &decode_url:
X
Xuse constant FIRST_INTERNAL_SINGLECHAR =>  7;
Xuse constant  LAST_INTERNAL_SINGLECHAR => 31;
X
Xmy ( @url_parts, %url_parts, $url_parts );
Xmy @default_url_parts = qw(
X    http:// http://www. ftp:// ftp://ftp. /pub/
X    .html .gif .jpg .jpeg /index.html /index.htm
X    .com/ .com mailto:
X);
Xmy $matchchars = sprintf '[\0%o-\0%o]', FIRST_INTERNAL_SINGLECHAR,
X                                        LAST_INTERNAL_SINGLECHAR;
Xmy $maxparts = LAST_INTERNAL_SINGLECHAR -
X              FIRST_INTERNAL_SINGLECHAR + 1;
Xmy $warning  = '';
X
Xsub set_url_parts {    # Setup for variables used by
X                       # &encode_url and &decode_url
X    my $code = FIRST_INTERNAL_SINGLECHAR;
X    %url_parts = map { $_, chr($code++) } @url_parts = @_;
X    $url_parts = join '|', map { quotemeta($_) } @_;
X    $warning = "Too many common_url_parts: can't handle more than $maxparts.\n"
X        if @url_parts > $maxparts;
X}
X
Xset_url_parts( @default_url_parts );   # Initialize with defaults.
X
Xsub getnum {                   # Extract integer from doc record.
X    my ( $flags, $in ) = @_;
X    my ( $fmt, $length ) = ( 'I', 4 );
X       ( $fmt, $length ) = ( 'C', 1 ) if $flags & 0100;
X       ( $fmt, $length ) = ( 'S', 2 ) if $flags & 0200;
X    $_[1] = substr($in,$length+1);
X    unpack($fmt,substr($in,1)); 
X}
X
Xsub getstring {                        # Extract string from doc record.
X    my $length = getnum( @_ );
X    my $string = substr( $_[1], 0, $length );
X    $_[1] = substr( $_[1], $length );
X    return $string;
X}
X
Xsub getlist {                  # Extract list from doc record.
X    my ( $flags, $in ) = @_;
X    my $count = getnum( $flags, $in );
X    my @list = ();
X    for ( my $i=0; $i<$count; $i++ ) {
X       my $length = 255;
X       if ( $flags ) {
X           $length = unpack('C',$in); 
X           $in = substr($in,1);
X       }
X       if ( $length > 253 ) {
X           $length = unpack('I',$in);
X           $in = substr($in,4);
X       }
X       push @list, substr($in,0,$length);
X       $in = substr($in,$length);
X    }
X    $_[1] = $in;
X    return \@list;
X}
X
Xsub parse_docdb
X{
X    my $record = shift;
X    my %record = (); 
X    while ( length($record) > 0 ) {
X       my $code  = unpack('C', $record);
X       my $flags = $code & 0300;
X       $code &= 077;
X       if ( $code > $#fields ) {
X           carp "Invalid item code: $code";    
X           last;
X       }
X       my $field = $fields[$code];
X       my $value;
X       if ( $list_fields{$field} ) {
X           $value = getlist($flags,$record);
X       } elsif ( $string_fields{$field} ) {
X           $value = getstring($flags,$record);
X           $value = decode_url($value) if $field eq 'URL';
X           if ( $field eq 'HEAD' && substr($value,0,2) eq "x\234" ) {
X                 if ( defined &inflateInit ) {
X                   my ( $i, $zstatus ) = inflateInit();
X                   ( $value, $zstatus ) = $i->inflate($value);
X                }
X                else {
X                    $value = 'Compressed data: Zlib not available';
X                }
X           }
X
X       } else {
X           $value = getnum($flags,$record);
X       }
X       return $value if $_[0] && $_[0] eq $field;
X       $record{$field} = $value;
X    }
X    return $_[0] ? '' : %record;
X}
X
Xsub decode_url {
X    local($_) = shift;
X    if ( $warning ) {
X        carp $warning;
X    }
X    else {
X       s/$matchchars/$url_parts[ord($&)-&FIRST_INTERNAL_SINGLECHAR]/oeg;
X    }
X    $_;
X}
X
Xsub encode_url {
X    local($_) = shift;
X    if ( $warning ) {
X       carp $warning;
X    }
X    else {
X       s/($url_parts)/$url_parts{$&}/eg;
X    }
X    $_;
X}
X
Xsub get_config {
X#
X#   The first argument is the name of an htdig config file.
X#   The second parameter, if present, is a hash ref that is
X#   to receive the config values.  The second parameter is
X#   used only for recursive calls in the case of included files.
X#   A hash ref is returned on success, or undef if the
X#   file cannot be opened.
X#
X    my $file   = shift;
X    my $config = shift || {};
X    no strict 'refs';
X    if ( ! open $file, $file ) {
X       return undef;
X    }
X    while ( <$file> ) {
X        $_ .= <$file> while s/\\\n/ / && ! eof($file);
X       next if /^\s*(#|$)/;
X       ( my $key, $_ ) = /^\s*(\w+)\s*:\s*(.*)/;
X       if ( ! $key ) {
X           carp "Syntax error in $file (line $.)";
X           next;
X       }
X       s/\${(\w+)}/$config->{$1} || ''/ge;     # variable substitution
X       s/`([^`]+)`/file_contents($1,$file)/ge; # file substitution
X       if ( $key eq 'include' ) {
X           $_ = "$1/$_" if ! m|^/| && $file =~ m|(.*)/|;
X           get_config( $_, $config );
X       }
X       else {
X           $config->{$key} = $_;
X       }
X    }
X    close $file;
X    use strict 'refs';
X    my $parts = $config->{common_url_parts};
X    set_url_parts( defined($parts) ? split( ' ', $parts ) :
X                                    @default_url_parts         );
X    $warning .= "URL translation can't handle url_part_aliases.\n"
X        if $config->{url_part_aliases};
X    return $config;
X}
X
Xsub file_contents {
X#
X#   Return the contents of a file, with newlines
X#   replaced by a single space.
X#
X    if ( ! open FILE, $_[0] ) {
X        carp "Can't access included file $_[0] at $_[1] line $.";
X       return '';
X    }
X    my @file_contents = map { chomp; $_ } <FILE>;
X    close FILE;
X    return "@file_contents";
X}
X
X1;
END_OF_FILE
if test 9693 -ne `wc -c <'Database.pm'`; then
    echo shar: \"'Database.pm'\" unpacked with wrong size!
fi
# end of 'Database.pm'
fi
if test -f 'listconfig' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'listconfig'\"
else
echo shar: Extracting \"'listconfig'\" \(631 characters\)
sed "s/^X//" >'listconfig' <<'END_OF_FILE'
X#! /usr/local/bin/perl -w
X#
X# listconfig -- list the contents of an ht://Dig configuration file
X#
X# Usage:  listconfig config_file
X#
X# This is primarily intended as a test and demo of the get_config
X# function in the HtDig::Database module.
X#
X# $Id: listconfig,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listconfig,v $
X
Xuse HtDig::Database qw(:all);
Xuse strict;
X
Xmy $config_file = shift
X    or die "usage: $0 config_file\n";
X
Xmy $config = get_config( $config_file )
X    or die "$0: can't access $config_file\n";
X
Xfor ( sort keys %$config ) {
X    print "$_:\t$config->{$_}\n";
X}
X
END_OF_FILE
if test 631 -ne `wc -c <'listconfig'`; then
    echo shar: \"'listconfig'\" unpacked with wrong size!
fi
chmod +x 'listconfig'
# end of 'listconfig'
fi
if test -f 'listdocdb' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'listdocdb'\"
else
echo shar: Extracting \"'listdocdb'\" \(2207 characters\)
sed "s/^X//" >'listdocdb' <<'END_OF_FILE'
X#! /usr/local/bin/perl -w
X#
X# listdocdb -- List ht://Dig document database
X#
X# Usage:  listdocdb [-v [-v]] config_file [ docdb_file ]
X#
X# By default, only the URLs are listed from the database.
X# With a single "-v" flag, other fields are also listed,
X# including the first 60 characters of the HEAD excerpt.
X# With two "-v" flags, the full text of the HEAD excerpt
X# is shown.
X#
X# If the docdb file isn't specified on the command line,
X# the path will be found in the config file, or if it
X# can't be found there, a guess will be made based on
X# the path to the config file.
X# 
X# $Id: listdocdb,v 1.1 2000/01/19 00:17:02 wjones Exp $
X# $Source: /home/wjones/src/CVS.repo/htdig/local-additions/listdocdb,v $
X
Xuse DB_File;
Xuse HtDig::Database qw(:all);
Xuse strict;
X
Xmy $verbose = 0;
X
Xwhile ( @ARGV && $ARGV[0] eq '-v' ) {
X    $verbose++;
X    shift;
X}
X
Xmy ( $config_file, $docdb ) = @ARGV;
X
Xdie "Usage: $0 [-v] config_file [docdb_file]\n" if ! $config_file;
X
Xmy $config = get_config( $config_file ) or
X    die "$0: Can't access $config_file\n";
X
Xif ( ! $docdb ) {
X
X    # If database file isn't specified on the command line,
X    # get it from the config file, or guess based on path
X    # to config file.
X   
X    my $database_base = $config->{database_base};
X    if ( ! $database_base ) {
X       my $database_dir = $config->{database_dir};
X       if ( ! $database_dir ) {
X           my ( $config_dir ) = ( $config_file =~ m|(.*)/| );
X           $database_dir = $config_dir ? "$config_dir/../db" : "../db";
X       }
X       $database_base = "$database_dir/db";
X    }
X    $docdb = "$database_base.docdb";
X}
X
Xmy %docdb;
Xtie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
X    die "$0: Unable to open $docdb: $!";
X
Xwhile ( my ( $key, $value ) = each %docdb ) {
X    next if $key =~ /^nextDocID/;
X    if ( ! $verbose ) {
X        print decode_url( $key ), "\n";
X    }
X    else {
X        my %rec = parse_docdb( $value );
X       for ( sort keys %rec ) {
X           my $field = $rec{$_};
X           $field = join( "\n\t\t", @$field ) if ref($field) eq 'ARRAY';
X           $field = localtime( $field )       if /^(TIME|ACCESSED)$/;
X           $field = substr( $field, 0, 60 )   if /^HEAD$/ && $verbose < 2;
X           printf "%13s:  %s\n", $_, $field;
X       }
X       print '='x60, "\n";
X    }
X}
X
END_OF_FILE
if test 2207 -ne `wc -c <'listdocdb'`; then
    echo shar: \"'listdocdb'\" unpacked with wrong size!
fi
chmod +x 'listdocdb'
# end of 'listdocdb'
fi
echo shar: End of shell archive.
exit 0
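P.S. As noted in the POD, get_config returns every value as a flat
scalar, so list-valued attributes (start_url, common_url_parts, and
so on) have to be split by the caller.  A minimal sketch, with a
hand-built hash standing in for the result of a real get_config call:

```perl
#!/usr/bin/perl -w
use strict;

# Stand-in for get_config's return value -- a real call would read
# these from a config file (attribute values invented here).
my $config = {
    start_url        => 'http://a.example/ http://b.example/',
    common_url_parts => 'http:// .html .com/',
};

# List-valued attributes are whitespace-separated scalars:
my @starts = split ' ', $config->{start_url};
my @parts  = split ' ', $config->{common_url_parts};

print scalar(@starts), " start URLs, ", scalar(@parts), " URL parts\n";
```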

------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] 
You will receive a message to confirm this. 
