Re: using doc2html (was [htdig] using conv_doc.pl to index MS Word documents)

shams khan Tue, 19 Nov 2002 05:57:15 -0800

Hi List,

I am having difficulty indexing PDF and Word documents.

.doc Word documents are not indexed at all, and .pdf documents are indexed but
show garbled text in the search results.

I've checked over the setup of doc2html, pdf2html, CATDOC and XPDF
carefully, and I can't see where I am going wrong.

When I run rundig, I get the following error on word documents:

        3:3:1:http://10.5.1.35/worddocument.doc:  !      UNABLE to convert size  =  8060

        Deleted, no excerpt:  0/http://10.5.1.35/worddocument.doc

PDF documents seem to be indexed okay, as I get the following message:

        14:14:1:http://10.5.1.35/pdfdocuement/.pdf:     size  =  22350

BUT, in the search results for a pdf document I get the following formatted
results:

[msag_3_1_theassesment.pdf]
%PDF-1.3 %���� 15 0 obj << /Linearized 1 /O 17 /H [ 1120 227 ] /L 30732 /E
16677 /N 4 /T 30314 >> endobj xref 15 34 0000000016 00000 n 0000001027 00000
n 0000001347 00000 n 0000001554 00000 n 0000001761 00000 n 0000001800 00000
n 0000002309 00000 n 0000002511 00000 n 0000002699 00000 n 0000003100 00000
n 0000003121 00000 n 0000003830 00000 n 0000004011 00000 n 0000004409 00000
n 0000004430 00000 n 0000005141 00000 n 0000005162 00000 n 0000005917 00000
n 0000005938 00000 n 0000006724 00000 n 0000007175 00000 n 0000007371 00000
n 0000007392 00000 n 0000008115 00000 n 0000008136 00000 n 0000008883 00000
n 0000008904 00000 n 0000009699 00000 n 0000013119 00000 n 0000013140 00000
n 0000013771 00000 n 0000013849 00000 n 0000001120 00000 n 0000001326 00000
n trailer << /Size 49 ...
http://10.5.1.35/msag_3_1_theassesment.pdf 10/28/02, 30732 bytes

Whereas, using conv_doc.pl for PDF documents, I get the following results
(I don't know why it gives it the title of a Microsoft Word document; it *is*
a PDF document):

Microsoft Word - msag_3_1_theassesment.doc
Management Self-Assessment Guide MSAG 3.1 Originated by: Approved by: Page 1
of 4 Version: 1.0 3.1 The Assessment A. Business Strategy & Planning TRUE
FALSE 1. Our Company has a Mission Statement for the business, which
reflects and incorporates our company values. 2. Our company has a corporate
or business plan, which is consistent with this mission statement. 3. We
have a clear sense of direction, which is evident in the objectives and
targets we have set for the company. 4. Our strategic direction is based on
a proper evaluation of the marketplace and the opportunities therein. 5. We
have a management team who regard the business plan as their own and are
committed to driving the company forward to meet its objectives. 6. All our
employees share our vision of the future. Total B. Marketing ...
http://10.5.1.35/msag_3_1_theassesment.pdf 10/28/02, 30732 bytes

----------------------------------------------------------------------------

Details of my setup are below:

I've used the doc2html.pl and pdf2html.pl scripts that ship with htdig
3.1.6.  I've moved both scripts to /usr/local/bin/

I've edited my htdig.conf file so that these lines are included:

        external_parsers:    application/pdf->text/html /usr/local/bin/doc2html.pl \
                             application/msword->text/html /usr/local/bin/doc2html.pl

I downloaded the 0.90.3 (stable) release of CATDOC from
http://www.ice.ru/~vitus/catdoc/
and then installed it (using ./configure, make, make install).

I've also installed XPDF using their instructions.

I've included the config section of my doc2html.pl below, followed by my
pdf2html.pl.

Can anyone suggest what I can do?

Thanks for any help!!!

Shams


----------------------------------------------------------------------------

MY DOC2HTML.PL :

#!/usr/bin/perl
use strict;
#
# Version 3.0 4-June-2001
#
# External converter for htdig 3.1.4 or later (Perl5 or later)
# Usage: (in htdig.conf)
#
#external_parsers: application/rtf->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   text/rtf->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/pdf->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/postscript->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/msword->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/wordperfect5.1->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/msexcel->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/vnd.ms-excel->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/vnd.ms-powerpoint->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/x-shockwave-flash->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl \
#   application/x-shockwave-flash2-preview->text/html /opt/local/htdig-3.1.5/scripts/doc2html.pl
#
#  Uses wp2html to convert Word and WordPerfect documents into HTML, and
#  falls back to using Catdoc for Word and Catwpd for WordPerfect if
#  Wp2html is unavailable or unable to convert.
#
#  Uses range of other converters as available.
#
#  If all else fails, attempts to read file without conversion.
#
########################################################################################
# Written by David Adams <[EMAIL PROTECTED]>.
# Based on conv_doc.pl written by Gilles Detillieux <[EMAIL PROTECTED]>,
#   which in turn was based on the parse_word_doc.pl script, written by
#   Jesse op den Brouw <[EMAIL PROTECTED]>.
########################################################################################

# Install Sys::AlarmCall if you can
eval "use Sys::AlarmCall";

########  Full paths to conversion utilities  ##########
########          YOU MUST SET THESE          ##########
########  (comment out those you don't have)  ##########

# Wp2html converts Word & Wordperfect to HTML
# (get it from: http://www.res.bbsrc.ac.uk/wp2html/)
my $WP2HTML = '';

#Catwpd for WordPerfect to text conversion
# (you don't need this if you have wp2html)
# (get it from htdig site)
my $CATWPD = '';

# rtf2html converts Rich Text Format documents to HTML
# (get it from http://www.fe.msk.ru/~vitus/catdoc/)
my $RTF2HTML = '';

# Catdoc converts MS Word to plain text
# (get it from: http://www.fe.msk.ru/~vitus/catdoc/)

#version of catdoc for Word6, Word7 & Word97 files:
my $CATDOC = '/usr/local/bin/catdoc';

#version of catdoc for Word2 files
my $CATDOC2 = $CATDOC;

#version of catdoc for Word 5.1 for MAC
my $CATDOCM = $CATDOC;

# PostScript to text converter
# (get it from the ghostscript 3.33 (or later) package)
my $CATPS = '';

# add to search path the directory which contains gs
# (edit for your environment)
$ENV{PATH} .= ":/usr/freeware/bin";

# PDF to HTML conversion script
# Full pathname of Perl script pdf2html.pl
my $PDF2HTML = '/usr/local/bin/pdf2html';

#Microsoft Excel to HTML converter
# (get it from www.xlHtml.org)
my $XLS2HTML = '';

#Microsoft Excel to .CSV converter
# (you don't need this if you have xlHtml)
# (if you do want it, you can get it with catdoc)
my $CATXLS = '';

#Microsoft Powerpoint to HTML converter
# (get it from www.xlHtml.org)
my $PPT2HTML = '';

#Shockwave Flash
# (extracts links from file)
# Full pathname of Perl script swf2html.pl
my $SWF2HTML = '';

########################################################################
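
Since doc2html.pl needs each of these settings to name an executable file (a directory such as /usr/local/bin will not do), the configured paths can be checked in one pass. The following is a minimal sketch under stated assumptions, not part of doc2html.pl: the sub name check_converters is made up for illustration, and the path list is simply copied from the configuration shown in this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes name => path pairs and returns a list of
# problem descriptions. Empty paths are skipped, because unused converters
# are left as '' in the configuration.
sub check_converters {
    my %tools = @_;
    my @problems;
    for my $name (sort keys %tools) {
        my $path = $tools{$name};
        next unless length $path;
        if (-d $path) {
            push @problems, "$name: $path is a directory, not a program";
        } elsif (!-e $path) {
            push @problems, "$name: $path does not exist";
        } elsif (!-x $path) {
            push @problems, "$name: $path is not executable";
        }
    }
    return @problems;
}

# Paths as configured in doc2html.pl and pdf2html.pl in this message.
my @problems = check_converters(
    CATDOC    => '/usr/local/bin/catdoc',
    PDF2HTML  => '/usr/local/bin/pdf2html',
    PDFTOTEXT => '/usr/bin/pdftotext',
    PDFINFO   => '/usr/bin/pdfinfo',
);
print "$_\n" for @problems;
print "all configured paths look usable\n" unless @problems;
```

Running it as the same user that runs rundig matters, since file permissions can differ between accounts.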

----------------------------------------------------------------------------

MY PDF2HTML.PL :

#!/usr/bin/perl -w
use strict;
#
# Version 1.0 25-May-2001
# Written by David Adams <[EMAIL PROTECTED]>
#
# Uses pdftotext & pdfinfo utilities from the xpdf package
# to read an Adobe Acrobat file and produce HTML output.
#
# Can be called directly from htdig as an external converter,
#  or may be called by doc2html.pl converter script.
#

####--- Configuration ---####
# Full paths of pdftotext and pdfinfo
# (get them from the xpdf package at http://www.foolabs.com/xpdf/):

#### YOU MUST SET THESE  ####

my $PDFTOTEXT = "/usr/bin/pdftotext";
my $PDFINFO = "/usr/bin/pdfinfo";
#
# De-hyphenation option (only affects end-of-line hyphens):
my $Dehyphenate = 1;
#
# Set title to be used when none is found:
my $Default_title = "Adobe Acrobat Document";
#
# make portable to win32 platform or unix:
my $null = "/dev/null";
if ($^O eq "MSWin32") {$null = "nul";}
####--- End of configuration ---###

if (! -x $PDFTOTEXT) { die "Unable to execute pdftotext" }

my $Input = $ARGV[0] || die "Usage: pdf2html.pl filename [mime-type] [URL]";
my $MIME_type = $ARGV[1] || '';
if ($MIME_type and ($MIME_type !~ m#^application/pdf#i)) {
  die "MIME/type $MIME_type wrong";
}

my $Name = $ARGV[2] || '';
$Name =~ s#^.*/##;
$Name =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;

&pdf_head;
&pdf_body;
exit;

#------------------------------------------------------------------------------

sub pdf_head {
#
#  Contributed by Greg Holmes and Michael Fuller
#   (any errors by David Adams)
#
    my $title = '';
    my $subject = '';
    my $keywords = '';
    if (open(INFO, "$PDFINFO '$Input' 2>$null |")) {
        while (<INFO>) {
            if (m/^title:/i) {
                s/^title:\s+//i;
                $title = &clean_pdf($_);
            } elsif (m/^subject:/i) {
                s/^subject:\s+//i;
                $subject = &clean_pdf($_);
            } elsif (m/^keywords:/i) {
                s/^keywords:\s+//i;
                $keywords = &clean_pdf($_);
            }

        }
        close INFO;
    } else { warn "cannot execute pdfinfo" }
    if (not length $title) {
      if ($Name) {
        $title = '[' . $Name . ']';
      } else {
        $title = $Default_title;
      }
    }

    print "<HTML>\n<HEAD>\n";
    print "<TITLE>$title</TITLE>\n";
    if (length $subject) {
      print '<META NAME="DESCRIPTION" CONTENT="' . $subject. "\">\n";
    }
    if (length $keywords) {
      print '<META NAME="KEYWORDS" CONTENT="' . $keywords . "\">\n";
    }
    print "</HEAD>\n";

###print STDERR "\n$Name:\n";
###print STDERR "\tTitle:\t$title\n";
###print STDERR "\tDescription:\t$subject\n";
###print STDERR "\tKeywords:\t$keywords\n";

}

#------------------------------------------------------------------------------

sub pdf_body {

  my $bline = '';
  open(CAT, "$PDFTOTEXT -raw '$Input' - |") ||
   die "$PDFTOTEXT doesn't want to be opened using pipe\n";
  print "<BODY>\n";
  while (<CAT>) {
    while ( m/[A-Za-z\300-\377]-\s*$/ && $Dehyphenate) {
      $_ .= <CAT>;
      last if eof;
      s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s;
    }
    s/\255/-/g; # replace dashes with hyphens
    # replace bell, backspace, tab. etc. with single space:
    s/[\000-\040]+/ /g;
    $_ = &HTML($_);
    if (length) {
      print $bline, $_, "\n";
      $bline = "<br>\n";
    } else {
      $bline = "<p>\n";
    }
  }
  close CAT;

  print "</BODY>\n</HTML>\n";
  return;
}

#------------------------------------------------------------------------------

sub HTML {

  my $text = shift;

  $text =~ s/\f/\n/gs; # replace form feed
  $text =~ s/\s+/ /g; # replace multiple spaces, etc. with a single space
  $text =~ s/\s+$//gm; # remove trailing space
  $text =~ s/&/&amp;/g;
  $text =~ s/</&lt;/g;
  $text =~ s/>/&gt;/g;
  chomp $text;

  return $text;
}

#------------------------------------------------------------------------------

sub clean_pdf {
# removes odd pair of characters that may be in pdfinfo output
# Any double quotes are replaced with single

  my $text = shift;
  chomp $text;
  $text =~  s/\376\377//g;
  $text =~  s/\"/\'/g;
  return $text;
}
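
The de-hyphenation and HTML-escaping steps in the listing above can be tried on their own, without xpdf installed. The following is a minimal sketch under stated assumptions, not part of pdf2html.pl: the sub names dehyphenate and escape_html and the sample string are made up for illustration, and the /g flag is added here so that every split word in the input is joined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join words that were split by an end-of-line hyphen,
# using the same pattern as the substitution in pdf_body (plus /g).
sub dehyphenate {
    my $text = shift;
    $text =~ s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/gs;
    return $text;
}

# Hypothetical helper: the same three entity replacements as the HTML sub.
# The ampersand is handled first, or "&lt;" would become "&amp;lt;".
sub escape_html {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Made-up sample input, not taken from a real PDF.
my $sample = "Business Strategy & Plan-\nning <draft>";
print escape_html(dehyphenate($sample)), "\n";
```

With the sample above this prints: Business Strategy &amp; Planning &lt;draft&gt;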



----- Original Message -----
From: "David Adams" <[EMAIL PROTECTED]>
To: "shams khan" <[EMAIL PROTECTED]>; "ht://Dig"
<[EMAIL PROTECTED]>
Cc: "Gilles Detillieux" <[EMAIL PROTECTED]>
Sent: Friday, November 08, 2002 9:17 AM
Subject: Re: [htdig] using conv_doc.pl to index MS Word documents


> From: "shams khan" <[EMAIL PROTECTED]>
> To: "ht://Dig" <[EMAIL PROTECTED]>
> Cc: "Gilles Detillieux" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 05, 2002 9:34 AM
> Subject: Re: [htdig] using conv_doc.pl to index MS Word documents
>
>
> >
> > Hi,
> >
> > I've tried doc2html.pl, but am also having problems indexing word documents.
> > I have the following line in the doc2html.pl script:
> >
> > #version of catdoc for Word6, Word7 & Word97 files:
> > my $CATDOC = '/usr/local/bin';
>
> $CATDOC should be the full path name of the catdoc executable, not the
> directory containing it.
>
> >
> > And, I have the catdoc package installed (with the catdoc binaries in
> > /usr/local/bin and /usr/local/lib)
> >
> > After changing the line in htdig.conf to use doc2html.pl, I get the
> > following error messages when rundig tries to index word documents:
> >
> > http://10.5.1.35/sme/micro/test.doc: !          UNABLE to convert  size = 11264
> >
> > any suggestions on what could be wrong ?
> >
> > Also, is there any benefit of using doc2html.pl over conv_doc.pl to index
> > .pdf documents for htDig?
> >
> > I have been using conv_doc.pl and it has been giving me very satisfactory
> > results. I did try doc2html.pl as well to see if there was any difference;
> > however, with doc2html.pl I found that fewer pdf documents were indexed and
> > all the excerpts on htDig search results were garbled.
> >
> > (e.g. 0000000016 00000 n 0000001025 00000 n 0000001337 00000 n 0000001543
> > 00000 n 0000001750 00000 n 0000001789 00000 n 0000002255 00000 n 0000002455
> > 00000 n 0000002643 00000 n 0000003042 00000 n 0000003064 00000 n 00000).
> >
> > Thanks for your help,
> >
> > Shams
> >
>
> The pdf2html.pl script that comes with doc2html uses pdfinfo to get the
> document title, subject and keywords.   I don't think that conv_doc.pl does
> as much.
>
> If you are getting inferior results with doc2html.pl then your first step
> should be to check over your installation of doc2html.pl and pdf2html.pl
> very carefully.
>
> --
> David Adams
> Information Systems Services
> Southampton University
>

