According to me:
> I'd recommend not reinventing the wheel. Instead of a builtin parser,
> it would make a lot more sense to build an external parser around
> ghostscript. Its "ps2ascii" program, which is just a script that calls gs
> with specific options, would be a good starting point. You could modify a
> script like contrib/htparsedoc/parse_word_doc.pl to use ps2ascii instead
> of catdoc, and change the title it spits out. That would probably make
> a decent external PostScript parser. I haven't tried it, though.
OK, so I can't turn down a challenge. I had to poke around with the
parse_word_doc.pl script anyway, to test the catdoc problem Jesse had
reported, so I decided to enhance the script to handle PostScript files
too.
In the process, I found a small bug in ExternalParser.cc - it didn't remove
the temporary file it uses. Here's the patch for that:
--- ./htdig/ExternalParser.cc.noremove Mon Feb 1 13:46:23 1999
+++ ./htdig/ExternalParser.cc Tue Feb 9 17:56:18 1999
@@ -139,6 +139,7 @@
FILE *input = popen(command, "r");
if (!input)
{
+ unlink(path);
return;
}
@@ -335,6 +336,7 @@
}
}
pclose(input);
+ unlink(path);
}
And here is my new parse_word_or_ps_doc.pl script. OK, a shorter
name is in order. By the way, the original parse_word_doc.pl in
contrib/htparsedoc got messed up - all the long lines are folded, which
Perl really didn't like! Be careful that your mail program doesn't do
the same to this one, if you're going to use it. As you can see, this
script could easily be extended to handle any number of "something to
text" converters, as long as the file command can determine what the
file type is.
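To illustrate that last point, here is a rough sketch of how the type check could be driven by a table instead of an if/else. It is not part of the script below and I haven't run it against htdig. The pick_converter() routine, the table layout, and the PDF entry (including the pdftotext path) are assumptions made up for the example; only the PostScript and Word commands come from the script itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table: each entry pairs a pattern, matched against the line
# printed by file(1), with a type label and a pipe command template.
# The PDF entry is an illustrative assumption, not something the script does.
my @converters = (
    { pattern => qr/:\s*PostScript/, type => "PostScript",
      command => "(cd /tmp; /usr/bin/ps2ascii; rm -f _temp_.???) < %s |" },
    { pattern => qr/:\s*PDF/, type => "PDF",
      command => "/usr/bin/pdftotext %s - |" },
);

# Fallback, as in the script: anything unrecognized is handed to catdoc.
my %default = (
    type    => "Word",
    command => "/usr/local/bin/catdoc -a -w %s |",
);

# Return (type, pipe command) for $path, given the output line of file(1).
sub pick_converter {
    my ($file_output, $path) = @_;
    for my $c (@converters) {
        if ($file_output =~ $c->{pattern}) {
            return ($c->{type}, sprintf($c->{command}, $path));
        }
    }
    return ($default{type}, sprintf($default{command}, $path));
}

# Example call with a made-up line of file(1) output.
my ($type, $parse) = pick_converter("/tmp/doc.ps: PostScript document text",
                                    "/tmp/doc.ps");
print "$type: $parse\n";
```

Adding another converter would then mean adding one more entry to the table, provided file can tell the type apart. The checks that each converter is executable are left out here to keep the sketch short.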
--------------------- (snip) ---------------------
#!/usr/local/bin/perl
# 1998/12/10
# Added: push @allwords, $fields[$x]; <[EMAIL PROTECTED]>
# Replaced: matching patterns. they match words starting or ending with ()[]'`;:?.,! now, not when in between!
# Gone: the variable $line is gone (using $_ now)
#
# 1998/12/11
# Added: catdoc test (is catdoc runnable?) <[EMAIL PROTECTED]>
# Changed: push line semicolon wrong. <[EMAIL PROTECTED]>
# Changed: matching works for end of lines now <[EMAIL PROTECTED]>
# Added: option to rigorously delete all punctuation <[EMAIL PROTECTED]>
# 1999/02/09
# Added: option to delete all hyphens <[EMAIL PROTECTED]>
# Changed: uses ps2ascii to handle PS files <[EMAIL PROTECTED]>
#########################################
#
# set this to your catdoc proggie
#
# get it from: http://www.fe.msk.ru/~vitus/catdoc/
#
$CATDOC = "/usr/local/bin/catdoc";
#
# set this to your PostScript to text converter
# get it from the ghostscript 3.33 (or later) package
#
$CATPS = "/usr/bin/ps2ascii";
# need some var's
@allwords = ();
@temp = ();
$x = 0;
@fields = ();
$calc = 0;
#
# okay. my programming style isn't that nice, but it works...
#for ($x=0; $x<@ARGV; $x++) { # print out the args
# print STDERR "$ARGV[$x]\n";
#}
open(FILE, "file $ARGV[0] |") || die "Hmmm. Can't determine file type.\n";
if (<FILE> =~ /:\s*PostScript/) {
$parse = "(cd /tmp; $CATPS; rm -f _temp_.???) < $ARGV[0] |";
$type = "PostScript";
die "Hmm. ps2ascii is absent or unwilling to execute.\n" unless -x $CATPS;
} else {
$parse = "$CATDOC -a -w $ARGV[0] |";
$type = "Word";
die "Hmm. catdoc is absent or unwilling to execute.\n" unless -x $CATDOC;
}
close FILE;
#
# open it
open(CAT, "$parse") || die "Hmmm. parser doesn't want to be opened using pipe.\n";
while (<CAT>) {
s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g; # replace reading-chars with space (only at end or begin of word)
# s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g; # rigorously replace all by <[EMAIL PROTECTED]>
s/-/ /g; # replace hyphens with space
@fields = split; # split up line
next if (@fields == 0); # skip if no fields (does it speed up?)
for ($x=0; $x<@fields; $x++) { # check each field if string length > 3
if (length($fields[$x]) > 3) {
push @allwords, $fields[$x]; # add to list
}
}
}
close CAT;
#############################################
# print out the title
@temp = split(/\//, $ARGV[2]); # get the filename, get rid of the directory part
print "t\t$type Document $temp[-1]\n"; # print it
#############################################
# print out the head
$calc = @allwords;
print "h\t";
#if ($calc >100) { # but not more than 100 words
# $calc = 100;
#}
for ($x=0; $x<$calc; $x++) { # print out the words for the excerpt
print "$allwords[$x] ";
}
print "\n";
#############################################
# now the words
for ($x=0; $x<@allwords; $x++) {
$calc=int(1000*$x/@allwords); # calculate rel. position (0-1000)
print "w\t$allwords[$x]\t$calc\t0\n"; # print out word, rel. pos. and text type (0)
}
$calc=@allwords;
#print STDERR "# of words indexed: $calc\n";
--------------------- (snip) ---------------------
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.