Re: [htdig3-dev] No PostScript parsing -- at all

Gilles Detillieux Tue, 9 Feb 1999 14:17:01 -0500

According to Hans-Peter Nilsson:
> While tracing the whereabouts of some "spuriously deleted"
> documents, I came to debug the Postscript::parse() function.
>  It's just that there's not much to trace -- it immediately
> returns (line 56).
>  Looking in this year's mailing list contents, it seems that
> people think that ht://Dig can actually parse PostScript, and
> someone posted a problem description about not getting any
> output while indexing PostScript documents.  Small wonder... 
> 
> This "disabling" of PostScript parsing predates CVS logs.

Judging from the comments at the start of the file, it looks like this
source hasn't been touched in almost two years.  It seems very much to
be a work in progress, which was left for quite some time.

> Now, if I enable it by removing the "return", everything seems
> to work as expected, but debugging output appears; there are
> "naked" cout writes (not testing the "debug" flag).
>  Work "as expected" I say, because all words in PostScript files
> are not complete or easily parseable words; often one or two
> characters are expressed in ways that the PostScript parser
> cannot grok, so a chopped or otherwise munged word is indexed.
> See for example <URL:http://egcs.cygnus.com/scheduler.ps>.
> 
> This leads me to think that the PostScript parser is not as
> complete as needed, and possibly "disabled" for a good reason.
> Maybe it should be rewritten, using PDF.cc, or maybe the PDF
> parser has the same problems.

Parsing generic PostScript files is not an easy job.  PDF.cc is written
to handle the PostScript that Acrobat Reader spits out.  Recent versions
of xpdf spit out PostScript that is compatible with this, so it works
too, but PDF.cc is far from a generic PostScript parser.  Postscript.cc
seems to tackle some specific flavour of PostScript file, but it doesn't
look like it would handle most popular flavours without a lot of work.
The problem is that PostScript is so flexible that you either have to
limit your parsing to output that follows a rigid specification, or write
a full PostScript interpreter to deal with whatever styles are out there.

> I don't know.  Maybe someone has some good answers?

I'd recommend not reinventing the wheel.  Instead of a builtin parser,
it would make a lot more sense to build an external parser around
ghostscript.  Its "ps2ascii" program, which is just a script that calls gs
with specific options, would be a good starting point.  You could modify a
script like contrib/htparsedoc/parse_word_doc.pl to use ps2ascii instead
of catdoc, and change the title it spits out.  That would probably make
a decent external PostScript parser.  I haven't tried it, though.
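
A minimal sketch of that idea, in Python rather than Perl, and untested
against real ht://Dig.  It assumes that ps2ascii from ghostscript is on
the PATH and writes the extracted text to stdout when given only an input
file.  The tab-separated record layout ("t" title, "w" word with a
location from 0 to 1000, "h" excerpt) is an assumption about the external
parser interface and should be checked against the ht://Dig documentation.
The function names and the placeholder title are hypothetical.

```python
import re
import subprocess


def text_to_records(text, title):
    """Turn plain text into tab-separated parser records (assumed layout)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    records = ["t\t" + title]
    total = max(len(words), 1)
    for i, word in enumerate(words):
        # Location is the word's relative position, scaled to 0..1000.
        location = i * 1000 // total
        # Final field is a heading level; 0 is assumed to mean "body text".
        records.append("w\t%s\t%d\t0" % (word.lower(), location))
    # Use the first few words as the excerpt record.
    records.append("h\t" + " ".join(words[:20]))
    return records


def parse_postscript(path):
    """Run ps2ascii on a PostScript file and return parser records."""
    # Assumption: with a single argument, ps2ascii prints text to stdout.
    result = subprocess.run(
        ["ps2ascii", path], capture_output=True, text=True, check=True
    )
    # PostScript has no reliable title, so a placeholder title is used.
    return text_to_records(result.stdout, "PostScript document: " + path)
```

A wrapper script would call parse_postscript() on the file name it is
handed and print each record on its own line.  Words that ps2ascii itself
chops up would still be indexed in their chopped form.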

> Sidenote:
>  If your local_urls documents are stored with a time before era
> (1 Jan 1970), they may (linux) have a date older than nothing
> (negative date if your time_t is signed), and will not be
> indexed.   See Document.cc around line 550 (date is zero for
> newly encountered documents).
>  Not that this urgently needs fixing at this level; maybe a debug
> output saying "Whoops!  You have some really old documents here"
> is in order (I may fix).

Is anyone indexing documents that are really that old, or is it a problem
with incorrectly set modification times?  It would be pretty easy to sniff
out these files.  E.g.:

        find /home/httpd -mtime +10500 -ls

Modification times that predate 1970 may be a problem with other programs
as well, so I'd recommend finding and fixing them.  These usually indicate
that some program stomped on the modification time, e.g. by zeroing it out.
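
The same check can be sketched in Python for systems where find's
options differ.  This is only an illustration: the function name
find_old_files and its cutoff parameter are made up here, and unlike
"-mtime +10500", which is relative to the current date, the cutoff is an
absolute time in seconds since the Epoch.

```python
import os


def find_old_files(root, cutoff=0.0):
    """Return paths under root whose modification time is <= cutoff.

    cutoff is in seconds since the Unix Epoch; the default of 0 catches
    zeroed-out and pre-1970 modification times.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime <= cutoff:
                suspects.append(path)
    return sorted(suspects)
```

Calling find_old_files("/home/httpd") would list the files whose times
need fixing.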

>  Hope all systems get a 64-bit time_t -- or at least unsigned --
> before 2038...

Let's just get past the next year first, OK?  ;-)  Hopefully there
won't be the media-induced hysteria surrounding "t32b" that there's been
about y2k!  If we haven't ported all our code to 64- or 128-bit processors
by then, we get what we deserve, right?  :-P

In any case, as long as file times are stored relative to the Unix Epoch,
then anything before that will appear as a negative number, even with a
64-bit time_t.
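
To make the arithmetic concrete, here is a small Python sketch.  The
helper from_time_t is hypothetical; it simply adds a signed count of
seconds to the Epoch.  It shows where a signed 32-bit counter runs out
and that a pre-1970 time is a negative count regardless of the width.

```python
from datetime import datetime, timedelta, timezone

# The Unix Epoch: 1 Jan 1970, 00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value a signed 32-bit time_t can hold.
INT32_MAX = 2**31 - 1


def from_time_t(seconds):
    """Convert a signed seconds-since-Epoch count to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


print(from_time_t(INT32_MAX))  # 2038-01-19 03:14:07+00:00
print(from_time_t(-86400))     # 1969-12-31 00:00:00+00:00
```

Widening time_t pushes the upper limit far into the future, but the
second line still needs a negative value.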

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930