Date: Fri, 28 Apr 2000 01:38:27 -0400
From: Sandy Harris <[EMAIL PROTECTED]>
John Gilmore wrote:
>
> [I think we need software for automatically extracting the words from PDF
> and MS-Word documents so they can be found in web searches. It looks like
> the bad guys are deliberately putting lots of interesting stuff in PDF
> to make it hard to find and read. --gnu]
For GPLd software that reads Word files, see:
http://www.wvWare.com/
http://www.csn.ul.ie/~caolan/docs/MSWordView.html
Go take a look at the New Zealand Digital Library project,
http://www.nzdl.org/fast-cgi-bin/library. In particular,
go halfway down the page and click on Technology, and then
on (for instance) PreScript.
They've got lots of GPL'ed software for extracting text from
PostScript (better than the other solutions out there, apparently),
generating keywords by scanning documents, generating web pages, etc
etc. Cool stuff. I haven't tried actually running or reviewing their
code, but I like what I've seen so far. They may have other tools for
extracting stuff from other document types, or have them in development.
Also, http://www.cora.justresearch.com/Artificial_Intelligence/Agents/index.html
used to be a search engine that spidered PostScript (not PDF), mostly
CS papers. However, it appears to have been taken down (no such
host). http://www.justresearch.com/publications/publications.html
still links to Cora, but following that link gets a 404.
There are a number of publications there, though, and perhaps the
search engine itself still exists somewhere.
It's true---we need engines that search Word, PDF, and PS. The
biggies should do it for us, but perhaps there's a good niche for a
spider that spiders -only- these document types if the big engines
won't. Unfortunately, unless the Word/PDF/PS engine wants to spider
-all- pages itself, it probably has to ask other search engines for
pages with suspiciously-named links on them---which has its own
problems. And, of course, it must pull down a disproportionately large
number of bytes per searched document compared to normal search
engines ('cause these documents are always so large), which is no
doubt one of the reasons why the normal search engines don't bother.
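
A rough sketch of that link-filtering step, in Python: given the HTML of
a page (say, one handed back by another search engine), keep only the
links whose names look like Word, PDF, or PostScript files. The
extension list, the name document_links, and the example.org URLs are
assumptions for illustration only, not anything an existing spider is
known to use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumption: these extensions are what counts as a "suspiciously-named" link.
DOC_EXTENSIONS = {".pdf", ".ps", ".doc"}


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def document_links(html, base_url):
    """Return links on the page that appear to point at Word/PDF/PS files."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()

    found = []
    for link in collector.links:
        # Look only at the path, so query strings don't hide the extension.
        path = urlparse(link).path
        extension = posixpath.splitext(path)[1].lower()
        if extension in DOC_EXTENSIONS and link not in found:
            found.append(link)
    return found


if __name__ == "__main__":
    page = '<a href="paper.PDF">paper</a> <a href="index.html">home</a>'
    for url in document_links(page, "http://example.org/docs/"):
        print(url)
```

This only judges a link by its name, so it misses documents served from
URLs without a telling extension (and compressed files such as .ps.gz);
a real spider would probably also have to check the Content-Type the
server returns.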
This might be an excellent thing to try to persuade the Google people
to do. After all, they seem to concentrate on the same sorts of pages
which are likely to have lots of PDF/PS content anyway.
P.S. I wouldn't say it's a deliberate campaign to obfuscate---it's
just an artifact of them using Microsoft products, which default to
Word, and for which it's easiest to then go straight to PDF as part of
printing. As usual, laziness is indistinguishable from malice.