We had some interest in indexing a number of images and photographs
at our site, many of which are online on the Web. I was wondering if
I could use htDig to help do this.

I realized that htDig saved keywords from the link in the referring
document, such as
  <a href="eagle.jpeg">picture of an eagle</a>
or even
  <img src="osprey.jpg" alt="picture of an osprey">
but that normally these are discarded, since image/jpeg is not an
indexable type.

Without making any substantial changes to htDig, but by
creating an external parser for image data types, I can
retain these keywords and thus index images.

The first version of the parser read only a few kB from the beginning of
the image file, enough to extract metadata such as image size, comments
etc., and returned text with a magic word "XFILE". It was then possible to
search for images in a mostly text corpus using a string such as
"xfile eagle".
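
As an illustration of that first step, here is a minimal Python sketch,
not the actual script. It assumes the input is a JPEG and that the first
few kB of the file are already in memory. The function names and the
layout of the returned text are hypothetical, and how the text is handed
back to htDig depends on the external parser setup, which is not shown.

```python
import struct

def jpeg_metadata(data):
    """Return (width, height, comments) from the leading bytes of a JPEG.

    Works on a truncated file: parsing simply stops when the data runs out.
    """
    if data[:2] != b"\xff\xd8":          # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG")
    width = height = None
    comments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:            # lost sync with the segment structure
            break
        marker = data[pos + 1]
        if marker == 0xFF:               # fill byte before a marker, skip it
            pos += 1
            continue
        if marker == 0xDA:               # start of scan: compressed data follows
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xFE:               # COM segment holds a free-text comment
            comments.append(payload.decode("latin-1").strip())
        elif 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
            # SOFn segment: 1 byte precision, then height and width
            if len(payload) >= 5:
                height, width = struct.unpack(">HH", payload[1:5])
        pos += 2 + length
    return width, height, comments

def describe(data, magic="XFILE"):
    """Build one line of searchable text, led by the magic word."""
    width, height, comments = jpeg_metadata(data)
    parts = [magic]
    if width is not None:
        parts.append("{}x{}".format(width, height))
    parts.extend(comments)
    return " ".join(parts)
```

A script along these lines would read something like the first 4096
bytes of the file and print the result of describe() on them.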

The next version read the entire image (which in some cases required a
doc_size of 1 MB or more) and created a thumbnail image stored locally
under a unique name. When a context string is created with a link to the
thumbnail, it is possible to put inline thumbnails in the search results
in a similar way to Google or AltaVista image search.
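
The naming and linking part of that step could look roughly like the
sketch below. It is only an assumption about one workable scheme: the
unique name is taken from a SHA-1 hash of the image URL, the /thumbs/
base path is hypothetical, and the actual resizing of the image is left
out.

```python
import hashlib
import html

def thumbnail_name(image_url, ext=".jpg"):
    """Derive a stable, unique local file name from the image URL."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return digest + ext

def context_with_thumbnail(image_url, text, thumb_base="/thumbs/"):
    """Build a context string: a linked inline thumbnail plus the text."""
    src = thumb_base + thumbnail_name(image_url)
    return '<a href="{}"><img src="{}" alt="{}"></a> {}'.format(
        html.escape(image_url, quote=True),   # link to the full-size image
        html.escape(src, quote=True),         # link to the stored thumbnail
        html.escape(text, quote=True),
        html.escape(text),
    )
```

Hashing the URL means the same image always maps to the same thumbnail
file, so a re-index overwrites the old thumbnail instead of piling up
copies.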

In order to do this I backed out some of the code that de-fangs HTML in
the context text. The result is not quite right: bolding goes wrong if
text inside the link itself (such as the filename) matches the search term.

Example search at
http://andrew.triumf.ca/htdig/search.html
e.g. "xfile chamber"

scripts at
http://andrew.triumf.ca/htdig/mods/


The basic idea (providing a script to index a non-text media type based
on filename, metadata and link text) will work with an unmodified htdig,
hence the cc. to htdig-general.


Andrew Daviel
TRIUMF

