Hi all,

I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
0.5.0-200611082313) to index a collection of ARC files generated by a web
crawl using the Heritrix web crawler (Version 1.4.0).

When I check the metadata tag on the Wera front-end, the following list of
tags is displayed:

ARC Identifier
URL
Time of Archival
Last Modified Time
Mime-Type
File Status
Content Checksum
HTTP Header

When I click on the explain link in the NutchWax front-end, the following
list of tags is displayed:

Segment
Digest
Date
ARCDate
Encoding
Collection
ARCName
ARCOffset
ContentLength
PrimaryType
subType
URL
Title
Boost

Is there a full list of the metadata fields that NutchWax/Nutch creates when
indexing? I'm particularly interested in tags relating to the actual content
of each page, e.g. content type, description, etc.

When searching, does NutchWax/Nutch search across such tags, or just across
the parsed text of each page for occurrences of keywords?

Any help you can provide would be greatly appreciated!

Shay
