According to ronald:
> when htdig exports results from an index as textformat it generates two
> files. The files look like this :
> 
> file1:
> 0     u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software   a:0    
> m:936027636     s:373   h:      h:      l:940510479     L:2     I:373   
> d:http://www.htdig.org/www.htdig.orght://Dig Search Software (yes, the developers
> use it)ht://DigParent Directory   A:

First field:    doc ID
u:              URL of doc
t:              doc title
a:              doc state (refer to source)
m:              date/time last modified, sec since 1970-01-01 00:00:00 UTC
s:              doc size in bytes
h:              doc head (excerpt of first max_head_length bytes of doc)
h: (2nd)        meta description contents
                (this 2nd h is a bug - it really should be a unique value
                 like D or something)
l:              date/time document was indexed (sec since 1970)
L:              no. of links doc has to other docs
I:              "docImageSize" - has nothing to do with images, but seems to
                contain document size, and may be cumulative in some
                circumstances - can anyone else make any sense of this?
d:              link descriptions - text of links to this doc, ^A separated
A:              anchor names (bookmarks) in doc, ^A separated

All fields are tab (^I) separated.  Sub-fields of d: and A: use a ^A separator.
The doc head field has all runs of white space (space, tab, newline, etc.)
collapsed to single spaces.
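
For what it's worth, here is a rough Python sketch of how one might read a
record in this format. It is not part of htdig; the function name and the
dictionary keys (url, title, head, meta_description, etc.) are my own
inventions, and it assumes each record sits on a single line exactly as
described above (tab separated, ^A between sub-fields of d: and A:).

```python
from datetime import datetime, timezone

# Hypothetical readable names for the single-letter field tags.
FIELD_NAMES = {
    "u": "url", "t": "title", "a": "state", "m": "modified", "s": "size",
    "l": "indexed", "L": "link_count", "I": "doc_image_size",
    "d": "link_descriptions", "A": "anchors",
}


def parse_doc_record(line):
    """Parse one tab-separated document record into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = {"doc_id": int(fields[0])}
    head_seen = False
    for field in fields[1:]:
        # Split on the first colon only, so URLs in the value stay intact.
        tag, sep, value = field.partition(":")
        if not sep:
            continue  # not a tag:value field (e.g. an empty trailing field)
        if tag == "h":
            # The first h: is the doc head, the second is the meta description.
            record["meta_description" if head_seen else "head"] = value
            head_seen = True
        elif tag in ("m", "l"):
            # Seconds since 1970-01-01 00:00:00 UTC.
            record[FIELD_NAMES[tag]] = datetime.fromtimestamp(
                int(value), tz=timezone.utc)
        elif tag in ("s", "L", "I"):
            record[FIELD_NAMES[tag]] = int(value)
        elif tag in ("d", "A"):
            # ^A (0x01) separated sub-fields; an empty value means no entries.
            record[FIELD_NAMES[tag]] = value.split("\x01") if value else []
        elif tag in FIELD_NAMES:
            record[FIELD_NAMES[tag]] = value
    return record
```

The a: (doc state) value is left as a string, since its meaning has to be
looked up in the source anyway.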

> file2:

This is db.wordlist...

> 01oct99       i:115   l:0     w:100998        c:2
> 01oct99       i:116   l:0     w:100998        c:2
> 01oct99       i:45    l:6     w:100381        c:2
> 01oct99       i:46    l:0     w:100998        c:2
> 02aug1999     i:48    l:361   w:639   a:2
> 02jun1999     i:50    l:262   w:1382  c:2     a:2
> 02mar1999     i:53    l:378   w:622   a:2
> 02may1999     i:51    l:280   w:1349  c:2     a:2

First field:    indexed word (lower case)
i:              doc ID (to match up with records from above)
l:              location of word in doc (0-1000, i.e. tenth of a percent units)
w:              weight of word in searches
c:              no. of occurrences of word in document, if > 1
a:              index into "A:" list above, to indicate which anchor name,
                if any, preceded this word
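
Along the same lines, a rough Python sketch for the word list records. Again
this is only an illustration: the function name and dictionary keys are made
up, and it assumes tab-separated fields as in the document file. Since c: is
only written when the count is greater than 1, the sketch defaults it to 1;
a missing a: is treated as "no anchor". I have not checked whether the a:
index is counted from 0 or 1, so it is passed through unchanged.

```python
# Hypothetical readable names for the word list field tags.
WORD_FIELD_NAMES = {
    "i": "doc_id", "l": "location", "w": "weight", "c": "count", "a": "anchor",
}


def parse_word_record(line):
    """Parse one tab-separated word list record into a dict."""
    fields = line.rstrip("\n").split("\t")
    # c: is omitted for a single occurrence, a: when no anchor precedes the word.
    record = {"word": fields[0], "count": 1, "anchor": None}
    for field in fields[1:]:
        tag, sep, value = field.partition(":")
        if sep and tag in WORD_FIELD_NAMES:
            record[WORD_FIELD_NAMES[tag]] = int(value)
    return record


def location_fraction(record):
    """Convert the 0-1000 location into a 0.0-1.0 fraction of the document."""
    return record["location"] / 1000
```

With the sample above, a word with l:262 sits roughly a quarter of the way
into the document.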

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
