Re: [htdig-dev] binary document-database format questions

Geoff Hutchison Wed, 04 Sep 2002 09:02:16 -0700

On Wed, 4 Sep 2002, Walantis Giosis wrote:

> The ID bytes for length information (excerpt length, document size,
> URL length) vary. Say we have a document size of less than 100h bytes.
> Then the ID byte has the value 44h for that information. The size
> needs only one byte. If the size exceeds 100h bytes (it needs two or
> more bytes) then the ID byte has the value 84h. What's the logic
> behind this? Is it only to determine the byte count for the size? At
> the moment I've handled it using a switch/case statement.


Hans-Peter Nilsson rewrote the Serialize/Deserialize routines very
carefully, so I can't speak authoritatively.  I think he was trying to
save as much space as possible. AFAICT, the ID byte carries a marker
indicating the size of the variable that follows it.

Take a look at htcommon/DocumentRef.cc::Serialize() to see the code.
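
As a rough illustration of the "marker" idea, here is a minimal sketch.
It is not the code from DocumentRef.cc. The bit layout is an assumption
inferred only from the 44h and 84h values quoted above: low six bits
as the field ID, bit 6 set for a one-byte value, bit 7 set for a
two-byte value, and neither bit set for a full-width value (assumed
here to be four bytes). All names (kCharSizeBit, encode_tag,
decode_tag, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMED layout, inferred from the 0x44 / 0x84 values in the question.
// Check htcommon/DocumentRef.cc for the real definitions.
constexpr std::uint8_t kCharSizeBit  = 0x40;  // value stored in 1 byte
constexpr std::uint8_t kShortSizeBit = 0x80;  // value stored in 2 bytes
constexpr std::uint8_t kFieldIdMask  = 0x3F;  // low 6 bits: field ID

struct TaggedField {
    std::uint8_t id;     // which field follows (e.g. 4 in the question)
    std::size_t  width;  // how many bytes the value occupies
};

// Pick the smallest width that holds `value` and fold it into the tag.
std::uint8_t encode_tag(std::uint8_t id, std::uint32_t value) {
    assert(id <= kFieldIdMask);  // ID must not collide with marker bits
    if (value <= 0xFFu)   return static_cast<std::uint8_t>(id | kCharSizeBit);
    if (value <= 0xFFFFu) return static_cast<std::uint8_t>(id | kShortSizeBit);
    return id;  // no marker bit: full-width value (assumed 4 bytes)
}

// Split a tag byte back into the field ID and the value width.
TaggedField decode_tag(std::uint8_t tag) {
    TaggedField f;
    f.id = static_cast<std::uint8_t>(tag & kFieldIdMask);
    if (tag & kCharSizeBit)       f.width = 1;
    else if (tag & kShortSizeBit) f.width = 2;
    else                          f.width = 4;
    return f;
}
```

Under these assumptions, a reader would mask the tag instead of
switching on every possible ID byte: decode_tag(0x44) and
decode_tag(0x84) both give field ID 4, with widths 1 and 2.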

> And why is the document size information stored twice in the database?

They should be different. See htcommon/DocumentRef.[cc,h], which deal
with the document DB records. In particular, there's the text size of
the document and, optionally, the size of the document including all
of its images.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/



