Hi,

I am in the process of writing a Perl program to decipher the delta files, so that I can 
selectively remove data from the database.

In an earlier post I was told that the file formats are as follows:

Information about the URLs and sites where a word is encountered is kept in
either the BLOB wordurl.urls or the binary file var//NNw/W,
where W is wordurl.word_id and NN is word_id mod 100.
Format of BLOB:
Sites section
(4) Offset of URLs for site 0
(4) Site ID of site 0
(4) Offset of URLs for site 1
(4) Site ID of site 1
...
(4) Offset of URLs for site Max
(4) Site ID of site Max
(4) Offset of EOF
URLs
(4) URL ID of URL 0 of site 0
(2) Word count
(2*word count) Sorted array of word positions
(4) URL ID of URL 1 of site 0
(2) Word count
(2*word count) Sorted array of word positions

etc...
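
My best guess at how this layout would be walked is sketched below. It is only
a reading of the table, not something I have confirmed against a real file. I am
assuming unsigned integers in native byte order (pack templates L and S), offsets
counted in bytes from the start of the data, and a URL section that begins right
after the sites section. The name parse_blob and the sample data are my own.

```perl
use strict;
use warnings;

# Hypothetical helper (my own name): decode one BLOB, or the contents of one
# word file, held in the string $data.
# Assumptions: unsigned integers in native byte order ('L' = 32 bit,
# 'S' = 16 bit) and offsets counted in bytes from the start of the data.
sub parse_blob {
    my ($data) = @_;
    return [] if length($data) < 4;

    # If the URLs start right after the sites section, the first offset
    # equals the size of that section: 8 bytes per site plus the EOF offset.
    my $first   = unpack 'L', $data;
    my $n_sites = int(($first - 4) / 8);

    my @sites;
    for my $i (0 .. $n_sites - 1) {
        my ($offset, $site_id) = unpack 'L L', substr($data, 8 * $i, 8);
        push @sites, { site_id => $site_id, offset => $offset, urls => [] };
    }
    my $eof = unpack 'L', substr($data, 8 * $n_sites, 4);

    # The URLs of site i run from its offset to the next site's offset,
    # or to the EOF offset for the last site.
    for my $i (0 .. $#sites) {
        my $pos = $sites[$i]{offset};
        my $end = $i < $#sites ? $sites[$i + 1]{offset} : $eof;
        while ($pos + 6 <= $end) {
            my ($url_id, $count) = unpack 'L S', substr($data, $pos, 6);
            my @positions = unpack "S$count", substr($data, $pos + 6, 2 * $count);
            push @{ $sites[$i]{urls} },
                { url_id => $url_id, positions => \@positions };
            $pos += 6 + 2 * $count;
        }
    }
    return \@sites;
}

# Made-up sample: site 7 has URL 100 with the word at positions 3 and 9.
my $sample = pack('L L L', 12, 7, 22) . pack('L S S2', 100, 2, 3, 9);
for my $site (@{ parse_blob($sample) }) {
    for my $url (@{ $site->{urls} }) {
        # prints "site 7 url 100: 3 9"
        print "site $site->{site_id} url $url->{url_id}: @{ $url->{positions} }\n";
    }
}
```

If this reading is right, the offsets are what let a reader jump straight to one
site's URLs without scanning the records of the sites before it. If the files
turn out to use a fixed byte order, V and v (little-endian) or N and n
(big-endian) would replace L and S.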

Suppose I want to read the first record. Should I read 4 bytes, then another 4 bytes, 
and look only at the second value (the site ID)? What is the purpose of the 4-byte 
offset? I do not understand the table above.  Also, I am using unpack in Perl.  What 
template should I use? I tried 
unpack("c", $buf) 

where $buf was filled by 

read BLOB, $off, 4;
read BLOB, $buf, 4;

starting from the beginning of the file.  I just wanted to see what it contained. 
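
According to perldoc -f pack, "c" decodes a single signed byte, so my attempt
could only ever show the first byte of a 4-byte field. Something along the lines
of the sketch below seems closer. The byte order is a guess on my part: I am
assuming unsigned 32-bit values in native order (template L). The helper name
read_first_pair and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read the first (offset, site ID) pair from an
# already opened filehandle positioned at the start of the data.
sub read_first_pair {
    my ($fh) = @_;
    binmode $fh;    # no newline or encoding translation on binary data
    my $got = read($fh, my $buf, 8);
    die "short read\n" unless defined $got && $got == 8;
    # 'L' = unsigned 32-bit integer in native byte order (an assumption).
    my ($offset, $site_id) = unpack 'L L', $buf;
    return ($offset, $site_id);
}

# Made-up sample with a single site (ID 42) and no URLs, read through an
# in-memory filehandle instead of a real word file.
my $first_pair_sample = pack 'L L L', 12, 42, 12;
open my $sample_fh, '<', \$first_pair_sample
    or die "cannot open in-memory file: $!";
my ($first_offset, $first_site_id) = read_first_pair($sample_fh);
close $sample_fh;
# prints "offset=12 site_id=42"
printf "offset=%d site_id=%d\n", $first_offset, $first_site_id;
```

Reading both fields in one 8-byte read and unpacking them together keeps the two
values in step, which the separate 4-byte reads above do not guarantee if one of
them comes up short.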

I would appreciate any hints on how to read the site IDs and the records associated with a delta.

Thanks in advance,
adonis
