Hi,
I am in the process of writing a Perl program to decipher the delta files, so that I can
selectively remove data from the database.
In an earlier post I was told that the file formats are as follows:
Information about URLs and sites where word is encountered is kept in
either BLOB wordurl.urls or in binary file var//NNw/W
where W is wordurl.word_id and NN is word_id mod 100.
Format of BLOB:
Sites section
  (4) Offset of URLs for site 0
  (4) Site ID of site 0
  (4) Offset of URLs for site 1
  (4) Site ID of site 1
  ...
  (4) Offset of URLs for site Max
  (4) Site ID of site Max
  (4) Offset of EOF
URLs
  (4) URL ID of URL 0 of site 0
  (2) Word count
  (2*word count) Sorted array of word positions
  (4) URL ID of URL 1 of site 0
  (2) Word count
  (2*word count) Sorted array of word positions
  etc...
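To make my question concrete, here is how I currently read the table, as a minimal Perl
sketch. Nothing in it is confirmed by the earlier post: I am assuming that all integers are
unsigned little-endian ("V" for 4 bytes, "v" for 2 bytes), that the offsets are byte
offsets counted from the start of the data, and that the first offset therefore marks the
end of the sites section. The name parse_word_blob is my own, hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one BLOB (or the contents of one word file)
# laid out as in the table above.
# ASSUMPTIONS: unsigned little-endian integers, offsets counted in bytes
# from the start of the data.
sub parse_word_blob {
    my ($data) = @_;

    # If the first offset points just past the sites section, it also
    # tells us how many (offset, site ID) pairs precede "Offset of EOF".
    my $first   = unpack('V', $data);
    my $n_sites = ($first - 4) / 8;

    my @pairs = unpack('V' . (2 * $n_sites), $data);    # offset, site ID, ...
    my $eof   = unpack('V', substr($data, 8 * $n_sites, 4));

    my @sites;
    for my $i (0 .. $n_sites - 1) {
        my ($pos, $site_id) = @pairs[2 * $i, 2 * $i + 1];
        # The URLs of this site run up to the next site's offset (or EOF).
        my $end = $i < $n_sites - 1 ? $pairs[2 * $i + 2] : $eof;
        my @urls;
        while ($pos < $end) {
            my ($url_id, $count) = unpack('V v', substr($data, $pos, 6));
            my @positions = unpack("v$count", substr($data, $pos + 6, 2 * $count));
            push @urls, { url_id => $url_id, positions => \@positions };
            $pos += 6 + 2 * $count;
        }
        push @sites, { site_id => $site_id, urls => \@urls };
    }
    return \@sites;
}

# Made-up sample in the assumed layout: one site (ID 7) with one URL
# (ID 42) in which the word occurs at positions 3 and 9.
my $sample = pack('V V V', 12, 7, 22) . pack('V v v2', 42, 2, 3, 9);

for my $site (@{ parse_word_blob($sample) }) {
    for my $url (@{ $site->{urls} }) {
        print "site $site->{site_id} url $url->{url_id}: @{ $url->{positions} }\n";
    }
}
# prints "site 7 url 42: 3 9"
```

For a real file I would read the whole thing into $data in binary mode first. If the files
turn out to be big-endian, "V" and "v" would have to become "N" and "n".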
Suppose I want to read the first record. Should I read 4 bytes and then another 4 bytes,
and analyze the second 4 bytes? What is the purpose of the 4-byte offset? I do not
understand the table above. Also, in Perl I am using unpack. What format should I use? I tried
unpack("c", $buf)
where $buf was read using
read BLOB, $off,4
read BLOB, $buf,4
from the beginning of the file. I just wanted to see what it contained.
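As far as I can tell from the Perl documentation, "c" unpacks only a single signed byte,
so it cannot show a whole 4-byte field. The small sketch below compares the templates I
am considering on a made-up 4-byte value. It does not tell me which byte order the real
files use, which is part of my question.

```perl
use strict;
use warnings;

# Made-up 4-byte buffer: the number 300 stored little-endian,
# i.e. the bytes 0x2C 0x01 0x00 0x00.
my $buf = pack('V', 300);

my $as_char   = unpack('c',  $buf);   # first byte only, signed
my $as_le32   = unpack('V',  $buf);   # unsigned 32-bit, little-endian
my $as_be32   = unpack('N',  $buf);   # unsigned 32-bit, big-endian
my $as_hexstr = unpack('H*', $buf);   # raw bytes as hex digits

print "c=$as_char V=$as_le32 N=$as_be32 hex=$as_hexstr\n";
# prints "c=44 V=300 N=738263040 hex=2c010000"
```

The "H*" form is probably the closest to what I wanted, since it shows the raw bytes
without assuming anything about the layout.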
I would appreciate any hints so I can read the site IDs and the records associated with a delta.
Thanks in advance,
adonis