Hello, Adonis
----- Original Message -----
From: "Adonis El Fakih" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, March 23, 2001 3:08 PM
Subject: [aseek-users] deciphering delta files
> Hi,
>
> I am in the process of writing a Perl program to decipher the delta
> files, so I can selectively remove data in the database.
>
> In an earlier post I was told that the file formats are as follows:
>
> Information about URLs and sites where word is encountered is kept in
> either BLOB wordurl.urls or in binary file var//NNw/W
> where W is wordurl.word_id and NN is word_id mod 100.
> Format of BLOB:
> Sites section
> (4) Offset of URLs for site 0
> (4) Site ID of site 0
> (4) Offset of URLs for site 1
> (4) Site ID of site 1
> ...
> (4) Offset of URLs for site Max
> (4) Site ID of site Max
> (4) Offset of EOF
> URLs
> (4) URL ID of URL 0 of site 0
> (2) Word count
> (2*word count) Sorted array of word positions
> (4) URL ID of URL 1 of site 0
> (2) Word count
> (2*word count) Sorted array of word positions
>
> etc...
>
> Let us assume I want to read the first record: should I read 4 and 4 and
> analyze the second 4 bytes? What is the purpose of the 4-byte offset? I do
> not understand the table above. Also, in Perl I am using unpack. What format
> should I use? I tried
"Offset" is the offset of the URL info relative to the beginning of the
file/BLOB.
The first four bytes are the offset of the URL block, which is also the
size of the sites block, so you can read the entire sites block and process
each site one by one. The offset of the beginning of a site's URL info is
stored before its site ID; the offset of the end of its URL info is stored
after its site ID.
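
As for the unpack format: "c" reads only a single signed byte, so it cannot
return a 4-byte value. Something like the sketch below may be closer to what
you need. It assumes the integers are stored little-endian ("V"), as they
would be on x86; if your files come from a big-endian machine, "N" would be
the template to try instead. The name parse_sites is only an illustration,
not something from the aseek sources.

```perl
use strict;
use warnings;

# Parse the sites section of a word file/BLOB already read into $data.
# ASSUMPTION: 32-bit little-endian integers ("V"); use "N" if big-endian.
# Returns a list of { site_id, start, end } hashes, where start/end are the
# offsets of that site's URL info relative to the beginning of $data.
sub parse_sites {
    my ($data) = @_;
    die "data too short\n" if length($data) < 4;

    # The first offset points at the URL block, so it equals the size
    # of the sites block (including the trailing "offset of EOF").
    my $sites_size = unpack('V', $data);
    die "implausible sites block size $sites_size\n"
        if $sites_size < 4
        || ($sites_size - 4) % 8
        || $sites_size > length($data);

    # Layout: offset0, id0, offset1, id1, ..., offsetEOF
    my @f = unpack('V*', substr($data, 0, $sites_size));

    my @sites;
    for (my $i = 0; $i + 2 < @f; $i += 2) {
        push @sites, {
            start   => $f[$i],        # offset stored before the site ID
            site_id => $f[$i + 1],
            end     => $f[$i + 2],    # offset stored after the site ID
        };
    }
    return @sites;
}
```

Remember to binmode the filehandle before reading, otherwise the bytes may
be altered on some platforms.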
> unpack("c", $buf)
>
> where $buf was read using
>
> read BLOB, $off,4
> read BLOB, $buf,4
>
> from the beginning of the file. I just wanted to see what it contained.
>
> I appreciate any hints so I can read the site IDs and records associated
> with a delta.
>
> thanks in advance
> adonis
>
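
Once you know where one site's URL info begins and ends (the offsets stored
before and after its site ID), the URL records in the layout you quoted
could be decoded roughly as follows. This is only a sketch: it assumes
little-endian integers ("V" for the 4-byte URL ID, "v" for the 2-byte word
count and positions), and parse_urls is an illustrative name, not a function
from aseek.

```perl
use strict;
use warnings;

# Decode the URL records stored in bytes [$start, $end) of $data.
# ASSUMPTION: little-endian integers ("V" = 32-bit, "v" = 16-bit).
# Each record: (4) URL ID, (2) word count, (2 * word count) positions.
sub parse_urls {
    my ($data, $start, $end) = @_;
    die "range $start..$end is outside the data\n"
        if $start > $end || $end > length($data);

    my @urls;
    my $pos = $start;
    while ($pos + 6 <= $end) {
        my ($url_id, $count) = unpack('V v', substr($data, $pos, 6));
        $pos += 6;

        die "record for URL $url_id runs past offset $end\n"
            if $pos + 2 * $count > $end;

        # Sorted array of word positions, 2 bytes each.
        my @positions = unpack('v*', substr($data, $pos, 2 * $count));
        $pos += 2 * $count;

        push @urls, { url_id => $url_id, positions => \@positions };
    }
    return @urls;
}
```

Calling parse_urls($data, $start, $end) with the two offsets surrounding a
site ID returns one hash per URL of that site, each holding the URL ID and
its list of word positions.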