Example of a BLOB for a word that is found in 2 URLs from 1 site:

Hex             Hex
offset  length  value
0       4       C   (points to the first URL of the site with ID=1) --+
4       4       1   (site ID)                                         |
8       4       1C  (points to the end of file) ------------------+   |
C       4       1   (url ID) <------------------------------------|---+
10      2       1   (word count)                                  |
12      2       1   (word position)                               |
14      4       2   (url ID)                                      |
18      2       1   (word count)                                  |
1A      2       1   (word position)                               |
1C                  (end of file) <-------------------------------+
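The layout in the table can be checked with a short script. This is only a sketch in Python, not code from ASPseek: it assumes the integers are stored little-endian (the byte order is not stated in this thread), and `EXAMPLE` and `decode_blob` are names made up for illustration. For reference, Python's `"<I"` and `"<H"` correspond to the Perl `unpack` templates `"V"` and `"v"`, whereas `"c"` reads only a single signed char.

```python
import struct

# ASSUMPTION: little-endian integers (e.g. x86); the thread does not say.
# The 0x1C-byte example from the table, field by field.
EXAMPLE = struct.pack(
    "<IIIIHHIHH",
    0x0C,  # offset 0x00: offset of the first URL of site 1
    1,     # offset 0x04: site ID
    0x1C,  # offset 0x08: offset of the end of file
    1,     # offset 0x0C: url ID
    1,     # offset 0x10: word count
    1,     # offset 0x12: word position
    2,     # offset 0x14: url ID
    1,     # offset 0x18: word count
    1,     # offset 0x1A: word position
)


def decode_blob(blob):
    """Return {site_id: [(url_id, [positions, ...]), ...]} (hypothetical helper)."""
    # The first offset is also the size of the sites section.
    (sites_size,) = struct.unpack_from("<I", blob, 0)
    n_sites = (sites_size - 4) // 8  # 8 bytes per site plus the final EOF offset
    result = {}
    for i in range(n_sites):
        # Start offset, site ID, and the next offset (next site or EOF).
        start, site_id, end = struct.unpack_from("<III", blob, 8 * i)
        urls = []
        pos = start
        while pos < end:
            url_id, count = struct.unpack_from("<IH", blob, pos)
            positions = list(struct.unpack_from("<%dH" % count, blob, pos + 6))
            urls.append((url_id, positions))
            pos += 6 + 2 * count
        result[site_id] = urls
    return result


print(decode_blob(EXAMPLE))  # {1: [(1, [1]), (2, [1])]}
```

If the files were written on a big-endian machine, the `<` in each format string would need to become `>`.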
Alexander
----- Original Message -----
From: "Adonis El Fakih" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 27, 2001 1:28 PM
Subject: Re: [aseek-users] deciphering delta files
> dear Alex,
>
> Can you give an example of what I should find in the first 4 bytes, the
> second 4 bytes, etc., for a word that has two results? This way I can
> understand what I should expect.
>
> Thanks in advance
> Adonis
>
>
> [EMAIL PROTECTED] wrote:
>
> > Hello, Adonis
>
> ----- Original Message -----
> From: "Adonis El Fakih" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, March 23, 2001 3:08 PM
> Subject: [aseek-users] deciphering delta files
>
>
> > Hi,
> >
> > I am in the process of writing a Perl program to decipher the delta
> > files, so I can selectively remove data in the database.
> >
> > In an earlier post I was told that the file formats are as follows:
> >
> > Information about URLs and sites where word is encountered is kept in
> > either BLOB wordurl.urls or in binary file var//NNw/W
> > where W is wordurl.word_id and NN is word_id mod 100.
> > Format of BLOB:
> >   Sites section
> >     (4) Offset of URLs for site 0
> >     (4) Site ID of site 0
> >     (4) Offset of URLs for site 1
> >     (4) Site ID of site 1
> >     ...
> >     (4) Offset of URLs for site Max
> >     (4) Site ID of site Max
> >     (4) Offset of EOF
> >   URLs
> >     (4) URL ID of URL 0 of site 0
> >     (2) Word count
> >     (2*word count) Sorted array of word positions
> >     (4) URL ID of URL 1 of site 0
> >     (2) Word count
> >     (2*word count) Sorted array of word positions
> >
> > etc...
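Since the goal is to remove data selectively, the BLOB would have to be written back in the same layout. The following is a rough Python sketch of that direction, not ASPseek's own code: `encode_blob` is a hypothetical name, and little-endian byte order is an assumption the thread does not confirm.

```python
import struct


def encode_blob(sites):
    """Serialize [(site_id, [(url_id, [positions, ...]), ...]), ...].

    ASSUMPTION: little-endian integers; hypothetical helper for illustration.
    """
    # Sites section: 8 bytes per site plus the 4-byte EOF offset.
    header_size = 8 * len(sites) + 4
    header = b""
    body = b""
    for site_id, urls in sites:
        # Offset of this site's first URL, relative to the start of the BLOB.
        header += struct.pack("<II", header_size + len(body), site_id)
        for url_id, positions in urls:
            positions = sorted(positions)  # the format wants a sorted array
            body += struct.pack(
                "<IH%dH" % len(positions), url_id, len(positions), *positions
            )
    # The final entry of the sites section is the offset of EOF.
    header += struct.pack("<I", header_size + len(body))
    return header + body


blob = encode_blob([(1, [(1, [1]), (2, [1])])])
print(len(blob))  # 28
```

Dropping a URL or a whole site then amounts to filtering the input list before calling the function, because all offsets are recomputed from scratch.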
> >
> > Let us assume I want to read the first record: should I read 4 bytes and
> > then 4 more, and analyze the second 4 bytes? What is the purpose of the
> > 4-byte offset? I do not understand the table above. Also, in Perl I am
> > using unpack. What format should I use? I tried
>
> "Offset" is the offset of the URL info relative to the beginning of the
> file/BLOB.
> The first four bytes are both the offset of the URL block and the size of
> the sites block.
> You can read the entire sites block and process each site one by one.
> The offset of the beginning of a site's URL info is stored before its site
> ID; the offset of the end of that URL info is stored after the site ID.
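Reading just the sites block, as described, might look like the sketch below. It is written in Python for illustration and assumes little-endian integers, which this thread does not confirm; `read_sites` is a hypothetical helper name.

```python
import io
import struct


def read_sites(stream):
    """Return [(site_id, start_offset, end_offset), ...] from a binary stream.

    ASSUMPTION: little-endian integers; hypothetical helper for illustration.
    """
    first_raw = stream.read(4)
    # The first offset doubles as the size of the whole sites block.
    (sites_size,) = struct.unpack("<I", first_raw)
    block = first_raw + stream.read(sites_size - 4)
    n_sites = (sites_size - 4) // 8
    sites = []
    for i in range(n_sites):
        # The start offset sits before the site ID, the end offset after it.
        start, site_id, end = struct.unpack_from("<III", block, 8 * i)
        sites.append((site_id, start, end))
    return sites


# Sites block only: two sites (IDs 5 and 9) with 8 bytes of URL info each.
header = struct.pack("<IIIII", 20, 5, 28, 9, 36)
print(read_sites(io.BytesIO(header)))  # [(5, 20, 28), (9, 28, 36)]
```

Each `(start, end)` pair tells you which byte range to read to get the URL records of that site.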
>
> > unpack("c", $buf)
> >
> > where $buf was the read using
> >
> > read BLOB, $off,4
> > read BLOB, $buf,4
> >
> > from the beginning of the file. I just wanted to see what it contained.
> >
> > I appreciate any hints so I can read the site IDs and the records
> > associated with a delta.
> >
> > thanks in advance
> > adonis
> >
>
>