Example of a BLOB for a word that is found in 2 URLs from 1 site:

Hex              Hex
offset  length   value
0       4        C      (points to the first URL of site with ID=1) --+
4       4        1      (site ID)                                     |
8       4        1C     (points to the end of file) ------------+     |
C       4        1      (URL ID) <------------------------------|-----+
10      2        1      (word count)                            |
12      2        1      (word position)                         |
14      4        2      (URL ID)                                |
18      2        1      (word count)                            |
1A      2        1      (word position)                         |
1C                      (end of file) <-------------------------+
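
A rough Perl sketch of reading this layout may help. It is only an illustration, not code from the aseek sources: the sub name parse_word_blob is made up for this example, and it assumes the integers are stored unsigned and little-endian (unpack templates "V" for 4 bytes and "v" for 2 bytes). If your index was written on a big-endian machine, "N" and "n" would be the templates to try instead.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one word BLOB into a list of
# { site_id, start, urls => [ { url_id, positions => [...] } ] }.
# Assumption: all integers are unsigned little-endian ("V" / "v").
sub parse_word_blob {
    my ($blob) = @_;

    # The first offset says where the URL section starts, so it is
    # also the size of the sites section (EOF offset included).
    my $urls_start = unpack('V', $blob);

    # Sites section: (offset, site ID) pairs, then one EOF offset.
    my @fields = unpack('V*', substr($blob, 0, $urls_start));
    my $eof    = pop @fields;

    my @sites;
    while (@fields) {
        my ($offset, $site_id) = splice(@fields, 0, 2);
        push @sites, { site_id => $site_id, start => $offset, urls => [] };
    }

    for my $i (0 .. $#sites) {
        my $site = $sites[$i];
        # A site's URLs end where the next site's URLs begin, or at EOF.
        my $end = $i < $#sites ? $sites[$i + 1]{start} : $eof;
        my $pos = $site->{start};
        while ($pos < $end) {
            # URL record: 4-byte URL ID, 2-byte word count, then positions.
            my ($url_id, $count) = unpack('V v', substr($blob, $pos, 6));
            my @positions = unpack("v$count", substr($blob, $pos + 6, 2 * $count));
            push @{ $site->{urls} }, { url_id => $url_id, positions => \@positions };
            $pos += 6 + 2 * $count;
        }
    }
    return \@sites;
}

# Build the example BLOB from the table above and parse it back.
my $blob = pack('V V V', 0x0C, 1, 0x1C)   # sites section
         . pack('V v v', 1, 1, 1)         # URL 1: one word at position 1
         . pack('V v v', 2, 1, 1);        # URL 2: one word at position 1

my $sites = parse_word_blob($blob);
for my $site (@$sites) {
    for my $url (@{ $site->{urls} }) {
        printf "site %d url %d positions %s\n",
            $site->{site_id}, $url->{url_id}, join(',', @{ $url->{positions} });
    }
}
```

When reading from a file rather than a database BLOB, call binmode on the filehandle and read the whole file into one scalar first. Note that unpack("c", ...) decodes only a single signed byte, so it will not return a 4-byte offset or ID.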

Alexander

----- Original Message -----
From: "Adonis El Fakih" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 27, 2001 1:28 PM
Subject: Re: [aseek-users] deciphering delta files


> Dear Alex,
>
> Can you give an example of what I should find in the first 4 bytes, the
> second, etc. for a word that has two results? This way I can understand what
> I should expect.
>
> Thanks in advance
> Adonis
>
>
> [EMAIL PROTECTED] wrote:
>
> > Hello, Adonis
>
> ----- Original Message -----
> From: "Adonis El Fakih" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, March 23, 2001 3:08 PM
> Subject: [aseek-users] deciphering delta files
>
>
> > Hi,
> >
> > I am in the process of writing a Perl program to decipher the delta
> > files, so I can selectively remove data in the database.
> >
> > In an earlier post I was told that the file formats are as follows:
> >
> > Information about URLs and sites where a word is encountered is kept in
> > either BLOB wordurl.urls or in binary file var//NNw/W
> > where W is wordurl.word_id and NN is word_id mod 100.
> > Format of BLOB:
> > Sites section
> > (4) Offset of URLs for site 0
> > (4) Site ID of site 0
> > (4) Offset of URLs for site 1
> > (4) Site ID of site 1
> > ...
> > (4) Offset of URLs for site Max
> > (4) Site ID of site Max
> > (4) Offset of EOF
> > URLs
> > (4) URL ID of URL 0 of site 0
> > (2) Word count
> > (2*word count) Sorted array of word positions
> > (4) URL ID of URL 1 of site 0
> > (2) Word count
> > (2*word count) Sorted array of word positions
> >
> > etc...
> >
> > Let us assume I want to read the first record: should I read 4 and 4 bytes
> > and analyze the second 4 bytes? What is the purpose of the 4-byte offset?
> > I do not understand the table above. Also, in Perl I am using unpack. What
> > format should I use? I tried
>
>     "Offset" is the offset of the URL info relative to the beginning of the
> file/BLOB.
>     The first four bytes are the offset of the URL block, which is also the
> size of the sites block.
>     You can read the entire sites block and process each site one by one.
> The offset of the beginning of a site's URL info is stored before its site
> ID; the offset of the end of its URL info is stored after the site ID.
>
> > unpack("c", $buf)
> >
> > where $buf was read using
> >
> > read BLOB, $off,4
> > read BLOB, $buf,4
> >
> > from the beginning of the file.  I just wanted to see what it contained.
> >
> > I appreciate any hints so I can read the site IDs and records associated
> > with a delta.
> >
> > thanks in advance
> > adonis
> >
>
>
