[freenet-dev] XML format

Matthew Toseland Mon, 24 Mar 2008 17:39:30 +0000

On Sunday 23 March 2008 09:53, you wrote:
> Hello, I would appreciate further details about XMLSpider and
> XMLLibrarian. I would like to know what index format is used for them.
> Is it http://wiki.freenetproject.org/AnotherFreenetIndexFormat?


Hi! Glad you could get back to us this year. Yes, it's roughly 
AnotherFreenetIndexFormat. You can see a live sample here, if you have a 
working node:

http://127.0.0.1:8888/USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/11/index.xml?type=text/plain

It's quite primitive at the moment: there is no support for distributed 
spidering, adjacent-word matching, page ranking, etc.

Basically, index.xml looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<main_index>
<prefix value="3"/>
<header>
<title>XMLSpider index</title>
<owner>Freenet</owner>
</header>
<keywords>
<subIndex key="000"/>
<subIndex key="001"/>
...

The subindexes are just prefixes of the hex version of the md5 of the 
keywords. They can vary in length, depending on how words are distributed 
across the keyspace; prefix is the maximum length. That variation is the only 
advantage of listing them explicitly, but IMHO it is of some value (and 
they'll be listed in the metadata manifest anyway, since we insert the index 
as a freesite).
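
To make the lookup concrete, here is a rough Python sketch of how a client 
might pick the subindex for a keyword. It is not the XMLLibrarian code. It 
assumes the keyword is hashed as UTF-8 with no normalisation, that the hex 
digest is lowercase, and that subindex files are named index_<key>.xml (which 
is only inferred from the index_022.xml example URL). The function names and 
the trimmed, closed-off copy of the sample are made up for illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

# The index.xml sample from this mail, trimmed to two subIndex entries and
# closed off so that it parses.
MAIN_INDEX = """<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<main_index>
<prefix value="3"/>
<header>
<title>XMLSpider index</title>
<owner>Freenet</owner>
</header>
<keywords>
<subIndex key="000"/>
<subIndex key="001"/>
</keywords>
</main_index>"""


def read_main_index(xml_text):
    """Return (maximum prefix length, list of subindex keys)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    max_prefix = int(root.find("prefix").get("value"))
    keys = [e.get("key") for e in root.iter("subIndex")]
    return max_prefix, keys


def subindex_for(word, max_prefix, keys):
    """Return the assumed file name of the subindex covering `word`, or None."""
    # Assumption: UTF-8 bytes of the word, lowercase hex digest.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    # Keys can be shorter than max_prefix, so try the longest prefix first.
    for n in range(max_prefix, 0, -1):
        if digest[:n] in keys:
            return "index_" + digest[:n] + ".xml"
    return None


max_prefix, keys = read_main_index(MAIN_INDEX)
print(max_prefix, keys)
print(subindex_for("metaphysical", max_prefix, keys))
```

With only two keys in the trimmed sample most words will map to None; against 
the full index.xml the same call should return one of the listed subindexes.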

Now fetch a subindex, for example:

http://127.0.0.1:8888/USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/11/index_022.xml?type=text/plain

It looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<sub_index>
<entries value="27"/>
<header>
<title>XMLSpider index</title>
</header>
<files>
<file id="14255" key="CHK@mAqvNEZFqAGq9VAvvHEeGNtY~ULqsz3ttUfgnNBHEIw,Pb53HeX-rhSRx~sFXpoi88S3fVK3gORqtQB7qsngRqE,AAICAAA/msg04335.html" title="RE: Bit Rate data collection"/>
<file id="11647" key="CHK@8fa1x3V~LQqoF4jP7mOtXslAvhqtXtDoy7oVUXWcAtI,i4BAfYxR-RVawCCmov91HhAfYZ593rOC8u0PMnBngmk,AAICAAA/spring2007.html" title="elenafilatova.com"/>
<file id="33666" key="CHK@ADEQnkRksATJ7tarSqS5wIEzD-~von1KtkpFuupeE20,pI7C-MTOl1tbXYTplQ5jOI3A1VWSiCj3dbsL2IBTxgY,AAIC--8" title="ADS's Stories"/>
<file id="34427" key="CHK@vMUR6MIxi~SeMuIZfza5MxiVouaXpmzBlNnFEqyTBNY,q1qVo2VhrZc1TCrMiTP2Nk-dHyBolh2W6aaeDYbFcz4,AAIC--8" title="Erotic_keys's Stories"/>
...
</files>
<keywords>
<word v="buyyouadrankremix">
<file id="14255">23510</file>
</word>
<word v="metaphysical">
<file id="11647">414</file>
<file id="33666">1877</file>
<file id="34427">259</file>
</word>
...
</word>
</keywords>
</sub_index>

As you can see, each file has an ID. We list the files which the sub-index 
refers to, with their IDs, at the top of the sub-index. Then we have a list of 
words, and for each word we list the IDs of the files which contain it, and 
the word offset within each file where it occurs.
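
As a rough illustration of reading that structure, here is a Python sketch 
that resolves a word to the files containing it. It is not the XMLLibrarian 
code. The sample is a trimmed copy of the sub-index in this mail with the 
entries and header elements left out and the CHK keys abbreviated to "CHK@..." 
for brevity; the lookup function is a made-up name. It assumes each <file> 
under a <word> holds a single integer, as in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sub-index sample; keys are abbreviated, not real.
SUB_INDEX = """<sub_index>
<files>
<file id="14255" key="CHK@.../msg04335.html" title="RE: Bit Rate data collection"/>
<file id="11647" key="CHK@.../spring2007.html" title="elenafilatova.com"/>
<file id="33666" key="CHK@..." title="ADS's Stories"/>
</files>
<keywords>
<word v="buyyouadrankremix">
<file id="14255">23510</file>
</word>
<word v="metaphysical">
<file id="11647">414</file>
<file id="33666">1877</file>
</word>
</keywords>
</sub_index>"""


def lookup(xml_text, word):
    """Return a list of (title, key, word offset) for files containing `word`."""
    root = ET.fromstring(xml_text)
    # The <files> section at the top maps each file ID to its key and title.
    files = {
        f.get("id"): (f.get("key"), f.get("title"))
        for f in root.find("files").iter("file")
    }
    hits = []
    for w in root.find("keywords").iter("word"):
        if w.get("v") != word:
            continue
        for f in w.iter("file"):
            key, title = files[f.get("id")]
            # Assumption: the element text is one integer word offset.
            hits.append((title, key, int(f.text)))
    return hits


for title, key, offset in lookup(SUB_INDEX, "metaphysical"):
    print(title, key, offset)
```

The two-step shape (build the ID table from <files>, then walk <keywords>) 
just mirrors the layout of the sub-index itself.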

(Sadly the owner of the wAnnA index seems to have disappeared, and this is the 
last version he inserted. Blocks are steadily disappearing as it isn't used 
much. It's a chicken and egg problem: there's no point adding more obvious 
ways to get to the spider until somebody maintains an index!)