On Sunday 23 March 2008 09:53, you wrote:
> Hello, I would appreciate further details about XMLSpider and
> XMLLibrarian. I would like to know what index format is used for them.
> Is it http://wiki.freenetproject.org/AnotherFreenetIndexFormat?

Hi! Glad you could get back to us this year.

Yes, it's roughly AnotherFreenetIndexFormat. You can see a live sample here, if you have a working node:

http://127.0.0.1:8888/USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/11/index.xml?type=text/plain

It's quite primitive at the moment: there is no support for distributed spidering, adjacent-word matching, page ranking, etc.

The basic format: index.xml looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <main_index>
      <prefix value="3"/>
      <header>
        <title>XMLSpider index</title>
        <owner>Freenet</owner>
      </header>
      <keywords>
        <subIndex key="000"/>
        <subIndex key="001"/>
        ...

The subindex keys are just prefixes of the hex form of the MD5 of each keyword. They can vary in length, depending on how words are distributed across the keyspace; the prefix value is the maximum length. That variable length is the only advantage of listing them explicitly, but IMHO it is of some value (and they'll be listed in the metadata manifest anyway, since we insert the index as a freesite).
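
To make the prefix scheme concrete, here is a minimal Python sketch of how a client might pick the subindex for a keyword. It is only an illustration, not the XMLLibrarian code: the function names are hypothetical, and lowercasing the word and encoding it as UTF-8 before hashing are assumptions the format description above does not confirm. The index_<key>.xml naming follows the sample URLs.

```python
import hashlib


def keyword_md5(word: str) -> str:
    """Hex MD5 of a keyword.

    Assumption: the word is lowercased and UTF-8 encoded before hashing.
    """
    return hashlib.md5(word.lower().encode("utf-8")).hexdigest()


def find_subindex(word, subindex_keys, max_prefix):
    """Return the listed subindex key that prefixes the word's MD5, or None.

    subindex_keys: the key values of the <subIndex> elements in index.xml.
    max_prefix:    the value of the <prefix> element (maximum key length).
    """
    digest = keyword_md5(word)
    # Keys may be shorter than max_prefix, so try the longest prefix first.
    for length in range(max_prefix, 0, -1):
        candidate = digest[:length]
        if candidate in subindex_keys:
            return candidate
    return None


def subindex_filename(key: str) -> str:
    """File name of a subindex within the freesite, e.g. index_022.xml."""
    return f"index_{key}.xml"
```

Trying the longest prefix first means a client does not need to know in advance how long each listed key is, only the maximum from the prefix element.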

Now fetch a subindex, for example:

http://127.0.0.1:8888/USK@5hH~39FtjA7A9~VXWtBKI~prUDTuJZURudDG0xFn3KA,GDgRGt5f6xqbmo-WraQtU54x4H~871Sho9Hz6hC-0RA,AQACAAE/Search/11/index_022.xml?type=text/plain

It looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <sub_index>
      <entries value="27"/>
      <header>
        <title>XMLSpider index</title>
      </header>
      <files>
        <file id="14255" key="CHK@mAqvNEZFqAGq9VAvvHEeGNtY~ULqsz3ttUfgnNBHEIw,Pb53HeX-rhSRx~sFXpoi88S3fVK3gORqtQB7qsngRqE,AAICAAA/msg04335.html" title="RE: Bit Rate data collection"/>
        <file id="11647" key="CHK@8fa1x3V~LQqoF4jP7mOtXslAvhqtXtDoy7oVUXWcAtI,i4BAfYxR-RVawCCmov91HhAfYZ593rOC8u0PMnBngmk,AAICAAA/spring2007.html" title="elenafilatova.com"/>
        <file id="33666" key="CHK@ADEQnkRksATJ7tarSqS5wIEzD-~von1KtkpFuupeE20,pI7C-MTOl1tbXYTplQ5jOI3A1VWSiCj3dbsL2IBTxgY,AAIC--8" title="ADS's Stories"/>
        <file id="34427" key="CHK@vMUR6MIxi~SeMuIZfza5MxiVouaXpmzBlNnFEqyTBNY,q1qVo2VhrZc1TCrMiTP2Nk-dHyBolh2W6aaeDYbFcz4,AAIC--8" title="Erotic_keys's Stories"/>
        ...
      </files>
      <keywords>
        <word v="buyyouadrankremix">
          <file id="14255">23510</file>
        </word>
        <word v="metaphysical">
          <file id="11647">414</file>
          <file id="33666">1877</file>
          <file id="34427">259</file>
        </word>
        ...
        </word>
      </keywords>
    </sub_index>

As you can see, each file has an ID. The file IDs that the sub-index refers to are listed at the top of the sub-index, followed by a list of words; for each word we list the IDs of the files that contain it, and the word offset within the file where it occurs.

(Sadly the owner of the wAnnA index seems to have disappeared, and this is the last version he inserted. Blocks are steadily disappearing as it isn't used much. It's a chicken-and-egg problem: there's no point adding more obvious ways to get to the spider until somebody maintains an index!)
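
For illustration, resolving a word against a fetched sub-index could be sketched in Python roughly as follows. This is not the XMLLibrarian implementation. The lookup function and the SAMPLE constant are hypothetical names, the CHK keys in the sample are shortened placeholders rather than real keys, and the sketch assumes each file element under a word holds a single integer offset, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Trimmed sub-index in the layout shown above.
# The key values are placeholders, not real CHKs.
SAMPLE = """
<sub_index>
  <entries value="2"/>
  <header><title>XMLSpider index</title></header>
  <files>
    <file id="11647" key="CHK@placeholder-key-1/spring2007.html" title="elenafilatova.com"/>
    <file id="33666" key="CHK@placeholder-key-2" title="ADS's Stories"/>
  </files>
  <keywords>
    <word v="metaphysical">
      <file id="11647">414</file>
      <file id="33666">1877</file>
    </word>
  </keywords>
</sub_index>
""".strip()


def lookup(sub_index_xml: str, word: str):
    """Return a list of (key, title, offset) tuples for files containing word."""
    root = ET.fromstring(sub_index_xml)

    # Map file ID -> (key, title) from the <files> section at the top.
    files = {
        f.get("id"): (f.get("key"), f.get("title"))
        for f in root.find("files").findall("file")
    }

    hits = []
    for w in root.find("keywords").findall("word"):
        if w.get("v") != word:
            continue
        for ref in w.findall("file"):
            key, title = files[ref.get("id")]
            # Assumption: the element text is one integer word offset.
            hits.append((key, title, int(ref.text.strip())))
    return hits
```

Calling lookup(SAMPLE, "metaphysical") returns one tuple per matching file, joining each file ID under the word back to its key and title from the files section.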