Re: [freenet-dev] Index format proposal

Jerome Flesch Fri, 02 Jun 2006 18:01:14 -0700

> > > The main changes I would make to the librarian
> > > format right now would be:
> > > - Support splitting. (This is relevant to file indexes)


I updated my format proposal at
http://wiki.freenetproject.org/AnotherFreenetIndexFormat to try to fit your
requirements, but I still need an explanation of one point: I don't really
understand why indexes need to handle file splitting. Doesn't the FCPv2 spec
say that the node does most of this work?


> > > - Include word indexes to allow for adjacent word searches. (This is
> > >   relevant to file indexes too, because you may want to search for
> > >   adjacent words in a title).

Added.
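
To illustrate what an adjacent word search needs from the index, here is a
minimal sketch in Python. It assumes the index can give, for each word, the
list of positions at which that word occurs in one document. The function
name and the sample data are hypothetical, not part of the proposal.

```python
def adjacent_matches(first_positions, second_positions):
    """Return each position p where the first word occurs at p
    and the second word occurs immediately after it, at p + 1."""
    second = set(second_positions)
    return [p for p in sorted(first_positions) if p + 1 in second]


# Hypothetical positions of two words within a single document.
positions = {"index": [3, 17, 40], "format": [4, 41, 90]}

# "index format" occurs as an adjacent pair starting at positions 3 and 40.
print(adjacent_matches(positions["index"], positions["format"]))
```

Without per-word positions, the index could only say that both words occur
somewhere in the same document.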

> > > - Maybe include some amount of metadata - functional (mime type), or
> > >   theoretical (category, dublin core...), or other (activelinks?).
> > > (This is definitely relevant to file indexes).

I agree that mime types would be more relevant than my previous "file" tag 
attribute called "type", so I replaced this attribute.

Regarding categories, I still think it would be better to leave them as an
option (e.g. an "option" tag inside the "file" tag), as it will not always be
possible for the spider to find good categories, and some lazy users will
probably never specify categories.
But if you think it's really important, I can make it an attribute of the
"file" tag.

Regarding Dublin Core metadata, as binary files won't have it, I think it's
better to put it in options too.

Regarding "activelinks", what exactly do you mean?


> > > - Include the filename in the index. Possibly using negative word
> > >   indexes to indicate "in the filename" words; it must be possible to
> > >   distinguish between matches in the page title and matches in the
> > >   content. (This is also relevant to both web page indexes and file
> > >   indexes, though especially to the latter).
> >
By filename, do you mean document titles?


> > I will try to update my format proposal as soon as possible (probably
> > this evening) to allow this.
>
> I'm sorry, the above was incomprehensible because of an unforeseen
> double entendre on "word indexes". The next version of the librarian
> index format was to be something like this:
>
> word 32 (17,23,99) 33 (11,-2)
>
> I.e. the word occurs in URIs number 32 and 33. Each of these has a list
> of integers. The integers are the index, or less confusingly position, of
> the word, within the stream of words that is the document. This is a
> counter - the first word is 0, the second word is 1 etc. (Excluding any
> non-text content e.g. html tags). We would use a negative number to
> indicate that the word was not in the content but in the title. (That's
> not implemented in Librarian, I just came up with it).
>
> Does this make the second and last points above make sense?
>
It should be ok :)
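
To check my understanding of the line format quoted above, here is a minimal
parsing sketch in Python. The function name and the returned structure are my
own assumptions, not something Librarian implements. Negative numbers are kept
as they are and simply grouped under "title", because the thread does not say
how a negative value maps to a position within the title.

```python
import re

# One entry is a URI number followed by a parenthesised list of integers,
# e.g. "32 (17,23,99)".
ENTRY = re.compile(r"(\d+)\s*\(([^)]*)\)")


def parse_index_line(line):
    """Parse a line such as 'word 32 (17,23,99) 33 (11,-2)'.

    Returns (word, entries), where entries maps each URI number to a dict
    with the non-negative positions under "content" and the negative
    numbers under "title" (assumed layout, for illustration only).
    """
    word, _, rest = line.strip().partition(" ")
    entries = {}
    for uri_number, position_list in ENTRY.findall(rest):
        positions = [int(p) for p in position_list.split(",") if p.strip()]
        entries[int(uri_number)] = {
            "content": [p for p in positions if p >= 0],
            "title": [p for p in positions if p < 0],
        }
    return word, entries


word, entries = parse_index_line("word 32 (17,23,99) 33 (11,-2)")
print(word, entries)
```

For the example line, URI 32 gets three content positions, and URI 33 gets one
content position plus one title marker.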


> > > and the
> > > length of the file if it is audio or video. Both are perfectly
> > > reasonable extensions IMHO. If we are going to support metadata we
> > > should support a range of metadata; we will need support for a
> > > category, (probably tied to a specific site), at least, and this is a
> > > very woolly and arbitrary thing.
> >
> > I agree that it's a woolly and arbitrary thing, and I think most
> > users won't even spend time defining categories for their files (that's
> > why I've put this as an option).
> > In fact, in the Fuqid replacement, I thought of letting the user define
> > the category for a given file himself. But to be usable, this would mean
> > allowing the user to change the category of many files at the same time.
>
> Well, for freesites, I expect categories to be quite important - but the
> category would likely be assigned by a trusted author, such as TFE, or
> at least it would be using a standard scale... or it could just be a
> small amount of free text included by the site owner himself. But I know
> CofE was thinking along the lines of providing a database of sites with
> his own descriptions for them which could then be aggregated...
>
Ok


> > > An explicit aim of your index format is to be able to index the
> > > contents of text-based files by words. This is a good thing, but if you
> > > are going to do that, then you should define a format, (preferably with
> > > some of the details of splitting indexes worked out), and make
> > > Librarian and Spider use it.
> >
> > Ok, so if my next format proposal is right for everybody, I'll try to
> > adapt Librarian and Spider.
>
> Right. Thanks for your thoroughness, I hope that it doesn't result in
> your not having time to ship the primary finished product (the GUI
> searching/sharing tool itself).
>
As I say below, I will at first only do basic work on Librarian and Spider.
I don't think I will spend too much time on it. And if I see that it's
starting to require too much time, I may set it aside for a while and come
back to it later (probably after the summer).


> > Regarding Spider, at first it would only be a basic version /
> > adaptation, indexing only HTML files. As I will need to create a set of
> > filters to extract metadata and words for the Fuqid replacement, I could
> > reuse them later in Spider.
>
> Right. I see no reason why your filesharing tool cannot link directly
> into freesites if they haven't been excluded from the search.
>
I agree. It will only require knowing where to find the browser, which
shouldn't be a problem :)


> > > Metadata can be shown next to matches, or it can be used
> > > to narrow down searches.
> >
> > For Librarian ? Ok, I don't think it will be a real problem.
>
> Yes, for google-style searches. It might be worth thinking about for
> filesharing type searches too.
>
Ok.


-- 
Jerome Flesch.
