On Friday 15 Aug 2003 14:10, Niklas Bergh wrote:

> linuximage.iso is much more easily indexable than linux-howto.html since
> the only information you would index is its name and maybe size and some
> other easily accessible properties (unless the filter is very
> sophisticated) while people would probably expect an html indexer to
> have a look at the content which is much more work to code.

OK, I see what you mean. That is basically what Google does for its image 
search. It associates the text of the link with the file that is being 
linked to, and uses that for classifying and indexing non-textual content.
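To make the idea concrete, here is a minimal sketch (all class and variable names are my own invention, not from any real indexer) of indexing a file like linuximage.iso by the anchor text of links pointing at it, using only the Python standard library:

```python
# Hypothetical sketch: index non-textual files by the anchor text of the
# links that point at them, as described above for Google image search.
from html.parser import HTMLParser
from collections import defaultdict

class AnchorIndexer(HTMLParser):
    """Builds an inverted index: anchor-text word -> set of link targets."""
    def __init__(self):
        super().__init__()
        self.index = defaultdict(set)
        self._href = None   # target of the <a> we are currently inside
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            for word in "".join(self._text).lower().split():
                self.index[word].add(self._href)
            self._href = None

page = '<p><a href="linuximage.iso">Linux install image</a></p>'
indexer = AnchorIndexer()
indexer.feed(page)
print(indexer.index["linux"])   # the .iso is now findable by "linux"
```

So the .iso itself is never opened; it becomes searchable purely through the words other pages use to link to it.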

> > The two concepts are actually not quite as similar as you may think.
> > They have very different priorities. Saying "find me documents about
> > x, y, z" means go
>
> I assume you mean "find me documents which have the strings x, y, z
> inside"?

Correct.

> > find files with this content in them, and order them in some
> > sensible way.
> >
> > Saying "find me files whose names are something like x y z" is quite
> > different. The indices would be very different and the indexing
> > mechanisms would be different. While you could use a Google-style
> > search engine for files, the fundamental difference is that you are
> > indexing on CONTENT rather than names or meta-data.
>
> Every search engine I know about searches for strings associated with a
> certain data item. Sometimes the strings are found in the meta-data,
> other times inside the actual file. Both are useful; it is just a matter
> of combining the hits in a sensible way, or allowing the user to specify
> what the query engine should do for him.

Actually, meta tags in HTML have been widely abused, and one of the main 
reasons Google works better is that it actually compares them to the 
actual viewable content in the page.

> > > I wouldn't want to link the concept of search indexes directly to
> > > freesites.
> >
> > No, of course not. But it would be a method by which Freesite owners
> > could help ensure that their site is indexed properly. Unfortunately,
> > as with everything else, this could be abused, because there is no
> > way to ensure that the index corresponds to the site. The only
> > reliable way to index content is to crawl it.
> >
> > Also note that Freesite owners would probably prefer a full crawl of
> > their site to take place, because it would help propagate their
> > content within the network; thus there is no real incentive for them
> > to create an index file (they get more benefit from there not being
> > one).
>
> If I don't want to run a crawler, I would very much prefer the site to
> provide a pre-generated and ready-to-use index. There being a
> pre-generated index of the site might even cause me to propagate pages
> through plain old fproxy requests, because I can now find them more
> easily.

Interesting idea, but you would find them just as easily if the site were 
fully crawled for the index. As an end user you don't want to crawl it 
yourself, but if somebody runs a site with a large and comprehensive index, 
that is a much more useful resource.

> > It is important to use the correct tool for the job. If we are trying
> > to come up with a Google-type search engine, then let's focus on
> > indexing HTML and text pages. Leave file sharing to the tools that
> > are designed for it.
>
> I don't agree. I would like to see a technology that could be used more
> widely. I would not like to see a Google-type search engine. I would like
> to see a standardized something that one could build a Google-type
> search engine *upon*.

OK, I see what you mean. As far as I can see, that core component would be a 
compact, scalable, sparsely searchable, shallow tree index.
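One possible reading of that core component (this is my own toy interpretation, not a spec) is a tree of depth one: terms bucketed by a short prefix, so a searcher only needs to fetch the single bucket a query term falls into rather than the whole index:

```python
# Toy sketch of a compact, shallow tree index: terms are grouped into
# buckets keyed by a 2-character prefix, so lookups touch one bucket only.
from collections import defaultdict

PREFIX_LEN = 2   # tree depth of one; bucket key is the term's first 2 chars

def bucket_key(term):
    return term[:PREFIX_LEN].lower()

def build_index(docs):
    """docs: {doc_name: text}. Returns {prefix: {term: set(doc_names)}}."""
    index = defaultdict(lambda: defaultdict(set))
    for name, text in docs.items():
        for term in set(text.lower().split()):
            index[bucket_key(term)][term].add(name)
    return index

def lookup(index, term):
    # Sparse search: only the one relevant bucket is consulted.
    return index.get(bucket_key(term), {}).get(term.lower(), set())

idx = build_index({
    "linux-howto.html": "installing linux kernels",
    "freenet-faq.html": "freenet frequently asked questions",
})
print(lookup(idx, "linux"))   # finds linux-howto.html without scanning idx
```

On a network like Freenet each bucket could live under its own key, which is what would make the index sparsely searchable: a query fetches only the buckets it needs.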

> > > This enables people to act as 'index
> > > publishers' and each and every user could choose whose indexes to
> > > 'search in'/'merge into their own local index' and trust, much like
> > > todays index pages....
> >
> > Sort of. The more segmented/limited the indices are, the less useful
> > they are. The index that knows about more content is going to be the
> > index that more people use. This pretty much sinks the concept of
> > "I'll only index these sites",
>
> TFE has more links than TFEE. Still I mostly use TFEE because I am not
> interested in some of the stuff that the TFE page provides.

That is beside the point. Once a clever index storage design is found, 
anybody who feels like it can run their own search engine site.
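The "choose whose indexes to merge" idea from the quoted message reduces to a very small operation. Here is a hedged sketch (the index format and names are invented for illustration) of a user folding only trusted publishers' indexes into a local one:

```python
# Sketch: a user merges published indexes, but only from publishers
# they trust -- mirroring the TFE/TFEE choice in the example above.
def merge_indexes(trusted_publishers):
    """trusted_publishers: iterable of {term: set(doc_keys)} indexes."""
    merged = {}
    for published in trusted_publishers:
        for term, docs in published.items():
            merged.setdefault(term, set()).update(docs)
    return merged

# Hypothetical published indexes from two index sites.
tfe  = {"freenet": {"site-a", "site-b"}, "warez": {"site-x"}}
tfee = {"freenet": {"site-b", "site-c"}}

local = merge_indexes([tfee])            # trust only TFEE
print(sorted(local["freenet"]))          # ['site-b', 'site-c']
print("warez" in local)                  # False: TFE's entries excluded
```

The trust decision lives entirely in which indexes get passed to the merge, so a bigger index wins only among the publishers a user has chosen to trust.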

Gordan
_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl