On Friday 15 August 2003 00:35, Newsbite wrote:
> On Thursday 14 August 2003 22:44, Newsbite wrote:
> > Anyway, what I was thinking was that there are JavaScripts (and probably
> > other things as well :-) that can emulate a search engine. The database is
> > stored as part of the JavaScript on a web page, and is thus readily (and
> > very quickly) able to show the index of links for the word(s) that were
> > requested.
>
> Actually, you would want to keep the data and the code completely separate.
> The index would be enormous. You would also have to implement some very
> unorthodox indexing methods to create an index that would scale to any
> extent with an underlying storage medium such as Freenet.
>
> Most standard indexing mechanisms used today fall apart when applied to
> Freenet because of the nature of access. When you try to make things
> future-proof and scalable up to, say, 3bn pages, it all becomes infeasible.
>
> Well, agreed: this is another drawback I was aware of (forgot to mention
> it, though). The system would only work well up to medium-sized amounts of
> data; hundreds and thousands of links are feasible, but millions and
> billions would not be, in all likelihood. It's worth noting that it's
> exactly in this range that the current state of Freenet lies, and since
> it's an intermediate solution until a real search engine is created, it
> would do. (Thereafter, it could be reduced to a fast way to get (only)
> other meta-indexes, for instance.)

Unfortunately, because of the encryption, the only possible search engine 
mechanism would be of the type we are discussing here, i.e. 
crawl->index->upload. This means that scalability with regard to the 
Freenet architecture is essential for any such method that is expected to 
keep working for any length of time in the future.

> I am not sure I understand what you mean by 'unorthodox indexing methods';
> while not extremely efficient as yet, the normal crawling system that is
> used today would suffice, methinks. In effect, the underlying system would
> not differ that much from the TFE and the like; only the way in which it is
> presented (and queried) would be different. Where the TFE is like one
> giant page full of links, my concept would be far more Google-like
> (at least in appearance). I mean: just a little window or field to type
> your search words into, click 'search', and get a bunch of links (retrieved
> by the browser from the JavaScript itself) which contain the keywords.

OK, let's say that the index would take up 100 MB. If you think that 
downloading a 100 MB HTML file (or XML, or CSV if they are separate files) 
into a browser using JavaScript will work, then you have some interesting 
misconceptions about what modern browsers can handle sensibly.

1) If you give IE6 or Mozilla (I'm guessing that you are aiming for DOM-ish 
browsers only) a 100 MB file to process with JavaScript, it is going to go 
away for a very long time.

2) If you make it in such a way that you have to download a 100 MB file to 
perform a query, then that's a non-starter anyway, as it can take hours and 
has to deal with redundant FEC - again, it could be difficult.

Therefore, you would need a way of segmenting the index so that you could 
search it sparsely, and only download a very small fraction of it, based on 
the search terms.
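A minimal sketch of such sparse segmentation, assuming a hash-partitioned index (all names and numbers here are hypothetical, not any real Freenet API): both the indexer and the searcher hash each term with the same function, so a query only needs to fetch the few shard files its terms map to.

```javascript
// Sketch: map search terms to index shard files so a query only
// fetches a handful of small files, never the whole index.
// All names and constants are illustrative assumptions.

const NUM_SHARDS = 10000; // e.g. 10,000 files of roughly 100 KB each

// Simple deterministic string hash (djb2 variant); the indexer and
// the searcher must use the same function so a term always lands in
// the same shard.
function hashTerm(term) {
  let h = 5381;
  for (const ch of term.toLowerCase()) {
    h = ((h * 33) + ch.charCodeAt(0)) >>> 0; // keep it 32-bit unsigned
  }
  return h % NUM_SHARDS;
}

// Given the search terms, return the (few) shard files to download.
function shardsForQuery(terms) {
  const seen = new Set(terms.map(hashTerm));
  return [...seen].map(n => `index-shard-${n}.json`);
}
```

With this scheme a three-word query touches at most three shard files, so the download cost tracks the query size rather than the index size.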

> > This is not ideal, of course, but it would be an improvement on the
> > current system.
>
> It would - if it could be done efficiently.
>
> I already have the code of the JavaScript itself. It could be done
> efficiently (with the restriction of handling vast amounts of data).

Precisely my point - how are you going to restrict the "vast amount of data" 
issue?

If you have to download one 100 MB file, that is bad. If you have to 
download a hundred 1 MB files, that is even worse because of latency issues.

If you could arrange it so that you only had to download 10 of the 10,000 
100 KB files, it might just work.

> > Once again, I've told my idea on IRC, and it was met rather positively,
> > but with the remarks (which had occurred to me also ;-):
> >
> > It still needs someone to insert/retrieve the database.
>
> That is not a big problem. Any fairly standard web crawler would work for
> indexing the pages. Uploading the database is also not an issue. The
> problem is in the database storage format. It is difficult to come up with
> a method that would yield good results and acceptable response times with a
> high-latency network.
>
> I think the last part is not correct. I'm talking about JavaScript running
> on the client side (in the browser). High latency would thus not be an
> issue once the 'Google-like' page (with the JavaScript/database in it) has
> been retrieved successfully.

You cannot just embed the ENTIRE database into the HTML page. That page could 
be 100s of MB in size, growing to 1,000s or 1,000,000s of MB in size as the 
number of indexable documents grows. It is not scalable.

> > Which is true, but that could be said of the current TFE system too.
> > Besides, it can't be that difficult to largely automate the process.
>
> Automating the process would be dead easy. Coming up with a storage format
> that is efficient is difficult. Another difficulty lies in implementing an
> index format which is compact yet useful. There is no point in creating an
> index that would take up as much space as all the data it is trying to
> index. That would be bad, as the index would effectively double the
> required storage capacity of the network.
>
> Agreed. The data would not have to be duplicated, however. Only keywords
> (or those short descriptions that you can already insert today) and the
> (active?) links themselves are needed; the content itself is not really
> necessary.

Indeed, but the search result quality will suffer if you limit the search to 
meta-tags only. For summaries, you could use JS to pre-load and parse the 
relevant page upon a match. The good thing is that this would effectively 
pre-cache all the pages you are likely to visit. The bad thing is that it is 
rather network-intensive, and hence would be very slow.
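The pre-caching side of that idea can be kept sane by only queueing the best few matches. A sketch, assuming hypothetical result objects with a `key` and a relevance `score` (nothing here is a real fproxy interface):

```javascript
// Sketch of bounded pre-caching: after a query matches, pick only the
// top few result pages to retrieve in the background, so the likely
// click targets are cached without flooding the network.
// Result shape and constant are illustrative assumptions.

const MAX_PREFETCH = 5;

// results: [{ key: 'freenet:...', score: number }, ...]
// Returns the keys to prefetch, best matches first, duplicates removed.
function prefetchList(results) {
  const seen = new Set();
  return results
    .slice()                               // don't mutate the caller's array
    .sort((a, b) => b.score - a.score)     // best match first
    .filter(r => !seen.has(r.key) && seen.add(r.key)) // dedupe by key
    .map(r => r.key)
    .slice(0, MAX_PREFETCH);               // cap the network load
}
```

The actual background retrieval would then walk this list one key at a time, which keeps the slowness bounded rather than proportional to the full result set.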

> > 2)more user friendly
>
> Maybe. You could do it all in JavaScript, as you said. This would, however,
> put most people off because of the filter warnings. A better way to do it
> would be to create a Fred plug-in applet that would perform this function.
> It would probably be faster, and it would work around the problem of filter
> warnings. It would also be "easier" to trust it if it were distributed with
> the node library, rather than just a random page from an inherently
> untrustworthy medium.
>
> Indeed, filter warnings put people off; that's why I made that suggestion
> at the end. As for your plug-in idea: it may have some value, but alas,
> I'm an (IT) manager and freelance writer, not a developer. My coding
> experience is very limited; some HTML, PHP and JavaScript, and that's all.
> So I'm afraid somebody else would have to implement your suggestion. :-)

I think Matthew suggested the same thing when he said the way to do it may be 
to build it into fproxy.

> > 4) the moral issue is greatly
> > reduced; because (links to) 'illegal' things such as copyrighted material
> > (or worse) would only be visible when you actively seek/request it
>
> That is not necessarily strictly true. It depends on how much of a specific
> type of content there is. Any automated search engine has such issues. For
> example, how many times have you entered a completely normal, mundane and
> geeky search string into Google/Altavista/Other search engine and found
> that totally unrelated porn pages crop up even on the first results page,
> because some porn site web master put the terms on his page so that it
> would come up for pretty much ANY query you typed in?
>
> True, but it rates the links according to the relevance of the keywords
> that were put in the search box. It's a rather simple system, easily
> bypassed, but more complex rating mechanisms could be used (as Google
> does). It will never be fully bulletproof, of course, but nothing will be,
> I think. Anyway, the apparent in-your-face visibility of links to illegal
> material would be gone.

Well, we can hope that is the case, anyway.

> > It would require, however, that at least for this particular script (or
> > for some particular page), the JavaScript filter would have to let it
> > pass without much fuss.
>
> Not really. Just leave it to the user to decide whether they trust the
> page. If they do, they can click the "proceed anyway" button. The correct
> way around this would have to be the plug-in applet.
>
> Ah, thanks for the hint. I had the impression the filter actually blocked
> the JavaScript, but if I understand you correctly, it can be passed just
> by clicking on it?

Yes, IIRC, you do get the link to "do it anyway".

> Are you sure it does not hamper JavaScript?

You should be OK if you keep all your JS on one page. That way you don't get 
the same issues to deal with in your JS code for the index file(s).

> > Not ideal, perhaps, but until a truly good, scalable, anonymous search
> > engine is created to work on Freenet, it would beat everything that is
> > currently available on Freenet.
>
> There are many, many more technical difficulties involved in that than you
> may realize, especially in coming up with a good, scalable index format.
>
> In itself, it's rather simple, really. It is, however, not infinitely
> scalable, that is true. But I really think that in the short to mid-long
> term it would be a hit.

It would have to be scalable, i.e. its speed would have to degrade 
logarithmically with the amount of content, and the size of the index itself 
would have to be such that it occupies only a tiny fraction of the space 
occupied by the content it indexes. 10% of the size of the indexed content 
would be bad. Get it down to 1%, and it might just work.
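A quick back-of-the-envelope check of that 1% budget (the numbers are purely illustrative):

```javascript
// Sketch: given a total content size and an index budget as a fraction
// of it, how big may the index be and how many fixed-size shard files
// does that work out to? All inputs here are illustrative assumptions.

function indexBudget(contentBytes, fraction, shardBytes) {
  const indexBytes = Math.floor(contentBytes * fraction);
  return {
    indexBytes,                                   // total index budget
    shardCount: Math.ceil(indexBytes / shardBytes), // files of shardBytes each
  };
}

// 10 GB of indexed content, 1% budget, 100 KB shards:
const b = indexBudget(10 * 1024 ** 3, 0.01, 100 * 1024);
// b.indexBytes ≈ 107 MB, b.shardCount === 1049
```

So at 1% even a 10 GB corpus yields an index of only on the order of a thousand 100 KB shards, of which a query would touch just a few.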

The problem, again, is in the index format.

Gordan
_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
