I should say that this is not that much of a problem. In our experiments, SnakeT clusters 200+ snippets taken from ~16 different search engines in ~2-3 seconds.
Simply accessing 100-200 snippets can also be quite costly. In most deployments the document text will not fit in memory, so fetching 100-200 snippets requires 100-200 disk seeks, or around 1-2 seconds.
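The 1-2 second figure follows directly from the seek arithmetic: one seek per uncached snippet at a typical ~10 ms average seek time for a commodity disk (the seek time here is an assumed round number, not a measured value):

```python
SEEK_TIME_S = 0.010  # assumed average disk seek time (~10 ms), illustrative

def snippet_fetch_cost(num_snippets: int, seek_time_s: float = SEEK_TIME_S) -> float:
    """Estimated wall-clock seconds to read num_snippets uncached
    snippets, assuming one random disk seek per snippet."""
    return num_snippets * seek_time_s

print(snippet_fetch_cost(100))  # roughly 1 second
print(snippet_fetch_cost(200))  # roughly 2 seconds
```

With an OS page cache or an SSD the per-snippet cost drops by orders of magnitude, which is why the problem mostly shows up on cold, disk-bound indexes.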
Yep, this is a performance issue I also pointed out. It of course becomes a problem in a heavily loaded search system -- most intranets simply never reach that scale.
Anyway, Doug, I don't see any faster method of accessing the snippets required to build clusters -- can you think of any? I was once thinking about adding topic-based keywords to each document, similarly to what I believe was once used in the Northern Light search engine, but this is just the opposite of what search results clustering should be (query-specific clustering rather than a static metadata view).
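To make the contrast concrete, here is a minimal sketch (all document texts, IDs, and keyword lists below are invented for illustration; this is not SnakeT's or Northern Light's actual algorithm). Precomputed per-document keywords are fixed regardless of the query, while query-specific cluster labels emerge from the snippets actually returned for that query:

```python
from collections import Counter
import re

# Toy corpus -- contents are made up for illustration only.
docs = {
    "d1": "java lucene indexing performance tuning",
    "d2": "lucene query parser performance tips",
    "d3": "gardening tips for spring",
}

# (a) Static metadata view: topic keywords precomputed once per document,
# identical for every query (the Northern-Light-style approach).
static_keywords = {"d1": ["java"], "d2": ["lucene"], "d3": ["gardening"]}

def query_specific_labels(hit_ids, k=2):
    """(b) Query-specific view: candidate cluster labels are derived from
    the snippets returned for *this* query, so they change per query."""
    terms = Counter()
    for doc_id in hit_ids:
        terms.update(re.findall(r"\w+", docs[doc_id]))
    return {t for t, _ in terms.most_common(k)}

# For a query that matched d1 and d2, shared terms such as "lucene" and
# "performance" surface as labels -- a grouping no fixed per-document
# keyword list could have anticipated.
print(query_specific_labels(["d1", "d2"]))
```

The point of the sketch is only that (b) needs the snippet text at query time, which is exactly why the disk-seek cost above cannot be avoided by precomputation.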
Dawid
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
