On Friday 06 November 2009 22:44:52 Evan Daniel wrote:
> On Fri, Nov 6, 2009 at 1:45 PM, Matthew Toseland
> <[email protected]> wrote:
> > Now, with Library searches, currently we show many progress bars, and quite
> > a bit of text. We should really show a simple progress bar, but there are
> > several complications: We can search across multiple indexes, and in each
> > index we can have multiple subindex fetches (or index fetches) going on.
> > The main index fetches should be satisfied from cache, and the files should
> > usually be small, so normally each word fetch under the same index fetches
> > the main index simultaneously, and then fetches the subindex in one stage,
> > so combining the bars for multiple subindexes within an index should be
> > pretty easy - we probably want two stages but we could simplify it to one
> > stage. However, we can search multiple indexes at once, which can be at
> > different stages. We might want to keep two progress bars in this
> > (currently rare) case. Plus, the index parsing stage for each subindex can
> > be very slow, and can occur while another subindex is still fetching. The
> > combining and formatting stages are a bit faster but can still take a few
> > seconds. We can resolve this by some fudging ... Having said that, I dunno
> > how dumb users actually are - is it such a big deal that searching for
> > multiple words in multiple indexes means multiple things happening at once?
> > :)
>
> I think most of our users are smart enough to figure it out :)
>
> There's the other question, though: why does it take 5-10 *seconds* to
> go from having the index downloaded to displaying results? It's only
> a few tens of MB of data, and it's in a fairly simple format. Several
> seconds to parse that on a multi-core, multi-GHz machine is abysmal.
>
> Does Library really need such a huge memory footprint? Freenet can
> decode the file fine without more memory, but Library regularly OOMs
> on large indexes with the default memory limits, even though it isn't
> actually using that large a piece of the file.

Well, the extreme case is the wAnna? indexes. Many of these are tens of
megabytes. Try searching for "stupid" in wanna, which gives index_f.xml.
Library appears (from stack traces; I have not done real profiling) to spend
most of its time in LibrarianHandler.startElement, and a lot of that time
here:
L169: Integer.parseInt(attrs.getValue("wordCount"));
This of course throws because there is no such attribute. Every time!
This is fixed in Library version 3, which speeds up parsing considerably.
However, it does still use a lot of memory when doing a multi-word search in
wAnnA.
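
To illustrate the parseInt problem above, here is a minimal sketch of reading an optional integer attribute without relying on an exception for the common case. The class and method names (WordCountAttr, parseOrDefault) are made up for this example, and this is not necessarily how Library version 3 fixes it; the only thing taken from the code under discussion is the attrs.getValue("wordCount") call.

```java
/** Hypothetical helper, for illustration only; not part of Library. */
final class WordCountAttr {
    /**
     * Parses an optional integer attribute value.
     * A missing attribute arrives here as null, so we return the fallback
     * directly instead of letting Integer.parseInt throw on every element.
     */
    static int parseOrDefault(String raw, int fallback) {
        if (raw == null) {
            return fallback; // attribute absent: the cheap, common path
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback; // attribute present but malformed: the rare path
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault(null, -1));
        System.out.println(parseOrDefault("42", -1));
    }
}
```

In a handler like the one above, the call site would become something like parseOrDefault(attrs.getValue("wordCount"), -1), where -1 is an assumed "unknown" marker.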
AFAICS the memory use is ultimately because the format requires us to build a
hashtable of all the <file> entries, which takes up a lot of space for such
oversized files.
Fortunately the spider does not generate these any more: freenetindex has no
crazy sized subindexes.
infinity0's btree format improves on this considerably: There may be more
fetches but the size of each fetch is much more predictable.
If I stick to freenetindex, it doesn't OOM so much, even with long phrase
searches. "international day for the rights of the workers", "If we do do SoC
this year we will probably take fewer students" etc. proceed repeatably
without long periods in Full GC.
I want to remove wanna from the default search indexes. It will be reinstated
if and when there is a new version that doesn't include 40MB subindexes. Any
objections?
However, freenetindex is much smaller than wanna ... we really need to fix
XMLSpider (so it works with current code), and to make it generate indexes
without blocking for days on end (by the periodic linear rewriting proposal).
The fact that I can search freenetindex for the following monster without
OOMing demonstrates how small it is:
"Review third party code, help with porting FMS to java if necessary, ship
0.7.0 with a bundled, working java port of FMS, probably with a web interface
based on Worst."
"with" is not regarded as a stop word here!
However, the search fails, probably because "0.7.0" is mishandled. The first
and last parts of the phrase work fine on their own.
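
As a rough sketch of what handling "0.7.0" could look like, the following tokenizer keeps dotted version numbers as a single term and does no stop-word filtering, so "with" survives. This is only a guess at the kind of fix needed: Library's real tokenizer is not shown here, and the class name PhraseTokenizer and the regular expression are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical phrase tokenizer, for illustration only; not Library's code. */
final class PhraseTokenizer {
    // A term is a run of letters/digits, optionally followed by ".digits"
    // groups, so "0.7.0" stays one term while the "." in "Worst." is dropped.
    private static final Pattern TERM =
            Pattern.compile("[\\p{L}\\p{N}]+(?:\\.\\p{N}+)*");

    static List<String> tokenize(String phrase) {
        List<String> terms = new ArrayList<>();
        Matcher m = TERM.matcher(phrase.toLowerCase(Locale.ROOT));
        while (m.find()) {
            terms.add(m.group()); // no stop-word filtering: "with" is kept
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ship 0.7.0 with a bundled, working java port"));
    }
}
```

Whether the index itself stores "0.7.0" as one term is a separate question; the query side and the spider would have to agree on the same rule.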
>
> Evan Daniel
