On Friday 06 November 2009 22:44:52 Evan Daniel wrote:
> On Fri, Nov 6, 2009 at 1:45 PM, Matthew Toseland
> <[email protected]> wrote:
> > Now, with Library searches, currently we show many progress bars, and quite 
> > a bit of text. We should really show a simple progress bar, but there are 
> > several complications: We can search across multiple indexes, and in each 
> > index we can have multiple subindex fetches (or index fetches) going on. 
> > The main index fetches should be satisfied from cache, and the files should 
> > usually be small, so normally each word fetch under the same index fetches 
> > the main index simultaneously, and then fetches the subindex in one stage, 
> > so combining the bars for multiple subindexes within an index should be 
> > pretty easy - we probably want two stages but we could simplify it to one 
> > stage. However, we can search multiple indexes at once, which can be at 
> > different stages. We might want to keep two progress bars in this 
> > (currently rare) case. Plus, the index parsing stage for each subindex can 
> > be very slow, and can occur while another subindex is still fetching. The 
> > combining and formatting stages are a bit faster but can still take a few 
> > seconds. We can resolve this by some fudging ... Having said that, I dunno 
> > how dumb users actually are - is it such a big deal that searching for 
> > multiple words in multiple indexes means multiple things happening at once? 
> > :)
> 
> I think most of our users are smart enough to figure it out :)
> 
> There's the other question, though: why does it take 5-10 *seconds* to
> go from having the index downloaded to displaying results?  It's only
> a few tens of MB of data, and it's in a fairly simple format.  Several
> seconds to parse that on a multi-core, multi-GHz machine is abysmal.
> 
> Does Library really need such a huge memory footprint?  Freenet can
> decode the file fine without more memory, but Library regularly OOMs
> on large indexes with the default memory limits, even though it isn't
> actually using that large a piece of the file.

Well, the extreme case is the wAnna? indexes: many of the subindexes are tens of 
megabytes. Search for "stupid" in wanna, which resolves to index_f.xml. Library 
appears (from stack traces; I haven't done real profiling) to spend most of its 
time in LibrarianHandler.startElement.

It spends a lot of its time here:
L169: Integer.parseInt(attrs.getValue("wordCount"));

Which of course throws a NumberFormatException, because there is no such 
attribute. On every single element!

This is fixed in Library version 3, which speeds up parsing considerably.
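For reference, a minimal sketch (not Library's actual code) of the kind of guard that avoids paying for an exception on every element when the attribute is absent:

```java
public class WordCountParse {
    // Parse an optional numeric attribute without a NumberFormatException
    // being thrown and caught for every element that lacks it.
    public static int parseIntOrDefault(String value, int fallback) {
        if (value == null) return fallback; // missing attribute: cheap path
        try {
            return Integer.parseInt(value);
        } catch (NumberFormatException e) {
            return fallback; // malformed attribute
        }
    }

    public static void main(String[] args) {
        System.out.println(parseIntOrDefault("42", -1)); // present: 42
        System.out.println(parseIntOrDefault(null, -1)); // absent: -1, no throw
    }
}
```

Throwing and catching an exception per element is vastly more expensive than a null check, which is consistent with the time showing up in startElement.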

However, it does still use a lot of memory when doing a multi-word search in 
wAnnA. 

AFAICS this is ultimately because the format requires us to build a hash table of 
all the <file> entries, which just takes up a lot of space for such oversized 
subindexes. Fortunately the spider does not generate these any more: 
freenetindex has no crazy-sized subindexes.
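To illustrate the shape of the problem (class and field names here are hypothetical, not Library's): because per-word hit lists refer to file ids declared elsewhere in the subindex, a single parsing pass has to keep the whole id-to-URI table in memory, even when the query only touches a few words:

```java
import java.util.HashMap;
import java.util.Map;

public class FileTableSketch {
    // Hypothetical sketch: memory use grows with the number of <file>
    // entries in the subindex, not with the number of words queried.
    public static Map<String, String> buildFileTable(String[][] fileEntries) {
        Map<String, String> filesById = new HashMap<>();
        for (String[] entry : fileEntries) {
            filesById.put(entry[0], entry[1]); // id -> URI, every entry retained
        }
        return filesById;
    }

    public static void main(String[] args) {
        String[][] entries = {
            {"0", "CHK@aaaa.../page1.html"},
            {"1", "CHK@bbbb.../page2.html"},
            {"2", "CHK@cccc.../page3.html"},
        };
        // Even a one-word search must hold all entries (in wAnnA's case,
        // a 40MB subindex worth) to resolve its hit list.
        System.out.println(buildFileTable(entries).size());
    }
}
```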

infinity0's btree format improves on this considerably: there may be more 
fetches, but the size of each fetch is much more predictable.
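A rough way to see the predictability claim (the branching factor here is illustrative, not infinity0's actual parameter): a btree lookup costs about ceil(log_B(N)) fixed-size node fetches, rather than one fetch whose size grows with the whole subindex:

```java
public class BTreeFetchCount {
    // Node fetches needed for one lookup in a btree with the given
    // branching factor: repeatedly divide until a single node remains.
    public static int fetchesNeeded(long entries, int branching) {
        int fetches = 0;
        long remaining = entries;
        while (remaining > 1) {
            remaining = (remaining + branching - 1) / branching; // ceiling divide
            fetches++;
        }
        return fetches;
    }

    public static void main(String[] args) {
        // A million-entry index with branching factor 1024 needs only
        // two node fetches, each of bounded size.
        System.out.println(fetchesNeeded(1_000_000L, 1024));
    }
}
```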

If I stick to freenetindex, it doesn't OOM so much, even with long phrase 
searches: "international day for the rights of the workers", "If we do do SoC 
this year we will probably take fewer students", etc. proceed repeatably 
without long pauses in full GC.

I want to remove wanna from the default search indexes. It will be reinstated 
if and when there is a new version that doesn't include 40MB subindexes. Any 
objections?

However, freenetindex is much smaller than wanna ... we really need to fix 
XMLSpider (so it works with current code), and make it generate indexes 
without blocking for days on end (via the periodic linear rewriting proposal). 
The fact that I can search freenetindex for the following monster without 
OOMing demonstrates that it copes:

"Review third party code, help with porting FMS to java if necessary, ship 
0.7.0 with a bundled, working java port of FMS, probably with a web interface 
based on Worst."

"with" is not regarded as a stop word here!

However, the full search fails, probably because "0.7.0" is mishandled. The 
first and last parts work fine.
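My guess at the failure mode (I haven't confirmed this is what Library's tokenizer does): a splitter that treats any non-word character as a separator shreds "0.7.0" into bare digits, so the phrase as typed can never match what was indexed:

```java
import java.util.Arrays;

public class TokenizeGuess {
    // Hypothetical tokenizer: split on runs of non-word characters,
    // the simplest thing an indexer might do.
    public static String[] tokenize(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        // "0.7.0" falls apart into "0", "7", "0" -- no single token
        // corresponds to the version string the user typed.
        System.out.println(Arrays.toString(tokenize("ship 0.7.0 with")));
    }
}
```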
> 
> Evan Daniel


_______________________________________________
Devl mailing list
[email protected]
http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
