Re: document segment size and search performance?

Andrzej Bialecki Wed, 04 Jun 2008 07:54:17 -0700

wuqi wrote:

Hi,
As we all know, "parse_text" in the segment is used by the searcher to 
generate snippets, and I want to know which of the two layouts below is faster 
for the searcher when retrieving parse_text:
1. 50 segments * 10,000 pages/segment

2. 5 segments * 100,000 pages/segment

parse_text uses Hadoop MapFile-s. MapFile-s provide fast random access to individual records because they contain an index of keys (and the files themselves are sorted in ascending order of keys). This index (which contains every 128-th key and its position) is fully loaded in memory, and when you want to get a particular record, first this index is searched (using binary search) to determine the correct "region" of a MapFile, and then the region itself is loaded from the disk and searched.

This means that extremely large MapFile-s may consume a lot of memory (though this can be adjusted by changing the index interval).
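To make the lookup described above concrete, here is a minimal Python sketch of a sorted file with a sparse in-memory index. It is a toy model, not Hadoop's MapFile code: the class name `SparseIndexedFile`, the example URLs, and the choice to scan the region sequentially are all assumptions made for illustration.

```python
import bisect

INDEX_INTERVAL = 128  # every 128th key is kept in the in-memory index


class SparseIndexedFile:
    """Toy model of a sorted key/value file with a sparse in-memory index."""

    def __init__(self, records, interval=INDEX_INTERVAL):
        # The "on-disk" data: records sorted in ascending key order.
        self.records = sorted(records)
        self.interval = interval
        # The in-memory index: every interval-th key and its position.
        self.index_pos = list(range(0, len(self.records), interval))
        self.index_keys = [self.records[i][0] for i in self.index_pos]

    def get(self, key):
        # Step 1: binary search the index for the last indexed key <= key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before the first record
        # Step 2: search only that region of the file (sequential scan here).
        start = self.index_pos[slot]
        for k, v in self.records[start:start + self.interval]:
            if k == key:
                return v
            if k > key:
                break  # passed the spot where the key would be
        return None


# Hypothetical data: 10,000 zero-padded URLs, as in a 10,000-page segment.
records = [("http://example.com/page%06d" % i, "text %d" % i) for i in range(10_000)]
f = SparseIndexedFile(records)
print(len(f.index_keys))                       # number of keys held in memory
print(f.get("http://example.com/page004242"))  # value found via index + region scan
```

A larger interval shrinks the in-memory index but makes each region scan longer, which is the trade-off behind adjusting the index interval.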

However, "large" usually means record counts in the order of millions. Let's do a quick calculation - assuming the keys here are URLs, each key takes ~50 bytes on average. We load every 128-th key plus its offset as a long (8 bytes). This means that for 1 mln keys the index holds about 7,800 entries of ~58 bytes each, so the memory consumption due to the MapFile index will be roughly 0.5MB of raw data (somewhat more once Java object overhead is counted).
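The same back-of-the-envelope estimate can be written as a few lines of Python. This is only a sketch under the assumptions stated above (50-byte keys, 8-byte offsets, one index entry per 128 keys); the function name `index_bytes` is made up for this example, and JVM per-object overhead is not modelled, so real memory use will be higher. With these inputs the raw figure comes out at roughly 0.45MB per million keys.

```python
def index_bytes(num_keys, avg_key_bytes=50, offset_bytes=8, interval=128):
    """Raw size of the sparse index: one (key, offset) entry per `interval` keys."""
    entries = -(-num_keys // interval)  # ceiling division
    return entries * (avg_key_bytes + offset_bytes)


# Raw index size for a single MapFile of various sizes.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(n, index_bytes(n))

# The two layouts from the question, both 500,000 pages in total.
many_small = 50 * index_bytes(10_000)
few_large = 5 * index_bytes(100_000)
print(many_small, few_large)
```

Under these assumptions the total index size is almost the same for both layouts, since it depends mainly on the total number of keys.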

This in turn means that below a certain size (and this threshold is in the order of a few million records or so) it's better to use a single segment instead of multiple segments with the same total number of records.

If we have more segments and fewer pages per segment, it seems we need to open more 
segment files, and hence use more memory? If there are more pages in a segment, we might need 
more time to get a certain page out? Finding a page among 10,000 pages should be 
faster than among 100,000 pages?
For a search engine that has about 10M documents, how many segment directories 
should I have?

Perhaps around 10. Segments larger than 1 mln documents are somewhat inconvenient to process - fetching takes a long time, and if something goes wrong then you lose a large chunk of data.

You can also split your segments along different criteria, e.g. one segment per day, or per week.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
