wuqi wrote:
Hi,
As we all know, "parse_text" in the segment is used by the searcher to 
generate snippets, and I want to know which of the two setups below would be faster 
for the searcher to retrieve parse_text:
1. 50 segments * 10,000 pages/segment
2. 5 segments * 100,000 pages/segment

parse_text uses Hadoop MapFile-s. MapFile-s provide fast random access to individual records because they contain an index of keys (and the files themselves are sorted in ascending order of keys). This index (which contains every 128-th key and its position) is fully loaded in memory, and when you want to get a particular record, first this index is searched (using binary search) to determine the correct "region" of a MapFile, and then the region itself is loaded from the disk and searched.
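
To make the lookup procedure concrete, here is a minimal sketch of the same idea in Python. It is not the Hadoop MapFile code or API: the function names, the in-memory list standing in for the on-disk data file, and the example.com URLs are all assumptions made for illustration. It keeps every 128-th key in an index, binary-searches that index, and then scans only the matching region.

```python
import bisect

INDEX_INTERVAL = 128  # every 128-th key is kept in the in-memory index


def build_sparse_index(sorted_keys, interval=INDEX_INTERVAL):
    """Return (index_keys, index_positions) holding every `interval`-th key."""
    positions = list(range(0, len(sorted_keys), interval))
    return [sorted_keys[p] for p in positions], positions


def lookup(records, index_keys, index_positions, key, interval=INDEX_INTERVAL):
    """Find `key` in `records`, a list of (key, value) pairs sorted by key."""
    # Binary search the in-memory index for the last indexed key <= key.
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = index_positions[slot]
    # Scan only that region (at most `interval` records); in a real MapFile
    # this is the part that would be read from disk.
    for k, v in records[start:start + interval]:
        if k == key:
            return v
        if k > key:
            break  # records are sorted, so the key cannot appear later
    return None


# Hypothetical data: 1000 zero-padded URLs, so string order matches numeric order.
records = [("http://example.com/page%05d" % i, "text %d" % i) for i in range(1000)]
index_keys, index_positions = build_sparse_index([k for k, _ in records])

print(len(index_keys))
print(lookup(records, index_keys, index_positions, "http://example.com/page00500"))
```

Only the small index has to stay in memory; the records themselves are touched one region at a time.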

This means that extremely large MapFile-s may consume a lot of memory (though this can be adjusted by changing the index interval).

However, "large" usually means record counts in the order of millions. Let's do a quick calculation - assuming the keys here are URLs, each key takes ~50 bytes on average. We load every 128-th key plus its offset as a long (8 bytes). This means that for 1 mln keys the index holds about 7,800 entries of ~58 bytes each, so the memory consumption due to the MapFile index will be roughly 0.5MB (somewhat more once Java object overhead is counted).
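
The estimate can be reproduced with a few lines of arithmetic. This is only a sketch under the assumptions stated in the paragraph (50-byte keys, an 8-byte offset, index interval 128); the function name is made up for illustration, and it counts raw payload bytes only, so real JVM memory use would be higher. With these inputs the index comes out at well under 1 MB per million keys.

```python
def mapfile_index_bytes(num_keys, avg_key_bytes=50, offset_bytes=8, index_interval=128):
    """Rough size of the in-memory index: one (key, offset) entry per interval."""
    entries = -(-num_keys // index_interval)  # ceiling division
    return entries * (avg_key_bytes + offset_bytes)


for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,} keys -> {mapfile_index_bytes(n) / 1e6:.2f} MB")
```

Changing `index_interval` shows the trade-off mentioned earlier: a larger interval shrinks the index but lengthens the region that has to be scanned per lookup.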

This in turn means that below a certain size (and this threshold is in the order of a few million records or so) it's better to use a single segment instead of multiple segments with the same total number of records.
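
As a rough illustration of why segment size matters little for a single lookup, the sketch below counts the work inside one MapFile for the two segment sizes from the question. It is a back-of-the-envelope model, not a measurement: the function name is hypothetical, and it ignores disk seeks and the cost of opening files, which in practice are likely to dominate.

```python
import math


def lookup_cost(pages, index_interval=128):
    """Return (index entries, binary-search comparisons, max records scanned)."""
    index_entries = math.ceil(pages / index_interval)
    # Worst-case comparisons for a binary search over the index.
    comparisons = max(1, math.ceil(math.log2(index_entries + 1)))
    # After the index search, at most one region of records is scanned.
    max_scanned = min(pages, index_interval)
    return index_entries, comparisons, max_scanned


for pages in (10_000, 100_000):
    print(pages, lookup_cost(pages))
```

Under this model a tenfold larger segment adds only a handful of in-memory comparisons, while the scanned region stays the same size.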


If we have more segments and fewer pages per segment, it seems we need to open more 
segment files, and hence use more memory? If there are more pages in a segment, we might need 
more time to get a certain page out? Finding a page among 10,000 pages should be 
faster than among 100,000 pages?
For a search engine which has about 10M documents, how many segment dirs 
should I have?

Perhaps around 10. Segments larger than 1 mln documents are somewhat inconvenient to process - fetching takes a long time, and if something goes wrong then you lose a large chunk of data.

You can also split your segments along different criteria, e.g. one segment per day, or per week.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
