Thanks, Andrzej, for such a detailed answer!

----- Original Message ----- 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, June 04, 2008 10:53 PM
Subject: Re: document segment size and search performance ?


> wuqi wrote:
>> Hi,
>> As we all know, "parse_text" in the segment is used by the searcher to 
>> generate snippets, and I want to know which of the two setups below 
>> would be faster for the searcher to retrieve parse_text from:
>> 1. 50 segments * 10,000 pages/segment
>> 2. 5 segments * 100,000 pages/segment 
> 
> parse_text uses Hadoop MapFile-s. MapFile-s provide fast random access 
> to individual records because they contain an index of keys (and the 
> files themselves are sorted in ascending order of keys). This index 
> (which contains every 128-th key and its position) is fully loaded in 
> memory, and when you want to get a particular record, first this index 
> is searched (using binary search) to determine the correct "region" of a 
> MapFile, and then the region itself is loaded from the disk and searched.
> 
> This means that extremely large MapFile-s may consume a lot of memory 
> (though this can be adjusted by changing the index interval).
> 
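To make this concrete for anyone else reading along, here is a rough sketch of
what such a lookup looks like from the searcher side, using Hadoop's
MapFile.Reader against a segment's parse_text directory (the segment path and
URL below are made up, and error handling is omitted):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.parse.ParseText;

  public class ParseTextLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // Made-up path: one partition of a segment's parse_text MapFile.
      String dir = "crawl/segments/20080604175328/parse_text/part-00000";

      // Opening the reader loads every 128th key (plus its file offset)
      // into memory as the index described above.
      MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);

      Text url = new Text("http://www.example.com/");
      ParseText parseText = new ParseText();

      // get() binary-searches the in-memory index, seeks to the start of
      // the matching region in the data file, and scans forward to the key.
      if (reader.get(url, parseText) != null) {
        System.out.println(parseText.getText());
      }
      reader.close();
    }
  }

If memory ever did become a problem, MapFile.Writer has a setIndexInterval()
method (if I remember right) that would let you index, say, every 1024th key
instead, trading memory for a longer scan within each region.
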
> However, "large" usually means record counts on the order of millions. 
> Let's do a quick calculation - assuming the keys here are URLs, each key 
> takes ~50 bytes on average. We load every 128-th key plus its offset 
> as a long (8 bytes). This means that for 1 mln keys the memory 
> consumption due to the MapFile index will be ~5MB.
> 
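Just to work your numbers through for my own understanding: 1,000,000 keys /
128 is roughly 7,800 index entries, and at ~50 bytes per URL key plus an
8-byte offset that comes to about 0.45 MB of raw data - so I take it the ~5MB
estimate allows for the Java object overhead of holding those keys as Text
objects on the heap? Either way it's small compared to a typical searcher heap.
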
> This in turn means that below a certain size (and this threshold is on 
> the order of a few million records or so) it's better to use a single 
> segment instead of multiple segments with the same total number of records.
> 
> 
>> If we have more segments and fewer pages per segment, it seems we need to 
>> open more segment files, and hence use more memory? And with more pages in 
>> a segment, we might need more time to get a certain page out? Finding a 
>> page among 10,000 pages should be faster than among 100,000 pages?
>> For a search engine with about 10M documents, how many segment dirs 
>> should I have?
> 
> Perhaps around 10. Segments larger than 1 mln documents are somewhat 
> inconvenient to process - fetching takes a long time, and if something 
> goes wrong then you lose a large chunk of data.
> 
> You can also split your segments by a different criterion, e.g. one 
> segment per day, or per week.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
