Thanks, Andrzej, for your detailed answer!

----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, June 04, 2008 10:53 PM
Subject: Re: document segment size and search performance?
> wuqi wrote:
>> Hi,
>> As we all know, "parse_text" in a segment is used by the searcher to
>> generate snippets, and I want to know which of the two configurations
>> below should be faster for the searcher to retrieve parse_text:
>> 1. 50 segments * 10,000 pages/segment
>> 2. 5 segments * 100,000 pages/segment
>
> parse_text uses Hadoop MapFile-s. MapFile-s provide fast random access
> to individual records because they contain an index of keys (and the
> files themselves are sorted in ascending order of keys). This index
> (which contains every 128-th key and its position) is fully loaded into
> memory, and when you want to get a particular record, this index is
> first searched (using binary search) to determine the correct "region"
> of the MapFile, and then the region itself is loaded from disk and
> searched.
>
> This means that extremely large MapFile-s may consume a lot of memory
> (though this can be adjusted by changing the index interval).
>
> However, "large" usually means record counts on the order of millions.
> Let's do a quick calculation - assuming the keys here are URLs, each key
> takes ~50 bytes on average. We load every 128-th key plus its offset
> as a long (8 bytes). This means that for 1 million keys the memory
> consumption due to the MapFile index will be ~5MB.
>
> This in turn means that below a certain size (and this threshold is on
> the order of a few million records or so) it's better to use a single
> segment instead of multiple segments with the same total number of
> records.
>
>> If we have more segments and fewer pages per segment, it seems we need
>> to open more segment files, and hence use more memory? If there are more
>> pages in a segment, we might need more time to get a certain page out?
>> Finding a page among 10,000 pages should be faster than among 100,000
>> pages?
>> For a search engine which has about 10M documents, how many segment
>> directories should I have?
>
> Perhaps around 10. Segments larger than 1 million documents are somewhat
> inconvenient to process - fetching takes a long time, and if something
> goes wrong then you lose a large chunk of data.
>
> You can also split your segments along different criteria, e.g. one
> segment per day, or per week.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
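To make the lookup path concrete, below is a minimal sketch (not part of the
original mail) of fetching one parse_text record through Hadoop's
MapFile.Reader, which is what the searcher does under the hood. The segment
path, the URL key, and the io.map.index.skip value are illustrative
assumptions, not values from this thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;

    public class ParseTextLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Optional: keep only every 8th index entry in memory, trading RAM
        // for a slightly longer sequential scan inside each region.
        conf.setInt("io.map.index.skip", 7);

        FileSystem fs = FileSystem.get(conf);
        // Hypothetical segment path; a real segment holds one MapFile
        // directory per reduce task under parse_text/part-NNNNN.
        String mapFileDir =
            "crawl/segments/20080604123456/parse_text/part-00000";

        MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
        try {
          Text key = new Text("http://www.example.com/"); // hypothetical URL key
          ParseText value = new ParseText();
          // get() binary-searches the in-memory key index, seeks to the start
          // of the matching region in the data file, and scans forward to the key.
          if (reader.get(key, value) != null) {
            System.out.println(value.getText());
          } else {
            System.out.println("URL not found in this part");
          }
        } finally {
          reader.close();
        }
      }
    }

The io.map.index.skip setting is the read-side knob for the index-interval
trade-off Andrzej mentions; the write-side interval can likewise be changed
via io.map.index.interval when the MapFile is created.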
