Ledio Ago wrote:
> Hi there!
>  
> After a crawl/index cycle a segment directory is created which usually
> contains content, index, and so on directories.
> Here is what actually my current segment directory has after crawl/index
> build of 2 Million URLs:
>  
> /segments/20070114151631> du -sh *
> 9.6G    content
> 212M    fetcher
> 5.0G    index
> 0       index.done
> 5.8G    parse_data
> 3.7G    parse_text
>   

The searcher needs "content" only to display a cached preview of the page; it
doesn't need it for anything else. It doesn't need "fetcher" either. So,
the only parts that you have to copy are index, parse_data and
parse_text.

(BTW. if you deploy only the index, you will be able to search and see 
each document's title and URL, because they are stored in the index, 
but you won't be able to get "snippets", i.e. fragments of matching 
text, because those come from parse_text).
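A minimal sketch of such a slimmed-down deployment, assuming a segment layout like the one shown above (all paths here are hypothetical; the fixture directories are created only so the copy can be demonstrated end to end):

```shell
#!/bin/sh
set -e

# Hypothetical segment layout, mimicking the directory listing above.
SRC=$(mktemp -d)/20070114151631
DST=$(mktemp -d)/20070114151631
mkdir -p "$SRC/content" "$SRC/fetcher" "$SRC/index" \
         "$SRC/parse_data" "$SRC/parse_text"
touch "$SRC/index.done"

# Deploy only the parts the searcher actually reads:
# index (plus its done-marker), parse_data, and parse_text.
# content and fetcher are deliberately left behind.
mkdir -p "$DST"
for part in index index.done parse_data parse_text; do
    cp -r "$SRC/$part" "$DST/"
done
```

Given the sizes quoted above, skipping content and fetcher saves roughly 9.8G of the ~24G segment on the search machine (at the cost of losing cached-page previews).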

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
