That would be dependent on your situation and what exactly you're trying to accomplish with Nutch.

> Let me clear this up a little: I have a crawl/index machine which crawls and indexes a fixed list of URLs, with no new discovery. The generated index gets copied to the searchers. I've set up a RAM disk on those boxes and the whole index is loaded in memory (the index is partitioned). To save memory, and to allow the machines to hold more of the index in RAM, I'd like to reduce the size of the segments by removing unnecessary data that is not used during the searching process. As you can see, the original index still remains on the index box and it will be used in the next crawl/index cycle.

... and in this case I will always require the segment data (other than crawl_generate, which can be safely deleted after the fetch is done).

> What do you mean by the "crawl_generate" data? Are you talking about the "content" directory?
>
> Thanks Sean for your input,
> Ledio

----- Original Message ----
From: Ledio Ago <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, January 19, 2007 1:36:57 PM
Subject: RE: Reduce segment size

Quick question:

It won't affect re-crawling, as that's dependent on the Nutch DB, but it will prevent you from re-indexing the data that was deleted, as it needs those files.

> Why would I want to reindex entries that I've deleted?

I have never tried running Nutch "just" with the index file; it might work or it might not, but it's something to test (move them out of the directory, but don't delete them).

----- Original Message ----
From: Ledio Ago <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size

Hi there!

After a crawl/index cycle a segment directory is created, which usually contains content, index, and other directories. Here is what my current segment directory actually has after a crawl/index build of 2 million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.
As you can see, the content directory is huge. My question is: if you just remove this directory, would that affect the search capability, or later the recrawling and reindexing? The content directory is so big; is there a way not to have to copy that directory to the searcher?

Thanks,
Ledio

Ledio Ago * Sr. Software Engineer * [EMAIL PROTECTED]
w: 415-348-7693 * f: 415-348-7032
LookSmart - Where To Look For What You Need. - Find. Save. Share.
625 Second Street, San Francisco, CA 94107

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
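
For anyone who wants to try the experiment discussed in this thread, here is a minimal sketch of copying a segment to a searcher while leaving out the content directory. It is only an illustration: the thread does not confirm that searching works without content, so this is something to test rather than a known-good setup. The function name copy_segment_without, the skip parameter, and any paths you pass in are hypothetical and are not part of Nutch. The source segment is never modified, which follows the advice above to move data aside rather than delete it.

```python
import shutil
from pathlib import Path


def copy_segment_without(src, dst, skip=("content",)):
    """Copy a segment directory to dst, omitting top-level subdirectories in skip.

    Hypothetical helper, not a Nutch API. The source segment is left
    untouched, so it stays available for re-indexing on the crawl/index box.
    dst must not already exist (a shutil.copytree requirement).
    """
    src = Path(src)

    def ignore(directory, names):
        # Filter only at the top level of the segment; deeper directories
        # that happen to share a skipped name are still copied.
        if Path(directory) == src:
            return [name for name in names if name in skip]
        return []

    shutil.copytree(src, dst, ignore=ignore)
    return Path(dst)
```

With the sizes listed in the original message, skipping content would leave 9.6G out of the copy. Whether the searcher still behaves correctly afterwards would need to be checked on a test box first.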
