True, but we are less than typical. ;) Seriously though, we are using Nutch to aggregate many small enterprise sources of varying shapes and sizes, which means many indexes (even when we merge as many together as possible). Others using Nutch in the enterprise for internal crawling may face the same challenges.
We are at the edge of the acceptable limit, as our enterprise implementations have a somewhat unusual situation:

* Each index has 20 fields (on average - some have 50! - but let's say 20)
* We have up to 30 indexes built on one machine, including helper indexes

Assuming a worst-case situation of 9 unmerged index segments, we will get:

30 * 9 * (7 + 20) = 7,290 open files

Whereas with the compound format, it would be:

30 * 9 = 270 open files

We are currently considering changing the way we use the indexer so that it is incremental (adding a few changed files to the existing index instead of creating a new one). As a result, indexes will not always be optimized, so each index will contain plenty of segments.

Agreed about the performance degradation (estimated at 5-10% by Gospodnetic and Hatcher), which only affects indexing time, not search time; we would state this as a clear caveat in the conf file. We'd rather the incremental indexing process be a little slower (our big performance problem is in parsing anyway) and the file system work be a little more manageable.

Are there any objections?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: 02 January 2007 13:07
To: nutch-dev@lucene.apache.org
Subject: Re: Creating Lucence Compound Index

Alan Tanaman wrote:
> Currently Nutch creates a Lucene multifile index, and makes sure any
> existing compound index is converted to multifile by using the
> IndexWriter.setUseCompoundFile(false) method.
>
> This is done whenever an IndexWriter is opened in the following methods:
>
> org.apache.nutch.indexer.Indexer.getRecordWriter
> org.apache.nutch.indexer.IndexSorter.sort
> org.apache.nutch.indexer.IndexMerger.merge
>
> Is there a technical constraint as to why Nutch should ensure usage of
> multifile (or prevent compound) and not allow the type to be set by a
> property setting?
>
> Does anyone object to/support a patch to allow this to be configurable?

Multifile indexes are somewhat faster, and require much less temporary space during indexing. Why would you want to use the compound format with Nutch?

The typical use of Nutch is that you work with a single index, or at most a few indexes, per machine - in such a case, a regular non-compound index works better, and there is no danger of running out of file handles.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
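
For reference, the open-file estimate in Alan's reply can be reproduced with a short sketch. It reads the "7 + 20" term as 7 fixed files per segment plus one file per field; that reading is an interpretation, since the message does not spell it out. The class and method names are hypothetical and are not part of Nutch or Lucene.

```java
// Minimal sketch of the open-file arithmetic from the message above.
// Assumption: in the multifile format each segment contributes a fixed
// number of files plus one file per field; in the compound format each
// segment contributes a single file. Names here are illustrative only.
public class OpenFileEstimate {

    /** Worst-case open files when every segment is stored as separate files. */
    static int multifileOpenFiles(int indexes, int segmentsPerIndex,
                                  int fixedFilesPerSegment, int fieldsPerIndex) {
        return indexes * segmentsPerIndex * (fixedFilesPerSegment + fieldsPerIndex);
    }

    /** Worst-case open files when every segment is a single compound file. */
    static int compoundOpenFiles(int indexes, int segmentsPerIndex) {
        return indexes * segmentsPerIndex;
    }

    public static void main(String[] args) {
        int indexes = 30;          // indexes per machine, including helper indexes
        int segments = 9;          // worst-case unmerged segments per index
        int fixedFiles = 7;        // the "7" in the message
        int fields = 20;           // average fields per index

        int multifile = multifileOpenFiles(indexes, segments, fixedFiles, fields);
        int compound = compoundOpenFiles(indexes, segments);

        System.out.println("multifile: " + multifile); // prints "multifile: 7290"
        System.out.println("compound:  " + compound);  // prints "compound:  270"
    }
}
```

With the figures from the message, the multifile estimate is 27 times the compound one, and it grows with the field count: substituting the 50-field case mentioned above gives 30 * 9 * (7 + 50) = 15,390.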