Hi Terry, It's usually useful to give some information about your environment. How many documents you are indexing, what is the average size of a document, etc. But I'll answer anyway :-).
For details about indexing see Otis' article at http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html On Tue, Jan 13, 2004 at 04:35:43PM -0500, Terry Steichen wrote: > I just aborted a re-indexing operation (because it was taking too much time - will > run it overnight instead). But I was surprised by what I found in the index > directory, which contained a total of 1,402 index files! It started out with 36 > files with the name of "_I9a.*", followed by groups of 72 files with names like > "_17si.*" and so forth. > > Is this normal? Yes. As it indexes more documents, Lucene creates more files. > > Also, I noticed that during the indexing it would chug along, indexing at a pretty > decent rate, and then, every so often (I would estimate every several hundred added > files) it would stop for perhaps 10 - 30 seconds (occasionally longer), doing a > bunch of disk activity. Then it would resume again - almost like it was optimizing. > (I'm doing this on a notebook, so the disk IO is probably fairly slow.) > Probably at this point, Lucene is merging some of the files. At this point, if you look you'll probably see that the number of files has been reduced. > Is this normal? Yes. Once the indexing is done you can optimize the index, which will result in one index instead of several hundred. Read the article, it's good stuff. Regards, Dror > > Regards, > > Terry > > PS: The code I'm using to do the indexing is below: > > import npg1.search.WebExecAnalyzer; > import org.apache.lucene.index.IndexWriter; > import npg1.search.WESimilarity2; > import npg1.search.WPDocument2a; > > import java.io.File; > import java.util.Date; > > class IndexWPFiles2a { > public static void main(String[] args) { > > //args[0] = location of target directory to be indexed > //args[1] = location of index directory (in which to create index files) > > System.out.println("starting"); > try { > Date start = new Date(); > > String target = "c:/master_db/master_xml"; > if(args[0] != null) { > target = args[0]; > } > String index = "c:/master_db/master_index"; > if(args[1] != null) { > index = args[1]; > } > > IndexWriter writer = null; > if(args.length < 3) { > writer = new IndexWriter(index, new WebExecAnalyzer(), true); > writer.mergeFactor = 50; > writer.setSimilarity(new WESimilarity2()); > indexDocs(writer, new File(target)); > } else { > writer = new IndexWriter(index, new WebExecAnalyzer(), false); > writer.setSimilarity(new WESimilarity2()); > } > writer.optimize(); > writer.close(); > > Date end = new Date(); > > System.out.print(end.getTime() - start.getTime()); > System.out.println(" total milliseconds"); > > } catch (Exception e) { > System.out.println(" caught a " + e.getClass() + > "\n with message: " + e.getMessage()); > } > } > > public static void indexDocs(IndexWriter writer, File file) > throws Exception { > //System.out.println("starting indexing with internal path"); > if (file.isDirectory()) { > String[] files = file.list(); > for (int i = 0; i < files.length; i++){ > //System.out.println("recursive call"); > indexDocs(writer, new File(file, files[i])); > } > } else { > try { > System.out.println("adding " + file); > writer.addDocument(WPDocument2a.Document(file)); > } catch (Exception e) { > System.out.println("error adding "+file+" - Exception: "+e.getMessage()); > } > } > } > } -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.fastbuzz.com http://www.zapatec.com --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
