It takes roughly 6 hours for me to index a gigabyte of data. The benchmarks
take quite a bit less, if I'm reading them correctly. I'll try out the
StringBuffer/StringBuilder and let you know. Thanks for the quick response, and
if you have any more suggestions, please let me know.
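
For reference, here is a rough sketch of what the StringBuilder change might
look like, with the Lucene calls left out so it can be run on its own. The
names DocSplitter, ParsedDoc, parse and parseString are made up for
illustration, and I'm assuming the line after <DOC> looks like
"<DOCNO> id </DOCNO>", which is what the substring(8, ...) in the quoted code
implies. Treat it as a starting point rather than a tested fix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits <DOC>...</DOC> blocks without String concatenation.
class DocSplitter {

    // Hypothetical holder for one parsed document (not a Lucene class).
    static final class ParsedDoc {
        final String docId;
        final String text;

        ParsedDoc(String docId, String text) {
            this.docId = docId;
            this.text = text;
        }
    }

    static List<ParsedDoc> parse(BufferedReader in) throws IOException {
        List<ParsedDoc> docs = new ArrayList<>();
        StringBuilder text = new StringBuilder(); // reused for every document
        String docId = "";
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("<DOC>")) {
                // Assumption: the next line is "<DOCNO> id </DOCNO>".
                line = in.readLine();
                if (line == null) {
                    break;
                }
                String tempStr = line.length() > 8 ? line.substring(8) : "";
                int pos = tempStr.indexOf(' ');
                docId = pos >= 0 ? tempStr.substring(0, pos) : tempStr;
            } else if (line.startsWith("</DOC>")) {
                // This is where the Document would be built and added to the writer.
                docs.add(new ParsedDoc(docId, text.toString()));
                text.setLength(0); // clear the buffer instead of allocating a new one
            } else {
                // append() grows one buffer; text + "\n" + line copies it every time
                text.append('\n').append(line);
            }
        }
        return docs;
    }

    // Convenience wrapper for trying the parser on an in-memory string.
    static List<ParsedDoc> parseString(String input) {
        try {
            return parse(new BufferedReader(new StringReader(input)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the old version, every line appended to a document copies everything read
so far for that document, so the cost grows with the square of the document
length. The builder only copies when its internal buffer has to grow.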
--JP
On 8/11/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
> How much slower than anticipated is it?
>
> I would start by using a StringBuffer/Builder rather than appending
> (immutable) strings to each other.
>
>
> On 11 Aug 2007, at 19:05, John Paul Sondag wrote:
>
> > Hi,
> >
> > I was hoping that maybe you guys could see if I'm somehow indexing
> > inefficiently. I'm putting the relevant parts of my code below. I've
> > looked at the "benchmarks" page on Lucene, and my indexing is taking
> > substantially longer than what I see posted. I'm not sure when I
> > should call flush() (I saw on the ImproveIndexingSpeed page that I
> > should be doing that). I'd really appreciate any advice.
> >
> > Here's my code:
> >
> > File directory = new File("/mounts/falcon5/disks/0/tcheng3/Dataset");
> > File[] theFiles = directory.listFiles();
> >
> > //go through each file inside the directory and index it
> > for (int curFile = 0; curFile < theFiles.length; curFile++)
> > {
> >     File fin = theFiles[curFile];
> >
> >     //open up the file
> >     FileInputStream inf = new FileInputStream(fin);
> >     InputStreamReader isr = new InputStreamReader(inf, "US-ASCII");
> >     BufferedReader in = new BufferedReader(isr);
> >     String text = "";
> >     String docid = "";
> >
> >     while (true) {
> >
> >         //read in the file one line at a time, and act accordingly
> >         String line = in.readLine();
> >         if (line == null) { break; }
> >
> >         if (line.startsWith("<DOC>")) {
> >             //get docID
> >             line = in.readLine();
> >             String tempStr = line.substring(8, line.length());
> >             int pos = tempStr.indexOf(' ');
> >             docid = tempStr.substring(0, pos);
> >         } else if (line.startsWith("</DOC>")) {
> >
> >             Document doc = new Document();
> >
> >             doc.add(new Field("contents", text, Field.Store.NO,
> >                     Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS));
> >             doc.add(new Field("DocID", docid, Field.Store.YES,
> >                     Field.Index.NO));
> >             writer.addDocument(doc);
> >             text = "";
> >         } else {
> >             text = text + "\n" + line;
> >         }
> >     }
> >
> > }
> >
> >
> > int numIndexed = writer.docCount();
> >
> > writer.optimize();
> > writer.close();
> >
> >
> > Thanks,
> >
> > --JP
>