If possible you should also test the soon-to-be-released version 2.3, which has a number of speedups to indexing.

Also try the steps here:

  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

You should also try an A/B test: A) write your index to the NFS directory, then B) write it to a local file system, to see how much NFS is really slowing you down.
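
Something like this is enough for the timing; it's only a sketch, and the paths and synthetic documents are placeholders for your real setup:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexTimer {
    // Index the same synthetic documents into the given path, report wall-clock time.
    static long timeIndexing(String path) throws Exception {
      long start = System.currentTimeMillis();
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path),
                                           new StandardAnalyzer(), true);
      for (int i = 0; i < 10000; i++) {  // stand-in for your real documents
        Document doc = new Document();
        doc.add(new Field("body", "sample text " + i,
                          Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
      }
      writer.close();
      return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
      System.out.println("A) NFS:   " + timeIndexing("/mnt/nfs/test-index") + " ms");
      System.out.println("B) local: " + timeIndexing("/tmp/test-index") + " ms");
    }
  }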

Mike

Erick Erickson wrote:

This seems really clunky. Especially if your merge step also optimizes.

There's not much point in indexing into RAM and then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built into FSDirectory, and your merge factor etc. control
how much RAM is used before flushing to disk. There's considerable
discussion of this on the Wiki I believe, and in the mail archives for sure.
And I believe there's a RAM-usage-based flushing policy somewhere.
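
For example, something like this (just a sketch; the path is made up, and setRAMBufferSizeMB is the 2.3 RAM-based flush policy I'm thinking of):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/local/index"),
                                       new StandardAnalyzer(), true);
  writer.setMergeFactor(10);        // how many segments accumulate before a merge
  writer.setMaxBufferedDocs(1000);  // 2.2: flush after this many buffered docs
  // writer.setRAMBufferSizeMB(32.0);  // 2.3: flush once buffered docs use ~32 MB
  // ... addDocument() as usual; the buffering happens inside the writer ...
  writer.close();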

You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.

There's a famous quote about that from Donald Knuth
(paraphrasing Hoare) "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.

So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing that and then simply copying your index to the NFS mount.
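
Roughly like this, with made-up paths (Directory.copy does the copy in one shot):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // 1) Build the whole index on local disk.
  Directory local = FSDirectory.getDirectory("/tmp/local-index");
  IndexWriter writer = new IndexWriter(local, new StandardAnalyzer(), true);
  // ... addDocument() for every document ...
  writer.optimize();
  writer.close();

  // 2) Copy the finished index to the NFS mount in one pass.
  Directory nfs = FSDirectory.getDirectory("/mnt/nfs/central-index");
  Directory.copy(local, nfs, true);  // true closes the source directory when done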

Best
Erick

On Jan 10, 2008 10:05 AM, Ariel <[EMAIL PROTECTED]> wrote:

In a distributed environment the application has to make heavy use of
the network, and there is no other way to access the documents in a
remote repository than through the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory; when the RAMDirectory reaches its limit (I have set it to
about 10 MB), I serialize the index to disk (NFS) and merge it with the
central index (the central index is on the NFS file system). Is that
correct?
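Roughly, the scheme looks like this (a simplified sketch; the paths and
the 10 MB check are illustrative):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.RAMDirectory;

  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
  // ... addDocument() into ramWriter ...
  if (ramDir.sizeInBytes() > 10 * 1024 * 1024) {  // the ~10 MB limit
    ramWriter.close();
    // Merge the in-RAM segment into the central index on the NFS mount.
    IndexWriter central = new IndexWriter(
        FSDirectory.getDirectory("/mnt/nfs/central-index"),
        new StandardAnalyzer(), false);
    central.addIndexes(new Directory[] { ramDir });
    central.close();
    ramDir = new RAMDirectory();  // start over with a fresh in-RAM index
    ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
  }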
I hope you can help me.
I have taken into consideration the suggestions you made before, and I
am going to test a few things.
Ariel


On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:

Thanks to all of you for your answers; I am going to change a few
things in my application and run tests.
One thing: I haven't found another good PDF-to-text converter besides
PDFBox. Do you know of a faster one?
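For reference, my extraction is basically the standard PDFBox call (a
sketch; the package names are the pre-Apache 0.7.x ones I am using):

  import java.io.File;
  import org.pdfbox.pdmodel.PDDocument;
  import org.pdfbox.util.PDFTextStripper;

  // Extract the text of one PDF with PDFBox.
  PDDocument pdf = PDDocument.load(new File("paper.pdf"));
  try {
    String text = new PDFTextStripper().getText(pdf);
    // ... build the Lucene Document from this text ...
  } finally {
    pdf.close();
  }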
Greetings
Thanks for your answers
Ariel


On Jan 9, 2008 11:08 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Ariel,

I believe PDFBox is not the fastest thing and was built more to handle
all possible PDFs than for speed (just my impression - Ben, PDFBox's
author, might still be on this list and might comment). Pulling data
from NFS to index seems like a bad idea. I hope at least the indices
are local and not on a remote NFS...

We benchmarked local disk vs. NFS vs. an FC SAN (don't recall which
one), and indexing over NFS was slooooooow.

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?

Hi:
I have seen the post at

  http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html

and I am implementing a similar application in a distributed
environment: a cluster of only 5 nodes. The operating system I use is
Linux (CentOS), so I am also using the NFS file system to access the
home directory where the documents to be indexed reside, and I would
like to know how much time an application should take to index a large
amount of documents, say 10 GB.
I use Lucene version 2.2.0; every node has a dual Xeon 2.4 GHz CPU with
512 MB of RAM, and the LAN is 1 Gbit/s.

The problem I have is that my application spends a lot of time indexing
all the documents: the delay to index 10 GB of PDF documents is about 2
days (to convert PDF to text I am using PDFBox), which is of course a
lot of time. Other applications based on Lucene, for instance IBM
OmniFind, take only 5 hours to index the same amount of PDF documents.
I would like to find out why my application has this big delay to
index; any help is welcome.
Do you know of other distributed applications that use Lucene to index
big amounts of documents? How long do they take to index?
I hope you can help me.
Greetings



