If possible you should also test the soon-to-be-released version 2.3,
which has a number of speedups to indexing.
Also try the steps here:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
You should also run an A/B test: A) writing your index to the NFS
directory and then B) to a local disk, to see how much NFS is
really slowing you down.
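A rough sketch of that A/B comparison, assuming Lucene 2.x on the classpath; the paths and document count below are placeholders, not anything from the original setup:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class NfsVsLocalBench {
    // Index the same synthetic docs into a given path and report elapsed ms.
    static long indexTo(String path, int numDocs) throws Exception {
        long start = System.currentTimeMillis();
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path),
                new StandardAnalyzer(), true); // true = create a fresh index
        for (int i = 0; i < numDocs; i++) {
            Document doc = new Document();
            doc.add(new Field("content", "sample text number " + i,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        // A) the NFS mount vs. B) local disk -- both paths are hypothetical.
        System.out.println("NFS:   " + indexTo("/mnt/nfs/index", 10000) + " ms");
        System.out.println("local: " + indexTo("/tmp/index", 10000) + " ms");
    }
}
```

The gap between the two timings is roughly the NFS overhead for the write path alone.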
Mike
Erick Erickson wrote:
This seems really clunky. Especially if your merge step also
optimizes.
There's not much point in indexing into RAM and then merging explicitly.
Just use an FSDirectory rather than a RAMDirectory. There is *already*
buffering built into FSDirectory, and your merge factor etc. control
how much RAM is used before flushing to disk. There's considerable
discussion of this on the Wiki I believe, but in the mail archive
for sure.
And I believe there's a RAM usage based flushing policy somewhere.
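For what it's worth, in 2.3 that RAM-usage-based flushing policy is exposed as IndexWriter.setRAMBufferSizeMB. A minimal sketch of writing straight to an FSDirectory, with the path and buffer size as placeholder values:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class DirectFsIndexing {
    // Open a writer directly on an FSDirectory and let it do the RAM
    // buffering itself -- no explicit RAMDirectory-plus-merge step needed.
    public static IndexWriter openWriter(String path) throws Exception {
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path),
                new StandardAnalyzer(), true); // true = create a fresh index
        writer.setRAMBufferSizeMB(32.0); // flush once ~32 MB of RAM is used (2.3+)
        writer.setMergeFactor(10);       // how aggressively segments get merged
        return writer;
    }
}
```

With this, segment flushing and merging happen behind addDocument() calls and there is no hand-written merge code to maintain.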
You're adding complexity where it's probably not necessary. Did you
adopt this scheme because you *thought* it would be faster or because
you were addressing a *known* problem? Don't *ever* write complex code
to support a theoretical case unless you have considerable certainty
that it really is a problem. "It would be faster" is a weak
argument when
you don't know whether you're talking about saving 1% or 95%. The
added maintenance is just not worth it.
There's a famous quote about that from Donald Knuth
(paraphrasing Hoare) "We should forget about small efficiencies,
say about 97% of the time: premature optimization is the root of
all evil." It's true.
So the very *first* measurement I'd take is to get rid of the in-RAM
stuff and just write the index to local disk. I suspect you'll be *far*
better off doing this and then just copying your index to the NFS mount.
Best
Erick
On Jan 10, 2008 10:05 AM, Ariel <[EMAIL PROTECTED]> wrote:
In a distributed environment the application has to make exhaustive
use of the network, and there is no other way to access the documents
in a remote repository than through the NFS file system.
One thing I must clarify: I index the documents in memory, using a
RAMDirectory to do that; then, when the RAMDirectory reaches the limit
(I have set it to about 10 MB), I serialize the index to disk (NFS) to
merge it with the central index (the central index is on the NFS file
system). Is that correct?
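For reference, the scheme described above would look roughly like this in Lucene 2.x code; the paths are placeholders, and the 10 MB threshold check is only indicated in a comment:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamThenMerge {
    // Build a small in-RAM index, then merge it into an on-disk index at `path`.
    public static void run(String path) throws Exception {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("content", "hello world",
                Field.Store.NO, Field.Index.TOKENIZED));
        ramWriter.addDocument(doc);
        // In the scheme above: keep adding documents until
        // ramDir.sizeInBytes() passes the ~10 MB threshold.
        ramWriter.close();

        // Merge the RAM segment into the central index (on NFS in this setup).
        IndexWriter central = new IndexWriter(FSDirectory.getDirectory(path),
                new StandardAnalyzer(), true); // true creates the target here
        central.addIndexes(new Directory[] { ramDir });
        central.close();
    }
}
```

Note that addIndexes(Directory[]) optimizes the target index as part of the merge, which is one reason this step gets expensive against an index that lives on NFS.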
I hope you can help me.
I have taken into consideration the suggestions you made before; I am
going to do some things to test them.
Ariel
On Jan 10, 2008 8:45 AM, Ariel <[EMAIL PROTECTED]> wrote:
Thanks to all of you for your answers. I am going to change a few
things in my application and run tests.
One thing: I haven't found another PDF-to-text converter as good as
PDFBox. Do you know any faster one?
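For comparison, the usual PDFBox extraction path looks roughly like this (package names as in the 0.7.x line that was current at the time; treat this as a sketch, not the poster's actual code):

```java
import java.io.File;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfToText {
    // Extract all text from one PDF file.
    public static String extract(File pdf) throws Exception {
        PDDocument doc = PDDocument.load(pdf);
        try {
            // Constructed per call here for simplicity; reusing one
            // PDFTextStripper across files is often suggested, since
            // its construction is not free.
            return new PDFTextStripper().getText(doc);
        } finally {
            doc.close(); // always release the parsed document
        }
    }
}
```

Timing this method in isolation would show whether the bottleneck is PDF conversion or the Lucene/NFS side.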
Greetings
Thanks for your answers
Ariel
On Jan 9, 2008 11:08 PM, Otis Gospodnetic
<[EMAIL PROTECTED]>
wrote:
Ariel,
I believe PDFBox is not the fastest thing and was built more to
handle
all possible PDFs than for speed (just my impression - Ben,
PDFBox's
author
might still be on this list and might comment). Pulling data
from NFS
to
index seems like a bad idea. I hope at least the indices are local
and not
on a remote NFS...
We benchmarked local disk vs. NFS vs. a FC SAN (don't recall which
one) and indexing over NFS was slooooooow.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Ariel <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, January 9, 2008 2:50:41 PM
Subject: Why is lucene so slow indexing in nfs file system ?
Hi:
I have seen the post in
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12700.html
and
and
I am implementing a similar application in a distributed environment,
a cluster of only 5 nodes. The operating system I use is Linux
(CentOS), so I am using the NFS file system too, to access the home
directory where the documents to be indexed reside, and I would like
to know how much time an application should spend to index a big
amount of documents, like 10 GB.
I use Lucene version 2.2.0; CPU: dual Xeon 2.4 GHz with 512 MB RAM in
every node; LAN: 1 Gbit/s.
The problem I have is that my application spends a lot of time
indexing all the documents: the delay to index 10 GB of PDF documents
is about 2 days (to convert PDF to text I am using PDFBox), which is
of course a lot of time. Other applications based on Lucene, for
instance IBM OmniFind, take only 5 hours to index the same amount of
PDF documents. I would like to find out why my application has this
big delay to index; any help is welcome.
Do you know other distributed-architecture applications that use
Lucene to index big amounts of documents? How long do they take to
index?
I hope you can help me.
Greetings
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]