One of the biggest gains you can get in indexing comes from
multi-threading it pretty heavily.  Solr can accept lots of
simultaneous connections.
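
A rough sketch of what multi-threaded posting could look like from a
Ruby client. The endpoint URL, the worker count of 4, and the names
`post_xml` and `index_in_parallel` are assumptions made for
illustration, not part of Solr or of any script in this thread. Each
worker thread pulls ready-made `<add>` messages off a queue and POSTs
them over its own connection:

```ruby
require 'net/http'
require 'uri'

# Assumed endpoint and thread count; both are placeholders to tune.
UPDATE_URI = URI.parse('http://localhost:8983/solr/update')
WORKER_COUNT = 4

# POST one XML update message to Solr.
def post_xml(xml)
  Net::HTTP.start(UPDATE_URI.host, UPDATE_URI.port) do |http|
    http.post(UPDATE_URI.path, xml, 'Content-Type' => 'text/xml; charset=utf-8')
  end
end

# Send each message from a pool of worker threads.
# messages: an Enumerable of "<add>...</add>" strings.
# An optional block replaces the real HTTP call (handy for dry runs).
# Returns the number of messages sent.
def index_in_parallel(messages, worker_count = WORKER_COUNT, &poster)
  poster ||= method(:post_xml)
  queue = Queue.new
  messages.each { |message| queue << message }
  worker_count.times { queue << :done } # one stop marker per worker

  workers = Array.new(worker_count) do
    Thread.new do
      sent = 0
      while (message = queue.pop) != :done
        poster.call(message)
        sent += 1
      end
      sent
    end
  end

  # Thread#value waits for the thread and re-raises any error it hit.
  workers.sum(&:value)
end
```

This version loads every message into the queue up front, which is fine
for a sketch; for a very large feed a `SizedQueue` filled by a producer
thread would keep memory bounded. Ruby threads that are waiting on HTTP
I/O do run concurrently, so several requests are in flight at once.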

Folks worried about the HTTP communication are typically those who
are new to Solr and see it as a bottleneck without measuring.
Those who have done the measuring have found that HTTP is really not
a huge factor in indexing performance.

When Lucene 2.3 is dropped into Solr (coming soon), indexing speed
will improve even more substantially thanks to improvements in the
Lucene core.

But POSTing more than one document at a time is also good advice.

       Erik



On Nov 23, 2007, at 7:03 AM, Ewout Van Troostenberghe wrote:
How do you fill the index? Our main database has about 700,000
records and I don't know if I should build one huge XML file and feed
that into Solr or use a script that sends one record at a time with a
commit after every 1000 records or so. Or do something in between and
split it into chunks of a few thousand records each? What are your
experiences? What if a record gives an error? Will the whole file be
rejected or just that one record?

There is a Java command-line tool, or you can look at VuFind's
solution. If you can, I suggest you prefer a pure Java solution that
writes directly to the Solr index (with the Solr API), because it is
much quicker than the PHP (or Rails, or Perl) solution, which is based
on a web service (and so needs the PHP parsing and the HTTP request
overhead). The PHP solution does nothing with Solr directly; it uses
the web service, and all the code can be rewritten in Perl.

When you want to use a scripting language to fill the Solr index,
rather than using the Solr API directly, you should consider buffering
as an intermediate solution. It can speed up indexing by orders of
magnitude. Create your XML in the script, and keep the documents in
memory until you have 50 or 100 of them. Then post these together.
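
A minimal sketch of that buffering idea follows. It is not the attached
script: the endpoint URL, the batch size of 100, and the names
`add_xml`, `post_to_solr` and `index_buffered` are assumptions made for
illustration. Records are assumed to arrive as hashes of field name to
value. Each batch becomes one `<add>` message containing many `<doc>`
elements, and a single `<commit/>` is sent at the end:

```ruby
require 'net/http'
require 'uri'

# Assumed update endpoint; adjust host, port and path for your setup.
SOLR_UPDATE_URL = 'http://localhost:8983/solr/update'
BATCH_SIZE = 100

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape characters that are special in XML text and attributes.
def xml_escape(value)
  value.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Turn one record (a Hash of field name => value) into a <doc> element.
def doc_xml(record)
  fields = record.map do |name, value|
    %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
  end
  "<doc>#{fields.join}</doc>"
end

# Wrap several records in a single <add> message.
def add_xml(records)
  "<add>#{records.map { |record| doc_xml(record) }.join}</add>"
end

# POST one XML message to Solr and fail loudly on a non-2xx response.
def post_to_solr(xml, url = SOLR_UPDATE_URL)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.post(uri.path, xml, 'Content-Type' => 'text/xml; charset=utf-8')
  end
  raise "Solr returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  response
end

# Buffer records into batches, post each batch, then commit once.
# An optional block replaces the real HTTP call (handy for dry runs).
# Returns the number of <add> batches sent.
def index_buffered(records, batch_size = BATCH_SIZE, &poster)
  poster ||= method(:post_to_solr)
  batches = 0
  records.each_slice(batch_size) do |batch|
    poster.call(add_xml(batch))
    batches += 1
  end
  poster.call('<commit/>')
  batches
end
```

Calling `index_buffered(records)` with any enumerable of hashes posts
to the assumed URL; passing a block such as `{ |xml| puts xml }` prints
the messages instead of sending them.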

Attached is a small Ruby script we use to do Solr indexing. It reads
YAML records from standard input, does some processing (buried in our
libraries), buffers the result, and posts once 100 records have been
gathered.

Regards,
Ewout
<index_solr_fast.rb>
