On Nov 22, 2007, at 12:11 PM, Binkley, Peter wrote:
There is a way to pass Solr a path to a file that it can read from
disk rather than posting the file. I hunted a bit in the wiki and
couldn't find it, though; it may still be a patch you have to apply.

Solr ships with examples that can be posted in using post.sh or
post.jar.  See the README.txt in the example directory.  You can run
it either as:

       post.sh *.xml

  - or -

       java -jar post.jar *.xml

As far as I know there is no way to avoid POSTing the XML - no direct
import of an XML file without HTTP (short of getting down and dirty
with the embedded Solr API, which is discouraged for many reasons).

You can also bring data into Solr using the CSV importer.  I highly
recommend folks take a good look at this route.  It's clean, easy, fast:
<http://wiki.apache.org/solr/UpdateCSV>
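For instance, the CSV route might look like this (a minimal sketch:
the URL and port are the example-config defaults, and the field names
and filename are made up for illustration - adjust for your install):

```shell
# Write the records out as one CSV file; the header row names the
# Solr fields. Field names here are hypothetical.
cat > books.csv <<'EOF'
id,title,author,year
1,Example Title,"Smith, James",1999
2,Another Title,"Miller, Steve",2005
EOF

# Hand the whole file to Solr's CSV handler in a single request,
# committing at the end.
curl 'http://localhost:8983/solr/update/csv?commit=true' \
  --data-binary @books.csv \
  -H 'Content-type: text/plain; charset=utf-8'
```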

For 700,000 records, a nice first step is to convert that data into
CSV and feed it to Solr: create a CSV file on the file system with all
those records and use the CSV importer. I think you'll find that's the
absolute fastest way to bring data in. Another good way is to POST in
the XML in batches of, say, 100 or 1000 documents per POST. Be sure
you know how you have Solr set up for commits, too - for a bulk
import, don't commit until the end so Solr can operate most
efficiently. You can watch the Solr stats page to see how many
documents have arrived.
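That batch-then-commit-once pattern could be sketched in shell like so
(the URL is the example-config default and the chunk filenames are
assumptions, not anything Solr requires):

```shell
# POST each pre-built chunk of <add> XML; no commit per chunk.
for f in chunk-*.xml; do
  curl 'http://localhost:8983/solr/update' \
    --data-binary @"$f" \
    -H 'Content-type: text/xml; charset=utf-8'
done

# One explicit commit at the very end, once everything is in.
curl 'http://localhost:8983/solr/update' \
  --data-binary '<commit/>' \
  -H 'Content-type: text/xml; charset=utf-8'
```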

       Erik






-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED]] On Behalf Of
Michael Lackhoff
Sent: Thursday, November 22, 2007 1:03 AM
To: [email protected]
Subject: [CODE4LIB] Getting started with SOLR

Hello,

I am just getting my feet wet with SOLR and have a couple of questions
about how others have done certain things.

I created a schema.xml where, for a start, basically every field is of
type "text". Do you use specialized types for authors or ISBNs or
other fields?
How do you handle multi-value fields? Do you feed everything into a
single field (like "Smith, James ; Miller, Steve", as I have seen in a
pure Lucene implementation by a colleague), or do you use the
multiValued feature of SOLR?

What about boosting? I thought of giving the current year a
boost="3.0" and then 0.1 less for every year the title is older, down
to 1.0 for a 21-year-old book. The idea is to have a sort that tends
to promote recent titles but still respects other aspects. Does this
sound reasonable, or are there other ideas? I would be very interested
in an actual boosting scheme I could start from.
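The arithmetic of that scheme works out like this (a sketch of one
reading of it, clamping at 1.0 for old titles; the sample years are
chosen only for illustration):

```shell
# Year-based document boost: 3.0 for the current year, minus 0.1 per
# year of age, never dropping below 1.0 (one reading of the scheme
# described above - the clamp is an assumption).
current_year=2007
for year in 2007 2000 1980; do
  boost=$(awk -v cur="$current_year" -v y="$year" \
    'BEGIN { b = 3.0 - 0.1 * (cur - y); if (b < 1.0) b = 1.0; printf "%.1f", b }')
  echo "year=$year boost=$boost"
done
```

The resulting value would go into the boost attribute of each <doc>
element in the update XML.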

We have a couple of databases that should eventually be indexed. Do
you build one huge database with an additional "database" field, or is
it better to give every database its own SOLR instance?

How do you fill the index? Our main database has about 700,000
records, and I don't know if I should build one huge XML file and feed
that into SOLR, use a script that sends one record at a time with a
commit after every 1000 records or so, or do something in between and
split it into chunks of a few thousand records each. What are your
experiences? What if a record gives an error - will the whole file be
rejected, or just that one record?
Are there alternatives to the HTTP gateway?
Are there any Perl scripts around that could help? I built a little
script that uses LWP to feed my test records into the database. It
works, but I don't have any error handling yet and the XML creation is
very quick and dirty, so if there is something more mature I would
like to use that.

Any other ideas, further reading, experiences...?

I know these are a lot of questions, but after the conference last
year I think there is a lot of expertise in this group, and perhaps I
can avoid a few beginner mistakes with your help.

thanks in advance
- Michael
