On Nov 23, 2007, at 7:50 AM, Michael Lackhoff wrote:
You can also bring data into Solr using the CSV importer. I highly
recommend folks take a good look at this route. It's clean, easy,
fast:
<http://wiki.apache.org/solr/UpdateCSV>
That sounds like what I need. The only problem I see: what about
escapes? I don't know my data well enough to be sure that any possible
delimiter will never occur within the data. Most exotic characters will
probably be errors, but I still don't want Solr to choke on them.
Can I use escapes for the separator and/or encapsulator? If so, is it \" or
"" (backslash or doubling)? I found nothing in the docs about it.
I prefer tab-delimited files myself. Tabs are those otherwise worthless
characters that hold great value as a field separator, and only as a
field separator. Ever.
At the bottom of that wiki page you'll see how to do it with
tab-delimited files. But as you're creating that data, ensure that your
field data is free of tabs except as a separator. Then you're in
business. That beats having to worry about quotes and commas and
escaping.
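
A minimal sketch of that clean-up step, in Python. The names clean_field
and write_tsv are hypothetical, and replacing embedded tabs and line
breaks with a single space is only one possible policy; you might prefer
to reject or log such records instead.

```python
import io

# Characters that must never appear inside a field of a tab-delimited file.
# Assumption: replacing them with a space is acceptable for this data.
_FIELD_CLEANUP = str.maketrans({"\t": " ", "\r": " ", "\n": " "})


def clean_field(value):
    """Return the value as text with embedded tabs and line breaks replaced."""
    return str(value).translate(_FIELD_CLEANUP)


def write_tsv(rows, fieldnames, out):
    """Write a header line and one tab-separated line per row to `out`.

    `rows` is an iterable of dicts; missing keys become empty fields.
    """
    out.write("\t".join(fieldnames) + "\n")
    for row in rows:
        fields = (clean_field(row.get(name, "")) for name in fieldnames)
        out.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    buf = io.StringIO()
    write_tsv([{"id": 1, "title": "tab\tinside"}], ["id", "title"], buf)
    print(buf.getvalue(), end="")
```

With the tabs stripped out of the field data up front, every tab left in
the file is a separator, so no quoting or escaping rules come into play.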
For 700,000 records, a nice first step to try is to convert that
data into CSV and feed it into Solr. Create a CSV file on the file
system with all those records and use the CSV importer. I think
you'll find that the absolute fastest way to bring data in. But
It even looks like the direct way (almost) without HTTP since the file
is read directly from the file system and doesn't have to be squeezed
through a socket connection.
Right - you can feed the CSV data to it as a path to a local (to the
Solr server) file, or stream the data in via HTTP.
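
A rough sketch of how the two variants differ, in Python. It only
builds the request URL and sends nothing. The /solr/update/csv path and
the separator, commit and stream.file parameters are as I recall them
from the UpdateCSV wiki page, so check them against your Solr version;
the host, port, file path and the function name csv_update_url are
assumptions for illustration.

```python
from urllib.parse import urlencode


def csv_update_url(base_url, local_path=None, separator="\t", commit=True):
    """Build a URL for the CSV update handler.

    With `local_path`, Solr is asked to read a file that is local to the
    Solr server (stream.file). Without it, the CSV data would be sent as
    the body of an HTTP POST to the returned URL.
    """
    params = {
        "separator": separator,  # a tab is encoded as %09
        "commit": "true" if commit else "false",
    }
    if local_path is not None:
        params["stream.file"] = local_path
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed default location of a local Solr instance.
    base = "http://localhost:8983/solr/update/csv"
    print(csv_update_url(base, "/tmp/data.tsv"))  # server reads the file
    print(csv_update_url(base))                   # data is POSTed instead
```

Reading a local file this way typically requires remote streaming to be
enabled in the Solr configuration.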
But again, don't concern yourself too much at this stage with the
overhead of HTTP. Most have found it not to be the bottleneck in large
indexing jobs, especially when the indexer code is running on the Solr
server itself or on a local network. But the CSV route takes that
out of the picture and provides a very clean and flexible conduit
into Solr.
I call it "column separated values", now that I've finally understood
the real reason for the existence of tabs.
Erik