Re: [CODE4LIB] indexing word documents using solr

Eric Lease Morgan Wed, 11 Feb 2015 06:53:19 -0800

On Feb 10, 2015, at 11:46 AM, Erik Hatcher <[email protected]> wrote:


> bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most 
certainly appealing, but I’m wondering about the underlying index’s schema.

Tika makes every effort to extract as much metadata from Word documents as 
possible. This metadata includes dates, titles, authors, names of applications, 
last edit, etc. Some of this data can be very useful. The metadata can be 
packaged up as an XML file/stream and then sent to Solr for indexing. “Tastes 
great. Less filling.” But my question is, “To what degree does Solr know what 
to do with the metadata when the (kewl) command, above, is seemingly so 
generic? Does one need to create a Solr schema to specifically accommodate the 
Tika-created metadata, or do such things also come for ‘free’?”
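
One way to explore the question empirically, sketched here under stated assumptions: ask Solr's Schema API for the list of field names before and after posting the document, then compare the two lists. The base URL (http://localhost:8983/solr), the collection name, and the helper function names are illustrative assumptions, not something from this thread, and which fields actually appear will depend on how the collection is configured.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def schema_fields_url(base_url, collection):
    """Build the Schema API URL that lists a collection's fields."""
    return f"{base_url.rstrip('/')}/{quote(collection)}/schema/fields?wt=json"


def field_names(schema_response):
    """Pull the sorted field names out of a parsed Schema API response."""
    return sorted(f["name"] for f in schema_response.get("fields", []))


def fetch_field_names(base_url, collection):
    """Query a running Solr instance (assumed reachable) for its field names."""
    with urlopen(schema_fields_url(base_url, collection)) as resp:
        return field_names(json.load(resp))


def new_fields(before, after):
    """Return the field names present after indexing but not before."""
    return sorted(set(after) - set(before))


# Hypothetical usage, assuming Solr is running locally:
#   before = fetch_field_names("http://localhost:8983/solr", "collection_name")
#   ... run: bin/post -c collection_name /path/to/file.doc ...
#   after = fetch_field_names("http://localhost:8983/solr", "collection_name")
#   print(new_fields(before, after))
```

If the second list contains names the first does not, the metadata fields were created without a hand-written schema; if the lists match, the metadata either landed in existing fields or was dropped, and the schema would need a closer look.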

— 
Eric Morgan
