Hi all,

Since implementing Evergreen, the consortia I work with have added several custom indexes, mostly for the keyword class, to address the following issues:

* Add weights for particular indexes (title, subject, author) for a keyword search.
* Add MARC fields that are not covered by MODS or are excluded from the default keyword index (aka the blob).
* Add indexes for 880 fields.

Although we love that the system is flexible enough to allow these adjustments, we've come across a few problems that have us rethinking the way we set up our indexes. I wanted to send along some of the problems we've encountered and the solutions we're considering to see if you all have feedback on our potential solutions, if you see drawbacks we haven't considered, or if you have alternative approaches that might work.

Some of the issues we've encountered:

* When adding an index for a MARC tag that is not covered by the keyword blob, we run into the problem where these indexes are not "combined." For example, NOBLE added a local field (962) for captions of digital images to its records and then added this tag to an entry for config.metabib_field. If you do a keyword search for "ship george west," you will successfully find this record - http://evergreen.noblenet.org/eg/opac/record/1622058?query=ship%20george%20west;qtype=keyword;locg=1 - based on the contents of the 962 field, but if you do a keyword search for "ship george west marblehead," you will get no results because marblehead is in a different entry of the metabib.keyword_field_entry table than the other keywords are.

For a brief period in 2.4, a new combined indexes feature was turned on by default to address this problem, but it was subsequently turned off due to the bug reported at https://bugs.launchpad.net/evergreen/+bug/1169693. Since we use additional indexes for the keyword class for the purpose of weighting, we don't want to turn combined indexes on because we're concerned about how doing so would affect relevance ranking.

* When adding an index for a MARC tag that is not covered by the keyword blob, we also run into the problem of unintentionally adding weight to that index even when we don't want to give it additional weight. As an example, two of our Evergreen sites added the 260b tag to the keyword index so that the publisher could be searchable. We then had an issue where a keyword search for "free spirit" pulled up lots of results for books by a publisher called Free Spirit ahead of the book that was being sought (Free Spirit: Growing Up On the Road and Off the Grid). http://bark.cwmars.org/eg/opac/results?query=free+spirit&qtype=title&fg%3Aformat_filters=&locg=1&sort= . The weight for the publisher field is 1, and the weight for the title field is 10. However, the publisher records still come up sooner because they score so highly when it comes to coverage density.

If there were an easy way to include the publisher field in the blob, we could have avoided this problem. However, there didn't seem to be an easy way to include it while continuing to exclude the rest of the fields that are contained within the MODS originInfo element.

* We're also aware that, since we have to add an index every time we need to add a MARC tag, we're adding more entries to the metabib.keyword_field_entry table. In addition, one of our sites has added 880 indexes using the method I described here - http://markmail.org/message/dlzvcxezaycychd4 , which indexes all records, not just those with an 880 field. We wonder if the addition of all these entries ultimately impacts performance. We are also looking at ways of indexing just those records with the 880 field when using the marc21expand880 format.

* With the addition of our custom indexes, one question that frequently arises is whether we're hurting overall search performance by adding indexes that result in more entries in our metabib.keyword_field_entry tables. Since I work with three different Evergreen sites, we're able to do some comparison. The site that has added the fewest (if any) custom indexes does indeed seem to have faster keyword searches.

If a site is only using the default keyword blob and no other keyword indexes, the number of entries in the metabib.keyword_field_entry table should be the same as the number of bib records in the system. However, when looking at the sites that have added more custom indexes, we see as many as 10 times as many entries in the metabib.keyword_field_entry table as there are bib records in the database. At what point does the number of metabib entries begin to seriously impact search performance?
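To make the "not combined" issue and the entry-count question above a bit more concrete, here is a toy Python model. It is only a sketch and not Evergreen code: the dictionary stands in for rows in metabib.keyword_field_entry, the entry text is made up, and the matching is plain whole-word comparison with no stemming or ranking. The function names are hypothetical.

```python
# Toy model only: NOT Evergreen code and not its real search logic.
# Each record maps to a list of keyword entries, one per configured index
# (hypothetical stand-ins for rows in metabib.keyword_field_entry).
records = {
    1622058: [
        "ships marblehead massachusetts photographs",  # default keyword blob (made-up text)
        "ship george west",                            # custom 962 caption index (made-up text)
    ],
}


def entries_per_record(records):
    """Average number of keyword entries per bib record."""
    total = sum(len(entries) for entries in records.values())
    return total / len(records)


def search(records, query, combined):
    """Return IDs of records where every query term is found in one document.

    combined=False: each entry is searched on its own, so all terms must
    occur within a single entry.
    combined=True: a record's entries are joined into one document first.
    """
    terms = query.lower().split()
    hits = []
    for rec_id, entries in records.items():
        docs = [" ".join(entries)] if combined else entries
        if any(all(term in doc.split() for term in terms) for doc in docs):
            hits.append(rec_id)
    return hits
```

In this model, search(records, "ship george west", combined=False) finds the record, search(records, "ship george west marblehead", combined=False) finds nothing because "marblehead" sits in a different entry, and the same query with combined=True finds the record again. entries_per_record(records) gives the entries-to-bibs ratio discussed above (2.0 for this one made-up record).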

One of our sites is considering a different approach to configuring its search indexes, and I wanted to put our ideas out here to see if you have feedback on it.

Instead of using the keyword blob based on MODS, they are experimenting with setting up all of their keyword indexes to be based on MARC tags. We've talked about two ways of making this happen:

* Creating another keyword blob, similar to the current default index, that includes all of the MARC tags and subfields that we want included in the keyword index. Under this approach, we could easily add tags like the 260b or the 962 without creating an additional config.metabib_field entry. Those tags would just be incorporated in the main blob and, therefore, wouldn't be adding unintended weight to keyword searches or running up against the combined indexes problem. Under this scenario, there would still be cases where we would need to configure additional keyword indexes whenever we wanted to apply additional weight to a specific MARC field.

* The other approach is to create multiple config.metabib_field entries for individual MARC tags, or perhaps for groups of tags. We would then turn the combined indexes on for keyword indexes. Based on Mike's comments at https://bugs.launchpad.net/evergreen/+bug/1169693/comments/2, since we no longer would have the large keyword blob, turning combined indexes on shouldn't lead to the relevance-ranking problems that arose when 2.4 was first released. We could then adjust the weight for those entries that include the more relevant MARC tags. However, I do have concerns that we'll continue to come across problems like the "Free Spirit" example I mentioned above where an exact match on a 260b tag will continue to float above other more relevant results.
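The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Evergreen builds its indexes: the record is a simplified list of (tag, subfield, value) triples, the publisher and caption values are made up, and build_blob and build_entries are hypothetical names.

```python
# Hypothetical, simplified record: (tag, subfield code, value) triples.
# The publisher and caption values are invented for illustration.
record = [
    ("245", "a", "Free spirit :"),
    ("245", "b", "growing up on the road and off the grid"),
    ("260", "a", "New York :"),
    ("260", "b", "Example Press"),
    ("962", "a", "Sample caption text"),
]


def build_blob(record, wanted):
    """Approach 1: one keyword entry holding every wanted tag/subfield."""
    return " ".join(value for tag, code, value in record if (tag, code) in wanted)


def build_entries(record, groups):
    """Approach 2: one keyword entry per named group of tag/subfield pairs."""
    entries = {}
    for name, wanted in groups.items():
        text = build_blob(record, wanted)
        if text:  # skip groups with no matching subfields in this record
            entries[name] = text
    return entries


# Approach 1: adding 260b or 962a is just another pair in the set.
blob_fields = {("245", "a"), ("245", "b"), ("260", "b"), ("962", "a")}
blob = build_blob(record, blob_fields)

# Approach 2: each group becomes its own entry and could carry its own weight.
groups = {
    "title": {("245", "a"), ("245", "b")},
    "publisher": {("260", "b")},
    "caption": {("962", "a")},
}
entries = build_entries(record, groups)
```

With the first approach this record produces a single entry that includes the publisher and caption but not 260a. With the second it produces three entries, which is where the combined indexes setting and per-entry weights would come into play.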

While this site continues looking at the indexes, can you tell me if there is anything we would be losing by going from indexes based on MODS to ones based purely on MARC tags? What were the original reasons for basing the default indexes on MODS?

Also, are there any pros or cons to implementing either of the approaches I mentioned above? Do you see other ways to address some of the issues we've come across in creating custom indexes?

Thanks in advance for your insight!

Kathy


--
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 343-0128
[email protected]
Twitter:http://www.twitter.com/kmlussier
#evergreen IRC: kmlussier
