[OPEN-ILS-DEV] Questions about keyword search indexing

Kathy Lussier Thu, 29 Jan 2015 08:06:58 -0800

Hi all,

Since implementing Evergreen, the consortia I work with have added several custom indexes, mostly for the keyword class, to address the following issues:

* Add weights for particular indexes (title, subject, author) for a keyword search.

* Add MARC fields that are not covered by MODS or are excluded from the default keyword index (aka the blob).

* Add indexes for 880 fields.

Although we love the way the system is flexible enough to allow us to make these adjustments, we've come across a few problems that have us re-thinking the way we set up our indexes. I wanted to send along some of the problems we've encountered and the solutions we're considering to see if you all have feedback on our potential solutions, if you see drawbacks we hadn't considered, or if you have alternative approaches that might work.


Some of the issues we've encountered:

* When adding an index to add a MARC tag that is not covered by the keyword blob, we run into the problem where these indexes are not "combined." For example, NOBLE added a local field (962) for captions of digital images to its records and then added this tag to an entry for config.metabib_field. If you do a keyword search for "ship george west," you will successfully find this record - http://evergreen.noblenet.org/eg/opac/record/1622058?query=ship%20george%20west;qtype=keyword;locg=1 - based on the text in the 962 field, but if you do a keyword search for "ship george west marblehead," you will get no results because marblehead is in a different entry of the metabib.keyword_field_entry table than the other keywords are.

For a brief period of 2.4, a new combined indexes feature was turned on by default to address this problem, but it was subsequently turned off due to the bug reported at https://bugs.launchpad.net/evergreen/+bug/1169693. Since we use additional indexes for the keyword class for the purpose of weighting, we don't want to turn combined indexes on because we're concerned about how it would affect relevance ranking.

* When adding an index to add a MARC tag that is not covered by the keyword blob, we also run into a problem of unintentionally adding weight to that index even when we don't want to give it additional weight. As an example, two of our Evergreen sites added the 260b tag to the keyword index so that the publisher could be searchable. We then had an issue where a keyword search for "free spirit" pulled up lots of results for books by a publisher called Free Spirit ahead of the book that was being sought (Free Spirit: Growing Up On the Road and Off the Grid). See http://bark.cwmars.org/eg/opac/results?query=free+spirit&qtype=title&fg%3Aformat_filters=&locg=1&sort= for an example. The weight for the publisher field is 1, and the weight for the title field is 10. However, the publisher records come up sooner because they score so highly when it comes to coverage density.

If there were an easier way to include the publisher field in the blob, then we could have avoided this problem. However, there didn't seem to be an easy way to include it while continuing to exclude the rest of the fields that are contained within the MODS originInfo element.

* We're also aware of the fact that, since we have to add an index every time we need to add a MARC tag, we're adding more entries to the metabib.keyword_field_entry table. In addition, one of our sites has added 880 indexes using the method I described here - http://markmail.org/message/dlzvcxezaycychd4 - which indexes all records, not just those with an 880 field. We wonder if the addition of all these entries ultimately impacts performance. We are also looking at ways of indexing just those records with the 880 field when using the marc21expand880 format.

* With the addition of our custom indexes, one question that frequently arises is if we're hurting overall search performance by adding indexes that result in more entries for our metabib.keyword_field_entry tables. Since I work with three different Evergreen sites, we're able to do some comparison. The site that has added the fewest (if any) custom indexes does indeed seem to have faster keyword searches.

If a site is only using the default keyword blob and no other keyword indexes, the number of entries in the metabib.keyword_field_entry table should be the same as the number of bib records in the system. However, when looking at the sites that have added more custom indexes, we see as many as 10 times as many entries in the metabib.keyword_field_entry table as there are bib records in the database. At what point does the number of metabib entries begin to seriously impact search performance?
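For anyone who wants to make the same comparison at their own site, here is a minimal sketch of the entries-per-record calculation described above. It runs against an in-memory SQLite database with a toy schema: the table names `bib_record` and `keyword_field_entry`, their columns, and the sample rows are simplified stand-ins invented for illustration, not the actual Evergreen schema, so the queries would need to be adapted before being pointed at a real database.

```python
import sqlite3

# Toy schema: table and column names are simplified assumptions,
# not the real Evergreen tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bib_record (id INTEGER PRIMARY KEY);
    CREATE TABLE keyword_field_entry (
        id INTEGER PRIMARY KEY,
        source INTEGER REFERENCES bib_record(id),  -- bib record the entry came from
        field TEXT                                 -- index definition that produced it
    );
""")

# Made-up sample data: three records, with custom indexes on some of them.
conn.executemany("INSERT INTO bib_record (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO keyword_field_entry (source, field) VALUES (?, ?)",
    [
        (1, "blob"), (1, "title"), (1, "publisher"),
        (2, "blob"), (2, "title"),
        (3, "blob"),
    ],
)


def entries_per_record(conn):
    """Return (entry_count, record_count, entries_per_record)."""
    (entries,) = conn.execute("SELECT count(*) FROM keyword_field_entry").fetchone()
    (records,) = conn.execute("SELECT count(*) FROM bib_record").fetchone()
    return entries, records, entries / records


def entries_by_field(conn):
    """Return a dict mapping each index definition to its number of entries."""
    rows = conn.execute(
        "SELECT field, count(*) FROM keyword_field_entry "
        "GROUP BY field ORDER BY count(*) DESC, field"
    ).fetchall()
    return dict(rows)


print(entries_per_record(conn))
print(entries_by_field(conn))
```

A ratio of 1.0 would correspond to the blob-only case; the per-field breakdown shows which index definitions account for the extra entries.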

One of our sites is considering a different approach to configuring its search indexes, and I wanted to put our ideas out here to see if you have feedback on it.

Instead of using the keyword blob based on MODS, they are experimenting with setting up all of their keyword indexes to be based on MARC tags. We've talked about two ways of making this happen:

* Creating another keyword blob, similar to the current default index, that includes all of the MARC tags and subfields that we want included in the keyword index. Under this approach, we could easily add tags like the 260b or the 962 without creating an additional config.metabib_field entry. Those tags would just be incorporated in the main blob and, therefore, wouldn't be adding unintended weight to keyword searches or running up against the combined indexes problem. Under this scenario, there would still be cases where we would need to configure additional keyword indexes whenever we wanted to apply additional weight to a specific MARC field.

* The other approach is to create multiple config.metabib_field entries for individual MARC tags, or perhaps for groups of tags. We would then turn the combined indexes on for keyword indexes. Based on Mike's comments at https://bugs.launchpad.net/evergreen/+bug/1169693/comments/2, since we no longer would have the large keyword blob, turning combined indexes on shouldn't lead to the relevance-ranking problems that arose when 2.4 was first released. We could then adjust the weight for those entries that include the more relevant MARC tags. However, I do have concerns that we'll continue to come across problems like the "Free Spirit" example I mentioned above where an exact match on a 260b tag will continue to float above other more relevant results.
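To illustrate how the two approaches relate to the "ship george west marblehead" problem from earlier in this message, here is a toy sketch in Python. It is not Evergreen's search code and it ignores weighting and ranking entirely; it only models whether all query terms must be found inside a single index entry or may be spread across entries. The record text, the entry names (`keyword_blob`, `caption_962`), and the helper functions are all hypothetical.

```python
def tokens(text):
    """Lowercase and split on whitespace; a crude stand-in for real normalization."""
    return set(text.lower().split())


def matches_per_entry(entries, query):
    """AND search where every term must occur within one single index entry."""
    wanted = tokens(query)
    return any(wanted <= tokens(value) for value in entries.values())


def matches_combined(entries, query):
    """AND search against the union of all of the record's index entries."""
    wanted = tokens(query)
    combined = set().union(*(tokens(value) for value in entries.values()))
    return wanted <= combined


# Hypothetical record: the blob mentions Marblehead, the separate 962 index
# holds the image caption. The text is made up for illustration.
separate_entries = {
    "keyword_blob": "marblehead harbor photograph collection",
    "caption_962": "ship george west at anchor",
}

# First approach: fold the 962 text into one MARC-based blob entry.
single_blob = {"keyword_blob": " ".join(separate_entries.values())}

query = "ship george west marblehead"
print(matches_per_entry(separate_entries, "ship george west"))  # terms all in the 962 entry
print(matches_per_entry(separate_entries, query))               # terms split across entries
print(matches_per_entry(single_blob, query))                    # first approach
print(matches_combined(separate_entries, query))                # second approach
```

In this simplified model both approaches make the four-word search succeed; what the sketch cannot show is the difference in relevance ranking, which is the open question for the second approach.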

While this site continues looking at the indexes, can you tell me if there is anything we would be losing by going from indexes based on MODS to ones based purely on MARC tags? What were the original reasons for basing the default indexes on MODS?

Also, are there any pros or cons to implementing either of the approaches I mentioned above? Do you see other ways to address some of the issues we've come across in creating custom indexes?


Thanks in advance for your insight!

Kathy


--
Kathy Lussier
Project Coordinator
Massachusetts Library Network Cooperative
(508) 343-0128
[email protected]
Twitter: http://www.twitter.com/kmlussier
#evergreen IRC: kmlussier
