Josh, btrim would be the natural normalizer to use, so let's test the timing to see if it's faster...

evergreen=# explain analyze select count(btrim(value)) from metabib.real_full_rec ;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=542.861..542.862 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..221.404 rows=228822 loops=1)
 Planning time: 0.080 ms
 Execution time: 542.899 ms
(4 rows)

evergreen=# explain analyze select count(regexp_replace(value,' *$','','')) from metabib.real_full_rec ;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=6129.66..6129.67 rows=1 width=11) (actual time=818.893..818.894 rows=1 loops=1)
   ->  Seq Scan on real_full_rec  (cost=0.00..4989.77 rows=227977 width=11) (actual time=0.010..230.265 rows=228822 loops=1)
 Planning time: 0.079 ms
 Execution time: 818.931 ms
(4 rows)

btrim is almost 50% faster! I didn't expect that, actually. So I'd recommend using btrim instead. Your future self will thank you on your next full reingest.

HTH,

--
Mike Rylander
 | President
 | Equinox Open Library Initiative
 | phone: 1-877-OPEN-ILS (673-6457)
 | email: mi...@equinoxinitiative.org
 | web: http://equinoxinitiative.org

On Thu, Mar 2, 2017 at 9:29 AM, Josh Stompro <stomp...@exchange.larl.org> wrote:

> Jason, this alone seems to leave trailing spaces in the facet entry table,
> since the space before the semicolon is left, which is required for the
> series index to not concatenate the last word of one 490 with the first
> word of the next 490.
>
> I tried adding a second normalizer that just strips trailing spaces and
> that seems to take care of it.
>
> insert into config.metabib_field_index_norm_map (field,norm,params,pos)
> values (1,18,'[" *$","",""]',-1);
>
> -- Change the first normalizer position to -2.
>
> There is also the btrim normalizer; I don’t know if that would be
> better/faster than using another regexp_replace.
>
> Josh Stompro - LARL IT Director
>
> *From:* Open-ils-general [mailto:open-ils-general-boun...@list.georgialibraries.org] *On Behalf Of* Boyer, Jason A
> *Sent:* Wednesday, March 01, 2017 10:22 AM
> *To:* Evergreen Discussion Group
> *Subject:* Re: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed
>
> Thanks for figuring this out, Josh. I was able to modify our normalizer
> like so to continue removing the $v:
>
> BEGIN;
>
> UPDATE config.index_normalizer SET param_count = 3 WHERE id IN (SELECT id
> FROM config.index_normalizer WHERE func = 'regexp_replace');
>
> UPDATE config.metabib_field_index_norm_map SET params = '["; *[0-9]*","","g"]'
> WHERE field = 1 AND norm IN (SELECT id FROM config.index_normalizer
> WHERE func = 'regexp_replace');
>
> COMMIT;
>
> If you have more than 1 normalizer that uses regexp_replace or are using
> it on more than one field you won't want to use this as-is, but if you only
> have the 1 and are currently only using it on your series titles it's good
> to go.
>
> Jason
>
> --
> Jason Boyer
> MIS Supervisor
> Indiana State Library
> http://library.in.gov/
>
> *From:* Open-ils-general [mailto:open-ils-general-boun...@list.georgialibraries.org] *On Behalf Of* Josh Stompro
> *Sent:* Wednesday, March 01, 2017 10:41 AM
> *To:* Evergreen Discussion Group <open-ils-general@list.georgialibraries.org>
> *Subject:* Re: [OPEN-ILS-GENERAL] Series index, only first entry getting indexed
>
> Removing the regex replace normalizer did take care of it, sorry I didn’t
> try that before posting.
> I think my regex will have to be more selective, only getting rid of the
> number and the ‘;’ so it doesn’t clear out too much data.
>
> Josh Stompro - LARL IT Director
>
> *From:* Open-ils-general [mailto:open-ils-general-boun...@list.georgialibraries.org] *On Behalf Of* Josh Stompro
> *Sent:* Wednesday, March 01, 2017 9:19 AM
> *To:* open-ils-general@list.georgialibraries.org
> *Subject:* [OPEN-ILS-GENERAL] Series index, only first entry getting indexed
>
> Hello, we have noticed that only the first 490 gets indexed for our series
> search index. But all 490s get added to the series facet entry.
>
> For example, here is a title with two 490s in mods32 format.
>
> https://egcatalog.larl.org/opac/extras/unapi?id=tag::U2@bre/237592&format=mods32
>
> The second 490 of “Felicity classic” isn’t searchable.
>
> When I look at metabib.combined_series_field_entry I see the
> following for this record.
>
>  record | metabib_field | index_vector
> --------+---------------+------------------------------------------------------------
>  237592 |               | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'
>  237592 |             1 | 'american' 'beforev' 'beforever' 'felic' 'felicity' 'girl'
>
> metabib.series_field_entry
>
>  id     | source | field | value                            | index_vector
> --------+--------+-------+----------------------------------+------------------------------------------------------------------------------------
>  430451 | 237592 |     1 | American Girl Beforever Felicity | 'american':1A,5C 'beforev':7C 'beforever':3A 'felic':8C 'felicity':4A 'girl':2A,6C
>
> metabib.facet_entry
>
>  value                            | count | bibid
> ----------------------------------+-------+--------
>  American Girl Beforever Felicity |     1 | 237592
>  Felicity classic                 |     1 | 237592
>
> The one thing that I have done is to add a search normalizer to get rid of
> the series numbering from the facet entry. Unfortunately I don’t remember
> if this issue came up before I added the normalizer.
> Maybe when used on the index version the regex replace is actually acting
> on all the 490 info concatenated together, so by getting rid of everything
> after the first ‘ ;’ I’m clearing the second 490 entry data? But it does
> work correctly on the facet data?
>
> There is a note on
> https://wiki.evergreen-ils.org/doku.php?id=documentation:indexing#field_normalization_settings
>
> “*Note:* Only normalizations with a negative *pos* value are applied to
> the facet version of indexed terms!” But that must not mean that the
> normalizer only acts on the facet when there is a negative pos value?
>
> This is going to be wide, but here is our normalizer setup and our series
> metabib field info.
>
> -[ RECORD 1 ]-----+--------------------------------------------------
> id                | 51
> field             | 32
> norm              | 2
> params            |
> pos               | 0
> id                | 32
> field_class       | series
> name              | browse
> label             | Series Title (Browse)
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[@type="nfi"]
> weight            | 1
> format            | mods32
> search_field      | false
> facet_field       | false
> browse_field      | true
> browse_xpath      | *[local-name() != "nonSort"]
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 2
> name              | Normalize date range
> description       | Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper index.
> func              | split_date_range
> param_count       | 0
> -[ RECORD 2 ]-----+--------------------------------------------------
> id                | 1
> field             | 1
> norm              | 2
> params            |
> pos               | 0
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 2
> name              | Normalize date range
> description       | Split date ranges in the form of "XXXX-YYYY" into "XXXX YYYY" for proper index.
> func              | split_date_range
> param_count       | 0
> -[ RECORD 3 ]-----+--------------------------------------------------
> id                | 62
> field             | 1
> norm              | 13
> params            | ["[",""]
> pos               | -1
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 13
> name              | Replace
> description       | Replace all occurences of first parameter in the string with the second parameter.
> func              | replace
> param_count       | 2
> -[ RECORD 4 ]-----+--------------------------------------------------
> id                | 61
> field             | 1
> norm              | 13
> params            | ["]",""]
> pos               | -1
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 13
> name              | Replace
> description       | Replace all occurences of first parameter in the string with the second parameter.
> func              | replace
> param_count       | 2
> -[ RECORD 5 ]-----+--------------------------------------------------
> id                | 52
> field             | 32
> norm              | 17
> params            |
> pos               | 0
> id                | 32
> field_class       | series
> name              | browse
> label             | Series Title (Browse)
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[@type="nfi"]
> weight            | 1
> format            | mods32
> search_field      | false
> facet_field       | false
> browse_field      | true
> browse_xpath      | *[local-name() != "nonSort"]
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 17
> name              | Search Normalize
> description       | Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.
> func              | search_normalize
> param_count       | 0
> -[ RECORD 6 ]-----+--------------------------------------------------
> id                | 2
> field             | 1
> norm              | 17
> params            |
> pos               | 0
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 17
> name              | Search Normalize
> description       | Apply search normalization rules to the extracted text. A less extreme version of NACO normalization.
> func              | search_normalize
> param_count       | 0
> -[ RECORD 7 ]-----+--------------------------------------------------
> id                | 64
> field             | 1
> norm              | 18
> params            | [" *;.*",""]
> pos               | -1
> id                | 1
> field_class       | series
> name              | seriestitle
> label             | Series Title
> xpath             | //mods32:mods/mods32:relatedItem[@type="series"]/mods32:titleInfo[not(@type="nfi")]
> weight            | 1
> format            | mods32
> search_field      | true
> facet_field       | true
> browse_field      | false
> browse_xpath      |
> browse_sort_xpath |
> facet_xpath       |
> authority_xpath   | //@xlink:href
> joiner            |
> restrict          | false
> id                | 18
> name              | Replace by regular expression
> description       |
> func              | regexp_replace
> param_count       | 2
>
> Thanks for any ideas you might have.
>
> Josh
>
> Lake Agassiz Regional Library - Moorhead MN larl.org
> Josh Stompro | Office 218.233.3757 EXT-139
> LARL IT Director | Cell 218.790.2110
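
For anyone wanting to try the patterns from this thread outside of a reingest, here is a minimal sketch that runs all three against one string in psql. The sample value and the column aliases are hypothetical. In particular, it assumes the two 490 values reach the normalizer joined by a single space, which is Josh's guess in the original message and is not confirmed anywhere in the thread.

```sql
-- Hypothetical sample: two series statements joined by a space.
-- This is an assumption for illustration, not data from record 237592.
WITH sample(joined_value) AS (
    VALUES ('Series one ; 5 Series two ; 2')
)
SELECT
    -- Original params [" *;.*",""]: no 'g' flag, but ".*" is greedy, so
    -- everything from the first " ;" to the end of the string is removed.
    regexp_replace(joined_value, ' *;.*', '') AS original_pattern,

    -- Jason's params ["; *[0-9]*","","g"]: removes only each semicolon and
    -- the digits after it; the space before each semicolon is left behind.
    regexp_replace(joined_value, '; *[0-9]*', '', 'g') AS selective_pattern,

    -- Same pattern followed by btrim, as Mike suggests, to drop the
    -- leftover trailing space.
    btrim(regexp_replace(joined_value, '; *[0-9]*', '', 'g')) AS selective_then_btrim
FROM sample;
```

Comparing the first and second columns shows why the original pattern could hide a second 490 if the values really are concatenated before normalization. Note that btrim with one argument strips spaces from both ends of the string, while the ' *$' pattern only touches the end.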