I actually created a backup in the staging schema did a test update on that data first before doing it on the production data using the select statement based on the bib id I had in the staging table.
On Wed, Apr 3, 2013 at 5:48 PM, Liam Whalen <[email protected]> wrote: > I would still use SELECTs to see what you are replacing. Subfield u is > repeatable, so if you have a 856 field with two URLs, one with your HCC > link, and the other with a a different link but an HCC short name, that > regex will replace that as well. > > That is most likely an unusual situation, but if you are not really > familiar with this data there is no way to know. I guess there is only so > much that can be done with bad data, but I always feel better knowing that > I'm not getting rid of something I am not anticipating deleting. > > Liam > > On 2013-04-03, at 9:13 AM, Tim Spindler <[email protected]> wrote: > > Thanks fro all the replies. I did get the following regex to work > > regexp_replace(marc,E'<datafield tag="856" ind1="4" ind2="0"><subfield > code="u"> > http://ezhc.ez.cwmars.org:4000/login.url=http://digital.films.com/PortalPlaylists.aspx.aid=9206.+?HCC > </subfield></datafield>','','g') > > The url is unique to Films on Demand for HCC so that seems to work well > to identify the beginning part of the 856 and then the HCC part identifies > the end. > > > > > On Wed, Apr 3, 2013 at 6:51 AM, Liam Whalen <[email protected]> wrote: > >> Sorry for the chain replying to my own emails. But, that code is not >> exactly right. using matches between ^ $ will not get you what you want >> exactly, that will only return the first occurrence of 856 replaced in a >> record. It has been over a year since I've done stuff like this, so I'm >> sorry if the code is not very usable. However, the code wasn't really my >> main point. What I've learned using regex to replace data is I expect >> regex to get rid of useful data because there is no way to predict the >> outliers in the data. Almost definitely, there will be some 856s fields >> that do not conform to typical pattern in your data, and regex may remove >> them, in which case you will lose useful data. >> >> Liam >> On 2013-04-03, at 6:35 AM, Liam Whalen <[email protected]> wrote: >> >> I forgot to mention checking the data that was replaced: >> >> SELECT id, replaced FROM >> (SELECT id, >> regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield >> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g') >> AS replaced FROM biblio.record_entry WHERE marc ~ >> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>') >> WHERE length(replaced) > 0 >> >> That should give you all the fields that were replaced and the ids for >> their records. You can then use further selects to make sure you are only >> replacing what you think you are replacing. For instance, you can look for >> data that was coded improperly. Perhaps there is a link with an HCC in 9 >> that isn't really an HCC link. >> >> I'm not sure what determines if it is HCC, maybe its the port or the >> aid=#? >> >> Something like >> >> SELECT id, regexp_replace(replaced, E'^(?:(?!<subfield >> code="u">).)*([^<]*)</subfield>','\1','g') AS url FROM >> (SELECT id, replaced FROM >> (SELECT id, >> regexp_replace(marc,E'^(?:(?!<datafield[^>]+?tag="856").)*(<datafield[^>]+?tag="856".+?<subfield >> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>').*$,'\1','g') >> AS replaced FROM biblio.record_entry WHERE marc ~ >> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>') >> WHERE length(replaced) > 0)) >> >> WHERE url !~ '.*aid=9206.*' >> >> That should give you urls you are replacing that do not have aid=9206. >> I'm not sure what data you need to check for, but the idea is to not >> replace data that has been entered incorrectly. While it might have HCC >> in subfield 9, that could be there incorrectly, in which case you will be >> eliminating a useful URL that has been coded incorrectly. There are other >> things to check for. I'm not sure if there are limitations on the data's >> use of subfield 9. However it might be repeated in a url indicating that a >> link belongs to more than one shortname. In which case the second subfield >> would get eaten by the negative lookahead, and the 856 tag would be removed >> when it may still be valid for another shortname. >> >> Liam >> >> >> On 2013-04-02, at 11:32 PM, Liam Whalen <[email protected]> wrote: >> >> For the first SELECT you can use something like SELECT >> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield >> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') FROM >> biblio.record_entry WHERE marc ~ >> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>' >> >> That should give you your datafields without HCC. If you can search >> through the results for HCC and find any examples of HCC left, then you can >> improve the regex to catch them. Or opt for another non-regex route. If >> there are no HCC's left, then check to make sure you aren't missing some >> due to something else in the WHERE regex. >> >> SELECT no_hcc FROM >> (SELECT regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield >> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') AS >> no_hcc FROM biblio.record_entry WHERE marc ~ >> '<datafield[^>]+tag="856"(?:(?!</datafield>).)*</datafield>' ) >> WHERE no_hcc ~* <subfield code="9">HCC</subfield>' >> >> Its been a while since I've written that type SQL, and I don't have an >> Evergreen database to test it on. So, that may not work exactly as >> written, but that is the general idea on how to double check mass db >> updates. The idea with the last WHERE clause is to look for the least >> amount of data that could catch the HCC subfield. If you are willing to >> sort through some false positives, it might be worthwhile changing the >> where to '>HCC<' or even 'HCC', just to make sure you are getting >> everything. The idea being to look for malformed MARCXML. As well, I put >> a case insensitive regex search (~*) in the last WHERE clause. If you are >> sure all the HCCs are upper case you will not need that. >> >> Liam >> >> >> On 2013-04-02, at 10:20 PM, Liam Whalen <[email protected]> wrote: >> >> You could also use negative lookahead to capture the 856 fields where >> subfield 9 is not at the end. >> >> regexp_replace(marc,E'<datafield[^>]+?tag="856".+?<subfield >> code="9">HCC</subfield>(?:(?!</datafield>).)*</datafield>','','g') >> >> I did not test this much either, but it does pick up only your HCC fields >> in the data you provided below. That being said, I was always surprised >> when using regex on large sets of data. A good way to test this is to run >> it as a SELECT first to try and pull records with that regex, then grep >> through them to make sure you are only getting rid of the data you want. >> Then for completeness, do another select for 856 fields with a WHERE >> clause using your regex to exclude the items you just verified for >> completeness, and grep through those results to make sure you are not >> missing any 9 = HCC fields. >> >> Liam >> >> >> On 2013-04-02, at 10:14 AM, Mike Rylander <[email protected]> wrote: >> >> On Tue, Apr 2, 2013 at 9:43 AM, Tim Spindler <[email protected]>wrote: >> >>> This is the regex I was trying but in testing it wasn't working. >>> >>> regexp_replace(marc,E'<datafield[^>]*?tag="856"+?<subfield >>> code="9">HCC</subfield></datafield>','','g') >>> >>> >> Perhaps something like: regexp_replace(marc,E'<datafield[^>]+?tag="856".+ >> ?<subfield code="9">HCC</subfield></datafield>','','g') >> >> As long as subfield 9 is at the end of all the datafields, that should >> work. It's not heavily tested. >> >> --miker >> >> >>> On Tue, Apr 2, 2013 at 9:41 AM, Tim Spindler <[email protected]>wrote: >>> >>>> Mike, >>>> >>>> For the batch update, it is about 6000 records with bucket, I didn't >>>> think Evergreen could handle that many in a bucket. >>>> >>>> Tim >>>> >>>> >>>> On Tue, Apr 2, 2013 at 9:20 AM, Mike Rylander <[email protected]>wrote: >>>> >>>>> If your removal code is regexp-based, you'll just need to match on HCC >>>>> in the middle of the "find for removal" regexp. Alternatively, you could >>>>> apply the batch MARC update function to a bucket or CSV file full of the >>>>> records. That has the ability to modify records conditionally, based on >>>>> subfield content within a datafield. >>>>> >>>>> --miker >>>>> >>>>> >>>>> >>>>> On Tue, Apr 2, 2013 at 9:12 AM, Tim Spindler <[email protected]>wrote: >>>>> >>>>>> We have a library that wants to reload its Films on Demand records. >>>>>> These records have multiple 856s for other institutions with the >>>>>> subfield 9 >>>>>> associated with their shortname. What would the best way be to remove >>>>>> these for instance. I have created a list of the records needing >>>>>> updating >>>>>> in the staging table. Here is an example of the of he 856. I have code >>>>>> that can remove all the 856s but am having trouble removing just one. We >>>>>> just want to remove the one for HCC (although I'll take suggests for >>>>>> removing the duplicate QCC lines also) >>>>>> >>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here >>>>>> to access this film online (MCLA students, faculty, and staff >>>>>> only)</subfield><subfield code="u"> >>>>>> http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&xtid=152<http://libproxy.mcla.edu:2048/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=11580&xtid=152></subfield><subfield >>>>>> code="9">MCLA</subfield></datafield> >>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u"> >>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield >>>>>> code="y">Click here to access this film online (HCC Students and Staff >>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield> >>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here >>>>>> to access this film online (QCC studetns, faculty, & staff >>>>>> only)</subfield><subfield code="u"> >>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield >>>>>> code="9">QCC</subfield></datafield> >>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="u"> >>>>>> http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152<http://ezhc.ez.cwmars.org:4000/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=9206&xtid=152></subfield><subfield >>>>>> code="y">Click here to access this film online (HCC Students and Staff >>>>>> only)</subfield><subfield code="9">HCC</subfield></datafield> >>>>>> <datafield tag="856" ind1="4" ind2="0"><subfield code="y">Click here >>>>>> to access this film online (QCC studetns, faculty, & staff >>>>>> only)</subfield><subfield code="u"> >>>>>> http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152<http://ezqc.ez.cwmars.org:5200/login?url=http://digital.films.com/PortalPlaylists.aspx?aid=19277&xtid=152></subfield><subfield >>>>>> code="9">QCC</subfield></datafield> >>>>>> >>>>>> >>>>>> -- >>>>>> Tim Spindler >>>>>> [email protected] >>>>>> >>>>>> *P** Go Green - **Save a tree! Please don't print this e-mail >>>>>> unless it's really necessary.* >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Mike Rylander >>>>> | Director of Research and Development >>>>> | Equinox Software, Inc. / Your Library's Guide to Open Source >>>>> | phone: 1-877-OPEN-ILS (673-6457) >>>>> | email: [email protected] >>>>> | web: http://www.esilibrary.com >>>>> >>>> >>>> >>>> >>>> -- >>>> Tim Spindler >>>> [email protected] >>>> >>>> *P** Go Green - **Save a tree! Please don't print this e-mail unless >>>> it's really necessary.* >>>> >>>> >>>> >>> >>> >>> >>> -- >>> Tim Spindler >>> [email protected] >>> >>> *P** Go Green - **Save a tree! Please don't print this e-mail unless >>> it's really necessary.* >>> >>> >>> >> >> >> >> -- >> Mike Rylander >> | Director of Research and Development >> | Equinox Software, Inc. / Your Library's Guide to Open Source >> | phone: 1-877-OPEN-ILS (673-6457) >> | email: [email protected] >> | web: http://www.esilibrary.com >> >> >> >> >> >> > > > -- > Tim Spindler > [email protected] > > *P** Go Green - **Save a tree! Please don't print this e-mail unless > it's really necessary.* > > > > -- Tim Spindler [email protected] *P** Go Green - **Save a tree! Please don't print this e-mail unless it's really necessary.*
