Hello,

Here's a query that get's me almost there - I'm having trouble on next steps to 
improve:

SELECT  id,
        regexp_replace(
                marc, 
                E'.+?<datafield tag="856".+?><subfield code="u">([^<]+).*$',
                E'\\1'
        ) as URL
FROM    biblio.record_entry 
WHERE   marc LIKE '%<datafield tag="856"%' and deleted = 'False' and id > '0' 
limit 200

Problem:

It appears that for records that have more than one URL, the URL column returns 
the full marcxml data value instead of URLs. I need one record id, URL row for 
each URL in the db (and of course not the normalized-for-indexing version 
obtainable from the relevant metabib table).

Eg.

8214622  http://ir.lib.sfu.ca/handle/1892/10598
8214443  <record xmlns:xsi="http://www.w3.org/2001/X..... [this record contains 
multiple 856s]
8213549  http://apps1.gdr.nrcan.gc.ca/mirage/mirage_list_e.php?id=247370        
 

Any thoughts on how to improve this query to support the proper extraction of 
id, raw URLs that includes proper extraction from multi-856 records too?

Thanks,
George

George Duimovich
NRCan Library / Bibliothèque de RNCan

Reply via email to