On Mon, Apr 2, 2012 at 6:38 PM, graham <[email protected]> wrote:
> Hi Peter
>
> I'm actually using pazpar2 from Indexdata to merge my records; this
> limits you in some ways (in particular it merges on-the-fly, which
> blocks the use of your option 2 - the initial record on which the merger
> is based is not guaranteed to be the most fully populated), but has a

Yes -- it's not guaranteed, and that's why pazpar2 has the merge option
'longest', which will select the richest metadata from all records on a
field-by-field basis. It usually produces better results. Not to mention
that in a federated search setting the 'initial' record cannot be
considered stable -- it may and will change from search to search.

Since you are using pazpar2 you know that, but I will just mention that,
especially for bibliographic records, it really comes down to looking at
the metadata fields and selecting the best-fitting merging algorithm:
e.g. for publication dates you may want to generate date ranges that
are representative of the whole merged result, or you may want to pick
up all unique author fields from among all records.

> lot of configuration options, which makes it quite flexible.
>
> So, you can choose:
>
> 1. Which fields need to be identical for a record to be merged at all. I
> was using author, title, edition when I first mailed the list, but have
> found allowing records with different publication dates to be merged
> just caused too many unpredictable problems, and have now added
> publication date to the list of required fields. The test for identical
> authors, dates etc is just a string comparison, so a proportion of
> records which ought to be merged by these criteria never are, due to
> typos, variant names, date formats, etc.
>
> 2. What to do with fields which differ between records which are being
> merged. You can choose either 'unique', which appends all unique field
> values (this is what I use for subject headings, so exactly repeated
> subject headings are dropped, but variants are kept), or 'longest',
> which picks the longest field value from all the candidates (this is
> what I use for abstracts).
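
To make the per-field strategies above concrete, here is a minimal Python
sketch of 'unique', 'longest' and a date-range merge. It is not pazpar2
code or configuration: the function names (merge_unique, merge_longest,
merge_range) and the sample records are made up for illustration, and
the comparison is the plain string match described above.

```python
def merge_unique(values):
    """Keep every distinct value in first-seen order (exact string match)."""
    seen = set()
    merged = []
    for value in values:
        if value not in seen:
            seen.add(value)
            merged.append(value)
    return merged


def merge_longest(values):
    """Pick the single longest value; on a tie the earliest one wins."""
    return max(values, key=len) if values else None


def merge_range(years):
    """Collapse years into 'min-max', or a single year if they all agree."""
    if not years:
        return None
    low, high = min(years), max(years)
    return str(low) if low == high else f"{low}-{high}"


# Hypothetical input: three records already judged to be the same item.
records = [
    {"subject": "Cataloging", "abstract": "Short note.", "year": 1997},
    {"subject": "Cataloging.", "abstract": "A longer abstract of the work.", "year": 2002},
    {"subject": "Cataloging", "abstract": "", "year": 1997},
]

merged = {
    "subject": merge_unique([r["subject"] for r in records]),
    "abstract": merge_longest([r["abstract"] for r in records]),
    "year": merge_range([r["year"] for r in records]),
}
print(merged)
```

Note that 'Cataloging' and 'Cataloging.' both survive the unique merge,
because an exact string comparison treats a punctuation variant as a
different value.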
>
> At the end of the process you have a merged record which has a 'head'
> with the merged record itself, but which contains each of the original
> records, so you could potentially do as you suggest and let users see
> any of the input records if they wanted. However, by default this isn't
> Marc but an internal format (a processed subset of the Marc input) so it
> may not be much use to most users.
>
> I'm finding the 'head' section is mostly quite usable but does often
> have individual fields with strange or repeated values (eg values
> identical apart from punctuation). So I'm doing some post-processing of
> my own on this, but it's very arbitrary at the moment.
>
> Graham
>
> On 03/30/12 01:09, Peter Noerr wrote:
>> Hi Graham,
>>
>> What we do in our federated search system, and have been doing for some few
>> years, is basically give the "designer" a choice of what options the user
>> gets for "de-duped" records.
>>
>> Firstly de-duping can be of a number of levels of sophistication, and
>> many of them lead to the situation you have - records which are "similar"
>> rather than identical. On the web search side of things there are a
>> surprising number of real duplicates (well maybe not surprising if you
>> study more than one page of web search engine results), and on Twitter the
>> duplicates well outnumber the original posts (many thanks 're-tweet').
>>
>> Where we get duplicate records the usual options are: 1) keep the first and
>> just drop all the rest. 2) keep the largest (assumed to have the most
>> information) and drop the rest. These work well for WSE results where they
>> are all almost identical (the differences often are just in the advertising
>> attached to the pages and the results), but not for bibliographic records.
>>
>> Less draconian is 3) Mark all the duplicates and keep them in the list (so
>> you get 1, 2, 3, 4, 5, 5.1, 5.2, 5.3, 6, ...).
>> This groups all the similar
>> records together under the sort key of the first one, and does enable the
>> user to easily skip them.
>>
>> More user friendly is 4) Mark all duplicates and hide them in a sub-list
>> attached to the "head" record. This gets them out of the main display, but
>> allows the user who is interested in that "record" to expand the list and
>> see the variants. This could be of use to you.
>>
>> After that we planned to do what you are proposing and actually merge record
>> content into a single virtual record, and worked on algorithms to do it. But
>> nobody was interested. All our partners (who provide systems to lots of
>> libraries, both public, academic, and special) decided that it would confuse
>> their users more than it would help. I have my doubts, but they spoke and we
>> put the development on ice.
>>
>> I'm not sure this will help, but it has stood the test of time, and is well
>> used in its various guises. Since no-one else seems interested in this
>> topic, you could email me off list and we could discuss what we worked
>> through in the way of algorithms, etc.
>>
>> Peter
>>
>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[email protected]] On Behalf Of graham
>>> Sent: Wednesday, March 28, 2012 8:05 AM
>>> To: [email protected]
>>> Subject: Re: [CODE4LIB] presenting merged records?
>>>
>>> Hi Michael
>>>
>>> On 03/27/12 11:50, Michael Hopwood wrote:
>>>> Hi Graham, do I know you from RHUL?
>>>>
>>> Yes indeed :-)
>>>
>>>> My thoughts on "merged records" would be:
>>>>
>>>> 1. don't do it - use separate IDs and just present links between related
>>>> manifestations; thus avoiding potential confusions.
>>>
>>> In my case, I can't avoid it as it's a specific requirement: I'm doing a
>>> federated search across a large number of libraries, and if closely
>>> similar items aren't merged, the results become excessively
>>> large and repetitive.
>>> I'm merging all the similar items, displaying a
>>> summary of the merged bibliographic data, and providing links to each
>>> of the libraries with a copy. So it's not really FRBRization in the
>>> normal sense, I just thought that FRBRization would lead to similar
>>> problems, so that there might be some well-known discussion of the
>>> issues around... The merger of the records does have advantages,
>>> especially if some libraries have very underpopulated records (eg
>>> subject fields).
>>>
>>> Cheers
>>> Graham
>>>
>>>>
>>>> http://www.bic.org.uk/files/pdfs/identification-digibook.pdf
>>>>
>>>> possible relationships - see
>>>> http://www.editeur.org/ONIX/book/codelists/current.html - lists 51
>>>> (manifestation) and 164 (work).
>>>>
>>>> 2. c.f. the way Amazon displays rough and ready categories (paperback,
>>>> hardback, audiobooks, *ahem* ebooks of some sort...)
>>>>
>>>> On dissection and reconstitution of records - there is a lot of talk going
>>>> on about RDFizing MaRC records and re-using in various ways, e.g.:
>>>>
>>>> http://www.slideshare.net/JenniferBowen/moving-library-metadata-toward-linked-data-opportunities-provided-by-the-extensible-catalog
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[email protected]] On Behalf Of graham
>>>> Sent: 27 March 2012 11:06
>>>> To: [email protected]
>>>> Subject: [CODE4LIB] presenting merged records?
>>>>
>>>> Hi
>>>>
>>>> There seems to be a general trend to presenting merged records to users,
>>>> as part of the move towards FRBRization. If records need merging this
>>>> generally means they weren't totally identical to start with, so you can
>>>> end up with conflicting bibliographic data to display.
>>>>
>>>> Two examples I've come across with this: Summon can merge
>>>> print/electronic versions of texts, so uses a new 'merged' material
>>>> type of 'book/ebook' (it doesn't yet seem to have all the other
>>>> possible permutations, eg book/audiobook). Pazpar2 (which I'm working
>>>> with at the moment) has a merge option for publication dates which
>>>> presents dates as a period eg 1997-2002.
>>>>
>>>> The problem is not with the underlying data (the original unmerged values
>>>> can still be there in the background) but how to present them to the
>>>> user in an intuitive way. With the date example, presenting dates in
>>>> this format sometimes throws people as it looks too much like the
>>>> author birth/death dates you might see with a record.
>>>>
>>>> I guess people must generally be starting to run into this kind of display
>>>> problem, so it has maybe been discussed to death on ... wherever it is
>>>> people talk about FRBRization. Any suggestions? Any mailing lists,
>>>> blogs etc anyone can recommend for me to look at?
>>>>
>>>> Thanks for any ideas
>>>> Graham

--
Cheers,
Jakub
