On Mon, Apr 2, 2012 at 6:38 PM, graham <[email protected]> wrote:
> Hi Peter
>
> I'm actually using pazpar2 from Indexdata to merge my records; this
> limits you in some ways (in particular it merges on-the-fly, which
> blocks the use of your option 2 - the initial record on which the merger
> is based is not guaranteed to be the most fully populated), but has a

Yes -- it's not guaranteed, and that's why pazpar2 has the merge option
'longest', which selects the richest metadata from all records on a
field-by-field basis. It usually produces better results. Not to
mention that in a federated search setting the 'initial' record cannot
be considered stable -- it can and will change from one search to the
next.
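
To illustrate the idea (a rough Python sketch only, not pazpar2's
actual code; the function name and the sample records are made up for
this example), a field-by-field 'longest' merge could look like this:

```python
def merge_longest(records, fields):
    """For each field, keep the longest non-empty value found in any record.

    Hypothetical helper for illustration; records are plain dicts.
    """
    merged = {}
    for field in fields:
        # Collect every non-empty candidate value for this field.
        candidates = [r[field] for r in records if r.get(field)]
        if candidates:
            merged[field] = max(candidates, key=len)
    return merged


# Made-up sample records, as they might arrive from three targets.
records = [
    {"title": "Moby Dick", "abstract": "A whale."},
    {"title": "Moby Dick; or, The Whale", "abstract": ""},
    {"title": "Moby Dick", "abstract": "A captain pursues a white whale."},
]

merged = merge_longest(records, ["title", "abstract"])
# The same merge with the records arriving in the opposite order.
merged_reversed = merge_longest(list(reversed(records)), ["title", "abstract"])
```

In this sample the result does not depend on which record happens to
arrive first, which is the point when the 'initial' record is unstable.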

Since you are using pazpar2 you know that, but I will just mention
that, especially for bibliographic records, it really comes down to
looking at the metadata fields and selecting the best-fitting merging
algorithm for each: e.g. for publication dates you may want to generate
a date range that is representative of the whole merged result, or you
may want to pick up all unique author fields from among all records.
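
As a rough sketch of those two strategies (again in Python for
illustration only; the function names are hypothetical, the values are
made up, and dates are assumed to be plain four-digit years):

```python
def merge_range(years):
    """Collapse publication years into a single year or a 'first-last' range."""
    parsed = sorted({int(y) for y in years})
    if not parsed:
        return None
    first, last = parsed[0], parsed[-1]
    return str(first) if first == last else f"{first}-{last}"


def merge_unique(values):
    """Keep each distinct value once, in order of first appearance.

    Comparison is an exact string match, so variant spellings survive.
    """
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique


date_range = merge_range(["2002", "1997", "1999"])
authors = merge_unique(["Melville, Herman", "Melville, H.", "Melville, Herman"])
```

Here the dates collapse to a 1997-2002 style period, and the exact
repeat of the author is dropped while the variant form is kept.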

> lot of configuration options, which makes it quite flexible.
>
> So, you can choose:
>
> 1. Which fields need to be identical for a record to be merged at all. I
> was using author, title, edition when I first mailed the list, but have
> found allowing records with different publication dates to be merged
> just caused too many unpredictable problems, and have now added
> publication date to the list of required fields. The test for identical
> authors, dates etc is just a string comparison, so a proportion of
> records which ought to be merged by these criteria never are, due to
> typos, variant names, date formats, etc.
>
> 2. What to do with fields which differ between records which are being
> merged. You can choose either 'unique', which appends all unique field
> values (this is what I use for subject headings, so exactly repeated
> subject headings are dropped, but variants are kept), or 'longest',
> which picks the longest field value from all the candidates (this is
> what I use for abstracts).
>
> At the end of the process you have a merged record which has a 'head'
> with the merged record itself, but which contains each of the original
> records, so you could potentially do as you suggest and let users see
> any of the input records if they wanted. However, by default this isn't
> Marc but an internal format (a processed subset of the Marc input) so it
> may not be much use to most users.
>
> I'm finding the 'head' section is mostly quite usable but does often
> have individual fields with strange or repeated values (eg values
> identical apart from punctuation). So I'm doing some post-processing of
> my own on this, but it's very arbitrary at the moment.
>
> Graham
>
> On 03/30/12 01:09, Peter Noerr wrote:
>> Hi Graham,
>>
>> What we do in our federated search system, and have been doing for some
>> years, is basically give the "designer" a choice of what options the user 
>> gets for "de-duped" records.
>>
>> Firstly, de-duping can be done at a number of levels of sophistication, and many
>> of them lead to the situation you have - records which are "similar" rather
>> than identical. On the web search side of things there are a surprising 
>> number of real duplicates (well maybe not surprising if you study more than 
>> one page of web search engine results), and on Twitter the duplicates well 
>> outnumber the original posts (many thanks 're-tweet').
>>
>> Where we get duplicate records the usual options are: 1) keep the first and 
>> just drop all the rest. 2) keep the largest (assumed to have the most 
>> information) and drop the rest. These work well for WSE results where they 
>> are all almost identical (the differences often are just in the advertising 
>> attached to the pages and the results), but not for bibliographic records.
>>
>> Less draconian is 3) Mark all the duplicates and keep them in the list (so 
>> you get 1, 2, 3, 4, 5, 5.1, 5.2, 5.3, 6, ...). This groups all the similar 
>> records together under the sort key of the first one, and does enable the 
>> user to easily skip them.
>>
>> More user friendly is 4) Mark all duplicates and hide them in a sub-list 
>> attached to the "head" record. This gets them out of the main display, but 
>> allows the user who is interested in that "record" to expand the list and 
>> see the variants. This could be of use to you.
>>
>> After that we planned to do what you are proposing and actually merge record 
>> content into a single virtual record, and worked on algorithms to do it. But 
>> nobody was interested. All our partners (who provide systems to lots of 
>> libraries, both public, academic, and special) decided that it would confuse 
>> their users more than it would help. I have my doubts, but they spoke and we 
>> put the development on ice.
>>
>> I'm not sure this will help, but it has stood the test of time, and is well 
>> used in its various guises. Since no-one else seems interested in this 
>> topic, you could email me off list and we could discuss what we worked 
>> through in the way of algorithms, etc.
>>
>> Peter
>>
>>
>>> -----Original Message-----
>>> From: Code for Libraries [mailto:[email protected]] On Behalf Of 
>>> graham
>>> Sent: Wednesday, March 28, 2012 8:05 AM
>>> To: [email protected]
>>> Subject: Re: [CODE4LIB] presenting merged records?
>>>
>>> Hi Michael
>>>
>>> On 03/27/12 11:50, Michael Hopwood wrote:
>>>> Hi Graham, do I know you from RHUL?
>>>>
>>> Yes indeed :-)
>>>
>>>> My thoughts on "merged records" would be:
>>>>
>>>> 1. don't do it - use separate IDs and just present links between related 
>>>> manifestations; thus
>>> avoiding potential confusions.
>>>
>>> In my case, I can't avoid it as it's a specific requirement: I'm doing a 
>>> federated search across a
>>> large number of libraries, and if closely similar items aren't merged, the 
>>> results become excessively
>>> large and repetitive. I'm merging all the similar items, displaying a 
>>> summary of the merged
>>> bibliographic data, and providing links to each of the libraries with a 
>>> copy.  So it's not really
>>> FRBRization in the normal sense, I just thought that FRBRization would lead 
>>> to similar problems, so
>>> that there might be some well-known discussion of the issues around... The 
>>> merger of the records does
>>> have advantages, especially if some libraries have very underpopulated 
>>> records (eg subject fields).
>>>
>>> Cheers
>>> Graham
>>>
>>>>
>>>> http://www.bic.org.uk/files/pdfs/identification-digibook.pdf
>>>>
>>>> possible relationships - see 
>>>> http://www.editeur.org/ONIX/book/codelists/current.html - lists 51
>>> (manifestation) and 164 (work).
>>>>
>>>> 2. c.f. the way Amazon displays rough and ready categories (paperback,
>>>> hardback, audiobooks, *ahem* ebooks of some sort...)
>>>>
>>>> On dissection and reconstitution of records - there is a lot of talk going 
>>>> on about RDFizing MaRC
>>> records and re-using in various ways, e.g.:
>>>>
>>>> http://www.slideshare.net/JenniferBowen/moving-library-metadata-toward
>>>> -linked-data-opportunities-provided-by-the-extensible-catalog
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Code for Libraries [mailto:[email protected]] On Behalf
>>>> Of graham
>>>> Sent: 27 March 2012 11:06
>>>> To: [email protected]
>>>> Subject: [CODE4LIB] presenting merged records?
>>>>
>>>> Hi
>>>>
>>>> There seems to be a general trend to presenting merged records to users, 
>>>> as part of the move towards
>>> FRBRization. If records need merging this generally means they weren't 
>>> totally identical to start with,
>>> so you can end up with conflicting bibliographic data to display.
>>>>
>>>> Two examples I've come across with this: Summon can merge
>>>> print/electronic versions of texts, so uses a new 'merged' material
>>>> type of 'book/ebook' (it doesn't yet seem to have all the other
>>>> possible permutations, eg book/audiobook). Pazpar2 (which I'm working
>>>> with at the
>>>> moment) has a merge option for publication dates which presents dates as a 
>>>> period eg 1997-2002.
>>>>
>>>> The problem is not with the underlying data (the original unmerged values 
>>>> can still be there in the
>>> background) but how to present them to the user in an intuitive way. With 
>>> the date example, presenting
>>> dates in this format sometimes throws people as it looks too much like the 
>>> author birth/death dates
>>> you might see with a record.
>>>>
>>>> I guess people must generally be starting to run into this kind of display 
>>>> problem, so it has maybe
>>> been discussed to death on ... wherever it is people talk about 
>>> FRBRization. Any suggestions? Any
>>> mailing lists, blogs etc anyone can recommend for me to look at?
>>>>
>>>> Thanks for any ideas
>>>> Graham



-- 

Cheers,
Jakub
