Of possible interest is some work we've done to take the clustering 
capabilities of OpenRefine and bake them into our metadata editing interface 
for The Portal to Texas History and the UNT Digital Library.

We've focused on interfaces that might be of interest; I've written a bit 
about the work in this post: http://vphill.com/journal/post/6173/

We are generating the clusters on facets from Solr.
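As an illustration of how clustering Solr facet values might work, here is a minimal sketch of OpenRefine-style "fingerprint" clustering applied to a list of facet values. This is not Mark's actual implementation; the sample values are hypothetical, and a real version would fetch the values from a Solr facet query first.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """OpenRefine-style fingerprint key: lowercase, strip accents and
    punctuation, then sort and deduplicate the whitespace tokens."""
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(sorted(set(value.split())))

def cluster(values):
    """Group values by fingerprint; keep only keys with >1 variant,
    since those are the candidates for normalization."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Hypothetical facet values: all three variants land in one cluster.
facet_values = ["South Bend, Ind.", "south bend ind", "Ind. South Bend"]
print(cluster(facet_values))
```

Note that fingerprinting only catches variants that reduce to the same token set; "South Bend, IN" and "South Bend, Ind." would not merge this way, which is where nearest-neighbor methods come in.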

Mark
________________________________________
From: Code for Libraries <[email protected]> on behalf of Péter Király 
<[email protected]>
Sent: Wednesday, October 25, 2017 11:18:52 AM
To: [email protected]
Subject: [EXT] Re: [CODE4LIB] clustering techniques for normalizing 
bibliographic data

Hi Eric,

I am planning to work on detecting such anomalies. What I have
thought about so far are the following approaches:
- n-gram analysis
- basket analysis
- similarity detection of Solr
- finite state automata

The tools I will use are Apache Solr and Apache Spark. I haven't
started the implementation yet.
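The n-gram approach in the list above could be sketched as follows: compare two heading strings by the Jaccard similarity of their character trigrams. This is only an illustration of the idea, not Péter's planned implementation, and the sample strings are hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity of the two strings' n-gram sets (0.0 to 1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Variant spellings score high; unrelated strings score low.
print(similarity("South Bend, Ind.", "South Bend, IN"))
print(similarity("South Bend, Ind.", "Notre Dame"))
```

Pairs scoring above some threshold would be flagged as candidate variants of the same value; choosing that threshold is the hard, data-dependent part.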

Best,
Péter


2017-10-25 17:57 GMT+02:00 Eric Lease Morgan <[email protected]>:
> Has anybody here played with any clustering techniques for normalizing 
> bibliographic data?
>
> My bibliographic data is fraught with inconsistencies. For example, a 
> publisher’s name may be recorded one way, another way, or a third way. The 
> same goes for things like publisher place: South Bend; South Bend, IN; South 
> Bend, Ind. And then there is the ISBD punctuation that is sometimes applied 
> and sometimes not. All of these inconsistencies make indexing & faceted 
> browsing more difficult than it needs to be.
>
> OpenRefine is a really good program for finding these inconsistencies and 
> then normalizing them. OpenRefine calls this process “clustering”, and it 
> points to a nice page describing the various clustering processes. [1] Some 
> of the techniques included “fingerprinting” and calculating “nearest 
> neighbors”. Unfortunately, OpenRefine is not really programable, and I’d like 
> to automate much of this process.
>
> Does anybody here have any experience automating the process of normalizing 
> bibliographic (MARC) data?
>
> [1] about clustering - http://bit.ly/2izQarE
>
> —
> Eric Morgan



--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly
