Hi Alan,

An automated way of fixing mismatched authorities would be good. I'm not
aware of one, but does DSpace have a way of fixing authority records?

In your case, maybe you could have a script that checks, whenever there are
multiple rows for distinct(text_value, authority, confidence), what the
best course of action is. If one of those has a high confidence score, then
have that value overwrite the others. Is the one with the 600 confidence
the correct one? Does confidence mean that it came from the authority
backstore, or does confidence mean that a human touched a button?
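
A rough sketch of that first script, in Python, just to make the idea
concrete. It assumes the rows have already been fetched from metadatavalue
as (text_value, authority, confidence) tuples, and it assumes that a higher
confidence number really is more trustworthy, which is exactly the open
question above. The function name pick_canonical is made up for
illustration; nothing here is DSpace API.

```python
from collections import defaultdict


def pick_canonical(rows):
    """Choose one authority per name from (text_value, authority, confidence) rows.

    Only names that appear with more than one distinct combination are
    reported. Rows without an authority are never chosen as the winner.
    ASSUMPTION: the highest confidence value marks the best authority.
    """
    groups = defaultdict(set)
    for text_value, authority, confidence in rows:
        groups[text_value].add((authority, confidence))

    canonical = {}
    for text_value, combos in groups.items():
        if len(combos) < 2:
            continue  # nothing to reconcile for this name
        candidates = [(a, c) for a, c in combos if a]
        if not candidates:
            continue  # every row lacks an authority, so no winner
        canonical[text_value] = max(candidates, key=lambda ac: ac[1])
    return canonical


# The distinct rows from Alan's query for 'Grace, D.'
rows = [
    ("Grace, D.", "0b4fcbc1-d930-4319-9b4d-ea1553cca70b", 0),
    ("Grace, D.", "83a8848e-1651-40df-8453-831eabdee9e0", 0),
    ("Grace, D.", "0b4fcbc1-d930-4319-9b4d-ea1553cca70b", 600),
    ("Grace, D.", None, -1),
    ("Grace, D.", "0b4fcbc1-d930-4319-9b4d-ea1553cca70b", -1),
]
canonical = pick_canonical(rows)
```

The result only proposes a winner per name. Whether to actually overwrite
the other rows should probably stay a reviewed, manual step until the
meaning of the confidence values is clear.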

Or, perhaps another script that looks at all the text_value records that
are missing an authority. Have it search the authority (likely ORCID) for a
match on text_value, and if there is a very close match, then have it make a
recommended replacement. Then present the user with a whole batch of
corrections, grouped by match quality:
Orcid and text_value match = 100%
a
b
c

80% match
50% match
5% match...
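
A minimal sketch of that matching step, again in Python. It assumes the
candidate names have already been pulled from the authority source into a
plain list (the actual ORCID or Solr lookup is left out), and it uses the
standard library's difflib for the similarity score, which is only one
possible measure. The function name suggest_authorities and the 0.8
threshold are illustrative choices, not anything DSpace provides.

```python
from difflib import SequenceMatcher


def suggest_authorities(missing_names, authority_names, threshold=0.8):
    """Suggest the closest authority name for each name lacking an authority.

    Returns (name, best_candidate, score) tuples, best matches first.
    Names whose best score is below the threshold are left out, so a
    human only reviews plausible corrections.
    """
    suggestions = []
    for name in missing_names:
        best_score, best_candidate = 0.0, None
        for candidate in authority_names:
            # Case-insensitive character similarity in the range 0.0 to 1.0
            score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
            if score > best_score:
                best_score, best_candidate = score, candidate
        if best_candidate is not None and best_score >= threshold:
            suggestions.append((name, best_candidate, round(best_score, 2)))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Hypothetical inputs: names without an authority, and known authority names
missing = ["Dietz, Peter M", "Smith, J."]
known = ["Dietz, Peter M", "Grace, D."]
suggestions = suggest_authorities(missing, known)
```

Sorting by score puts the exact matches at the top of the review list and
the doubtful ones at the bottom, and the threshold can be lowered if too
few candidates show up.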

Locally we've just hit an issue where users are doing the first half of
a submission in XMLUI using authority, then switching to JSPUI to upload
the files (using HTML5 multi-upload, which they love), and we're just now
discovering that sometimes the authority information from XMLUI got lost in
translation (but the author "Lastname, First" is available).

________________
Peter Dietz
Longsight
www.longsight.com
[email protected]
p: 740-599-5005 x809

On Thu, Aug 4, 2016 at 9:50 AM, Alan Orth <[email protected]> wrote:

> Hi,
>
> Thanks for the responses. Here is one author who has duplicates due to
> mismatched authority and confidences:
>
> dspace=# select count(text_value) from metadatavalue where
> metadata_field_id=3 and text_value='Grace, D.';
>  count
> -------
>    516
>
> dspace=# select distinct text_value, authority, confidence from
> metadatavalue where metadata_field_id=3 and text_value='Grace, D.';
>  text_value |              authority               | confidence
> ------------+--------------------------------------+------------
>  Grace, D.  | 0b4fcbc1-d930-4319-9b4d-ea1553cca70b |          0
>  Grace, D.  | 83a8848e-1651-40df-8453-831eabdee9e0 |          0
>  Grace, D.  | 0b4fcbc1-d930-4319-9b4d-ea1553cca70b |        600
>  Grace, D.  |                                      |         -1
>  Grace, D.  | 0b4fcbc1-d930-4319-9b4d-ea1553cca70b |         -1
>
> @Andrea, you can see this is actually the top author in our
> repository, with 516 records (as in the first SQL command). She has
> something like 462 and 40 items if you look carefully in the authors
> of this particular community:
>
> https://cgspace.cgiar.org/handle/10568/1/search-filter?field=author
> https://cgspace.cgiar.org/handle/10568/1/search-filter?field=author&offset=130
>
> @Peter, yes it totally helps, and I've done it for a few high-profile
> authors, but I was hoping to just make Discovery use text_value for
> now, as there's no way to do an automated/batch cleanup for our tens
> of thousands of authors (not to mention, in the future the problem
> will still be there). Unless I suppose this is just growing pains from
> moving an existing DSpace installation into the DSpace 5+ era!
>
> Ciao,
>
> On Thu, Aug 4, 2016 at 3:46 PM, Chris Gray <[email protected]> wrote:
> > This could be a problem with the underlying id stored for each name.  By
> > default, the authority index doesn't assume two names are for the same
> > person unless they are explicitly associated with the same underlying id.
> >
> > An authority record looks like this:
> >
> >       {
> >         "id": "badd7401-12aa-4c5c-8d62-515e5746b9d3",
> >         "field": "dc_contributor_author",
> >         "value": "Author, Random",
> >         "deleted": false,
> >         "creation_date": "2015-11-27T05:46:05.066Z",
> >         "last_modified_date": "2015-11-27T05:46:05.066Z",
> >         "authority_type": "person",
> >         "first_name": "Random",
> >         "last_name": "Author"
> >       },
> >
> > That first "id" element determines what can or can't be considered the
> > same person for indexing.
> >
> > We discovered this problem with ORCIDs.  Even when two entries are given
> > with the same ORCID, the entries were still given different ids and show
> > up in browse lists and facets as multiple authors.  We have to force a
> > certain id when adding a new dc.contributor field.
> >
> > On Thursday, August 4, 2016 at 8:08:05 AM UTC-4, Alan Orth wrote:
> >>
> >> Hi,
> >>
> >> We have hundreds or thousands of duplicate authors that have the same
> >> exact text_value, but show as separate authors in Discovery's author
> >> sidebar facet. I have tried a handful of these configuration keys
> >> (with a full discovery reindex after) on DSpace 5.1 but I never see
> >> any change.
> >>
> >> First I tried:
> >>
> >> index.authority.ignore-prefered.dc.contributor.author=true
> >> index.authority.ignore-variants.dc.contributor.author=false
> >>
> >> Then:
> >>
> >> index.authority.ignore=true
> >> index.authority.ignore-prefered=true
> >> index.authority.ignore-variants=true
> >>
> >> Then:
> >>
> >> discovery.index.authority.ignore-prefered.dc.contributor.author=true
> >> discovery.index.authority.ignore-variants=true
> >>
> >> What is the trick to getting Discovery to use author text values for
> >> its indexes? Is this a bug that upgrading to 5.{2,3,4,5} will fix? I'm
> >> going slightly crazy. :)
> >>
> >> Thanks,
> >>
> >> On Mon, Jan 4, 2016 at 11:19 PM, Andrea Bollini <[email protected]> wrote:
> >> > Hi all,
> >> > the relevant parameters are
> >> > discovery.browse.authority.ignore-prefered
> >> > discovery.index.authority.ignore-prefered
> >> >
> >> > I was probably partially wrong about the browse behaviour: looking at
> >> > the code (sorry, I have had no chance to run an actual test), the
> >> > metadatavalue is never recorded in the browse index when it is
> >> > authority controlled,
> >> > see
> >> >
> >> > https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/browse/SolrBrowseCreateDAO.java#L186
> >> >
> >> > Instead, the search (facets) includes only the preferred form if you
> >> > use the default configuration, see
> >> >
> >> > https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/discovery/SolrServiceImpl.java#L1091
> >> >
> >> > It also looks as if the browse system is buggy when ignore-prefered is
> >> > set to true, as probably nothing is indexed in the browse in this case.
> >> > We have probably fixed this bug in our dspace-cris fork and forgot to
> >> > backport it to basic DSpace,
> >> > see
> >> >
> >> > https://github.com/Cineca/DSpace/blob/dspace-5_x_x-cris/dspace-api/src/main/java/org/dspace/browse/SolrBrowseCreateDAO.java#L224
> >> >
> >> > About your second question, which is the authority for authors: in a
> >> > dspace-cris instance it is the internal researcher pages database and
> >> > the ORCID registry. In a standard DSpace instance you can configure the
> >> > ORCID registry as the first lookup for author names; when the item is
> >> > archived, the ORCID record is stored in a local Solr-based cache, which
> >> > is what is actually used as the authority
> >> >
> >> > https://github.com/DSpace/DSpace/blob/master/dspace/config/dspace.cfg#L1563
> >> >
> >> > This means that, unless you edit the metadata value directly in the
> >> > database or through the admin edit, you cannot have an authority with
> >> > a corresponding value different from the "prefered one". With
> >> > DSpace-CRIS, where variants are also managed out of the box, this can
> >> > happen more easily.
> >> >
> >> > Best,
> >> > Andrea
> >> >
> >> > ----- Original message -----
> >> > From: "Hilton Gibson" <[email protected]>
> >> > To: "Peter Dietz" <[email protected]>
> >> > Cc: "DSpace Technical Support" <[email protected]>
> >> > Sent: Monday, 4 January 2016 16:40:31
> >> > Subject: Re: [dspace-tech] Author name alternate spellings
> >> >
> >> >
> >> >
> >> > Hi All,
> >> >
> >> >
> >> > "You can also configure the system (see the discovery.cfg options) to
> >> > ignore the metadatavalue and put in the index only the prefered form
> >> > of a name as provided by the authority."
> >> >
> >> >
> >> > 1. Where in "discovery.cfg" is this configured? A github link would
> >> > help.
> >> > 2. Who is the "authority" for authors? Excuse the pun!
> >> >
> >> >
> >> > Regards
> >> >
> >> >
> >> > hg
> >> >
> >> > Hilton Gibson
> >> > Stellenbosch University Library
> >> >
> >> > http://orcid.org/0000-0002-2992-208X
> >> >
> >> >
> >> >
> >> >
> >> > On 4 January 2016 at 17:13, Peter Dietz <[email protected]> wrote:
> >> >
> >> >
> >> >
> >> > Hi All,
> >> >
> >> >
> >> > Let's say you have an author record that might have different values,
> >> > all representing the same person.
> >> > Peter Dietz
> >> > Peter M Dietz
> >> > PM Dietz
> >> > Dietz, Peter M
> >> > Dietz, PM
> >> >
> >> >
> >> > Does DSpace have any facility to map these distinct values to the same
> >> > author, such that a search for "Peter Dietz" will match when the value
> >> > is stored as "Dietz, PM"? Or is the recommendation just to edit the
> >> > metadata and consolidate the distinct values into one single acceptable
> >> > form (probably the longest version of a name)?
> >> >
> >> >
> >> > My read of "Authority" is that it will give you a backstore, but I
> >> > don't see it as being a mapping of different ways of spelling the same
> >> > thing.
> >> >
> >> >
> >> > Perhaps this is another need for richer metadata objects?
> >> > name.givenname, name.surname, ...
> >> >
> >> >
> >> >
> >> > ________________
> >> > Peter Dietz
> >> > Longsight
> >> > www.longsight.com
> >> > [email protected]
> >> > p: 740-599-5005 x809
> >> >
> >> > --
> >> > You received this message because you are subscribed to the Google
> >> > Groups "DSpace Technical Support" group.
> >> > To unsubscribe from this group and stop receiving emails from it, send
> >> > an email to [email protected] .
> >> > To post to this group, send email to [email protected] .
> >> > Visit this group at https://groups.google.com/group/dspace-tech .
> >> > For more options, visit https://groups.google.com/d/optout .
> >> >
> >>
> >>
> >>
> >> --
> >> Alan Orth
> >> [email protected]
> >> https://englishbulgaria.net
> >> https://alaninkenya.org
> >> https://mjanja.ch
> >> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
> >> GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
> >
>
>
>
> --
> Alan Orth
> [email protected]
> https://englishbulgaria.net
> https://alaninkenya.org
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
> GPG public key ID: 0x8cb0d0acb5cd81ec209c6cdfbd1a0e09c2f836c0
>

