Re: [Dspace-devel] [DuraSpace JIRA] (DS-1523) detection of duplicate items during import and submission

Thornton, Susan M. (LARC-B702)[LITES] Tue, 19 Mar 2013 09:58:44 -0700

I developed a Java program a few years back that reads the document names in a 
directory, then checks DSpace to see whether each document already exists in 
the bitstream table.  One of two actions takes place for each document:


1.  If the document is NOT found in the bitstream table, it is written to a 
subdirectory entitled “TODO”.

2.  If the document IS found in the bitstream table, it is written to a 
subdirectory entitled “ALREADY-IN-DSPACE”.



We then have two directories of documents: one that contains the documents that 
are already in DSpace and one that contains the documents that still need to be 
imported/loaded into DSpace.  I use this program frequently to make sure we’re 
not loading duplicates into DSpace.
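
For anyone who wants to try the same approach, the sorting step can be sketched roughly as follows. This is a minimal illustration and not the original program: the class name DuplicateSorter and the existsInDspace predicate are assumptions made for the example. In a real setup the predicate would wrap a lookup against the bitstream table (for instance a JDBC query by file name), which is deliberately left out here. Files are copied, not moved, so the input directory stays intact.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Predicate;

public class DuplicateSorter {

    /**
     * Copies every regular file in inputDir into either the "TODO" or the
     * "ALREADY-IN-DSPACE" subdirectory, depending on what the (hypothetical)
     * existsInDspace lookup reports for the file name.
     *
     * @return the number of files that were reported as already present
     */
    public static int sort(Path inputDir, Predicate<String> existsInDspace) throws IOException {
        // Create both target subdirectories up front (no-op if they exist).
        Path todo = Files.createDirectories(inputDir.resolve("TODO"));
        Path present = Files.createDirectories(inputDir.resolve("ALREADY-IN-DSPACE"));

        int alreadyPresent = 0;
        // Only regular files are considered, so the two subdirectories are skipped.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, Files::isRegularFile)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                boolean found = existsInDspace.test(name);
                Path target = (found ? present : todo).resolve(name);
                Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
                if (found) {
                    alreadyPresent++;
                }
            }
        }
        return alreadyPresent;
    }
}
```

Matching on the file name alone is the simplest possible check; comparing checksums instead would also catch renamed copies, at the cost of reading every file.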



Best regards,

Sue Thornton







Sue Walker-Thornton

(w):  (757) 864-2368

(m):  (757) 506-9903



-----Original Message-----
From: Ivan Masár (DuraSpace JIRA) [mailto:[email protected]]
Sent: Sunday, March 17, 2013 11:47 AM
To: [email protected]
Subject: [Dspace-devel] [DuraSpace JIRA] (DS-1523) detection of duplicate items 
during import and submission



Ivan Masár created DS-1523:

------------------------------

             Summary: detection of duplicate items during import and submission
                 Key: DS-1523
                 URL: https://jira.duraspace.org/browse/DS-1523
             Project: DSpace
          Issue Type: New Feature
            Reporter: Ivan Masár





Users expressed the need for DSpace to detect whether an item they're about to 
import/submit already exists in the repository. This issue is trying to capture 
the requirements for this feature.



The major point here is the definition of a duplicate. Some users already have a 
strict definition of a duplicate, e.g. an equal value of a metadata field 
(dc.identifier.uuid). Others may depend on the similarity of multiple metadata 
fields (e.g. dc.title, dc.issn), which may be expressed by Levenshtein distance, 
while the remaining fields may even differ (e.g. different values in 
dc.contributor.author).
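
To make the similarity case concrete, here is a minimal sketch of a Levenshtein-based title comparison. The class name TitleSimilarity, the method names, and the maxDistance threshold are assumptions chosen for illustration; nothing here is existing DSpace code, and a suitable threshold would have to be tuned per repository.

```java
import java.util.Locale;

public class TitleSimilarity {

    /**
     * Classic dynamic-programming Levenshtein distance: the minimum number of
     * single-character insertions, deletions and substitutions needed to turn
     * a into b. Only two rows of the table are kept in memory.
     */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Hypothetical helper: treats two titles as similar when their distance,
     * after trimming and lower-casing, does not exceed maxDistance.
     */
    public static boolean similarTitles(String a, String b, int maxDistance) {
        String left = a.trim().toLowerCase(Locale.ROOT);
        String right = b.trim().toLowerCase(Locale.ROOT);
        return levenshtein(left, right) <= maxDistance;
    }
}
```

For example, `TitleSimilarity.levenshtein("kitten", "sitting")` is 3, so the two strings count as similar with a threshold of 3 but not with a threshold of 2.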



This leads me to the conclusion that we need to provide a way for users to 
define their own method of comparison by means of a plugin. The disadvantage of 
this approach is that checking each imported item against all existing items 
using a user-defined (and possibly slow) method may slow down import, so the 
feature needs to be opt-in. Of course, we should provide implementations for 
some commonly used cases, like those mentioned above. The input to the 
comparison method should be the item DSO (so that its metadata and bitstreams 
can be read) with the parent object filled in, so that the search can be 
restricted to a community/collection.
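
As a rough sketch of what such a plugin contract might look like, consider the following. Every name in it (DuplicateCheckSketch, Item, DuplicateDetector, exactField, findDuplicates) is hypothetical and uses plain Java types as stand-ins; it does not reflect any existing DSpace interface, and the Item class is only a simplified placeholder for the item DSO and its parent collection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DuplicateCheckSketch {

    /** Hypothetical stand-in for an item: its parent collection plus flat metadata. */
    public static final class Item {
        final String collection;
        final Map<String, String> metadata;

        public Item(String collection, Map<String, String> metadata) {
            this.collection = collection;
            this.metadata = metadata;
        }
    }

    /** Hypothetical plugin contract: decides whether two items are duplicates. */
    public interface DuplicateDetector {
        boolean isDuplicate(Item candidate, Item existing);
    }

    /** Bundled strict case: both items carry the same non-null value in one field. */
    public static DuplicateDetector exactField(final String field) {
        return (candidate, existing) -> {
            String value = candidate.metadata.get(field);
            return value != null && value.equals(existing.metadata.get(field));
        };
    }

    /**
     * Runs the detector against existing items, restricting the search to the
     * candidate's own collection to keep the number of comparisons down.
     */
    public static List<Item> findDuplicates(Item candidate, List<Item> existingItems,
                                            DuplicateDetector detector) {
        List<Item> matches = new ArrayList<>();
        for (Item existing : existingItems) {
            if (!existing.collection.equals(candidate.collection)) {
                continue; // outside the requested scope
            }
            if (detector.isDuplicate(candidate, existing)) {
                matches.add(existing);
            }
        }
        return matches;
    }
}
```

Keeping the detector as a single-method interface means a site could supply its own comparison without touching the import code, and the check stays opt-in simply by not configuring a detector.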



Here are some recent discussions on this topic:

* http://dspace.2283337.n4.nabble.com/KE1019161-Import-Editing-Items-in-DSpace-and-evaluating-the-existence-with-an-other-value-then-the-iD-td4662400.html

* DS-1515

* http://dspace.2283337.n4.nabble.com/how-to-filter-the-repeat-items-with-import-tools-td4662729.html







_______________________________________________

Dspace-devel mailing list

[email protected]

https://lists.sourceforge.net/lists/listinfo/dspace-devel
