I developed a Java program a few years back that reads in a directory of
document names, then checks DSpace to see whether each document already exists
in the bitstream table. One of two actions takes place for each document:
1. If the document is NOT found in the bitstream table, it is written to a
subdirectory entitled “TODO”.
2. If the document IS found in the bitstream table, it is written to a
subdirectory entitled “ALREADY-IN-DSPACE”.
We then have 2 directories of documents, one that contains the documents that
are already in DSpace and one that contains documents that need to be
imported/loaded into DSpace. I use this program frequently to make sure we’re
not loading duplicates into DSpace.
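For anyone who wants to try the same idea, the sorting step could be sketched roughly as below. This is not the original program. The class and method names are hypothetical. The set of known names stands in for the lookup against the bitstream table. The sketch assumes that files are copied rather than moved and that the two subdirectories are created inside the input directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Set;

// Hypothetical sketch, not the original program. The set of known names is an
// assumption that replaces a real query against the DSpace bitstream table.
final class DuplicateSorter {

    static final String TODO_DIR = "TODO";
    static final String EXISTING_DIR = "ALREADY-IN-DSPACE";

    /**
     * Copies each regular file in inputDir into one of two subdirectories,
     * depending on whether its file name appears in knownBitstreamNames.
     * Returns the number of files copied into the TODO subdirectory.
     */
    static int sort(Path inputDir, Set<String> knownBitstreamNames) throws IOException {
        Path todo = Files.createDirectories(inputDir.resolve(TODO_DIR));
        Path existing = Files.createDirectories(inputDir.resolve(EXISTING_DIR));
        int todoCount = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    continue; // also skips the two output subdirectories
                }
                String name = entry.getFileName().toString();
                boolean known = knownBitstreamNames.contains(name);
                Path target = (known ? existing : todo).resolve(name);
                Files.copy(entry, target, StandardCopyOption.REPLACE_EXISTING);
                if (!known) {
                    todoCount++;
                }
            }
        }
        return todoCount;
    }
}
```

Matching on file name alone is the simplest possible check; comparing checksums instead would also catch renamed copies, at the cost of reading every file.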
Best regards,
Sue Thornton
Sue Walker-Thornton
(w): (757) 864-2368
(m): (757) 506-9903
-----Original Message-----
From: Ivan Masár (DuraSpace JIRA) [mailto:[email protected]]
Sent: Sunday, March 17, 2013 11:47 AM
To: [email protected]
Subject: [Dspace-devel] [DuraSpace JIRA] (DS-1523) detection of duplicate items
during import and submission
Ivan Masár created DS-1523:
------------------------------
Summary: detection of duplicate items during import and submission
Key: DS-1523
URL: https://jira.duraspace.org/browse/DS-1523
Project: DSpace
Issue Type: New Feature
Reporter: Ivan Masár
Users expressed the need for DSpace to detect whether an item they're about to
import/submit already exists in the repository. This issue is trying to capture
the requirements for this feature.
The major point here is the definition of a duplicate. Some users already have a
strict definition of a duplicate, e.g. an equal value of a metadata field
(dc.identifier.uuid). Others may depend on the similarity of multiple metadata
fields (e.g. dc.title, dc.issn), which may be expressed by Levenshtein distance,
while the remaining fields may even differ (e.g. different values in
dc.contributor.author).
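To make the similarity case concrete, here is a minimal sketch of a Levenshtein-based title comparison. The class name, the case and whitespace normalization, and the maxDistance threshold are all assumptions made for illustration; nothing here is an existing DSpace API.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of DSpace.
final class TitleSimilarity {

    /** Levenshtein distance computed with two rolling rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Treats two values as similar if their normalized distance is within maxDistance. */
    static boolean similar(String a, String b, int maxDistance) {
        String x = a.trim().toLowerCase(Locale.ROOT);
        String y = b.trim().toLowerCase(Locale.ROOT);
        return levenshtein(x, y) <= maxDistance;
    }
}
```

A fixed threshold is crude for fields of very different lengths; dividing the distance by the longer string's length would be one alternative.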
This leads me to the conclusion that we need to provide a way for users to
define their own method of comparison by means of a plugin. The disadvantage of
this approach is that checking each imported item against all existing items
using a user-defined (and possibly slow) method may slow down import, so the
feature needs to be opt-in. Of course we should provide implementations for
some commonly used cases, like those mentioned above. The input to the
comparison method should be the item DSO (so that its metadata and bitstreams
can be read) with the parent object filled in, so that the search can be
restricted to a community/collection to reduce the search scope.
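As a rough illustration of the plugin idea, the comparison could sit behind a small interface like the one below. All type names (DuplicateCheck, SimpleItem, ExactFieldCheck, DuplicateFinder) are hypothetical stand-ins and not DSpace classes. SimpleItem replaces the item DSO, and its collection field stands in for the parent object used to restrict the search scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contract; a site would supply its own implementation.
interface DuplicateCheck {
    boolean isDuplicate(SimpleItem candidate, SimpleItem existing);
}

// Hypothetical stand-in for an item DSO: a parent collection plus metadata.
final class SimpleItem {
    final String collection;
    final Map<String, String> metadata;

    SimpleItem(String collection, Map<String, String> metadata) {
        this.collection = collection;
        this.metadata = metadata;
    }
}

// Strict definition: two items are duplicates if one field has equal values.
final class ExactFieldCheck implements DuplicateCheck {
    private final String field;

    ExactFieldCheck(String field) {
        this.field = field;
    }

    @Override
    public boolean isDuplicate(SimpleItem candidate, SimpleItem existing) {
        String value = candidate.metadata.get(field);
        return value != null && value.equals(existing.metadata.get(field));
    }
}

final class DuplicateFinder {
    /** Returns existing items in the candidate's collection that the check flags. */
    static List<SimpleItem> find(SimpleItem candidate, List<SimpleItem> repository,
                                 DuplicateCheck check) {
        List<SimpleItem> hits = new ArrayList<>();
        for (SimpleItem existing : repository) {
            if (!existing.collection.equals(candidate.collection)) {
                continue; // restrict the search to the parent collection
            }
            if (check.isDuplicate(candidate, existing)) {
                hits.add(existing);
            }
        }
        return hits;
    }
}
```

The linear scan shows why the feature would need to be opt-in: every imported item is compared against every existing item in scope, so the cost grows with the size of the collection.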
Here are some recent discussions on this topic:
* http://dspace.2283337.n4.nabble.com/KE1019161-Import-Editing-Items-in-DSpace-and-evaluating-the-existence-with-an-other-value-then-the-iD-td4662400.html
* DS-1515
* http://dspace.2283337.n4.nabble.com/how-to-filter-the-repeat-items-with-import-tools-td4662729.html
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics Download AppDynamics Lite for free
today:
http://p.sf.net/sfu/appdyn_d2d_mar
_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel