Hi all--Karen mentions:

>> the big question was using the relators as properties and the object as a 
>> string. There are folks who need to do that, and it is a shame that there 
>> isn't an unconstrained version that would allow this, since the LoC list is 
>> the most complete of all lists we can find. 

Could RDA/RDF unconstrained properties be helpful for such use cases? I'd 
expect this to also be a fairly complete list.

Looking at a very small, random sample of relator terms vs. RDA unconstrained 
properties to get some idea of coverage:

Abridger / http://id.loc.gov/vocabulary/relators/abr >> has abridger / 
http://rdaregistry.info/Elements/u/P60394
Enacting jurisdiction / http://id.loc.gov/vocabulary/relators/enj >> (has 
enacting government / http://rdaregistry.info/Elements/u/P60096 perhaps isn't 
quite the same thing, so possibly no coverage here?)
Inscriber / http://id.loc.gov/vocabulary/relators/ins >> has inscriber / 
http://rdaregistry.info/Elements/u/P60460
Libelee-appellant / http://id.loc.gov/vocabulary/relators/let >> (might not 
have coverage here--I only see has appellant / 
http://rdaregistry.info/Elements/u/P60457)
Music programmer / http://id.loc.gov/vocabulary/relators/mup >> has music 
programmer / http://rdaregistry.info/Elements/u/P60894
Redaktor / http://id.loc.gov/vocabulary/relators/red >> (I don't see any 
coverage here...)
Research team head / http://id.loc.gov/vocabulary/relators/rth >> (lacking a 
direct equivalent - I only see has research supervisor / 
http://rdaregistry.info/Elements/u/P61098)
Storyteller / http://id.loc.gov/vocabulary/relators/stl >> has storyteller / 
http://rdaregistry.info/Elements/u/P60154
Visual effects provider / http://id.loc.gov/vocabulary/relators/vfx >> has 
visual effects provider / http://rdaregistry.info/Elements/u/P60748
Writer of preface / http://id.loc.gov/vocabulary/relators/wpr >> (note that 
RDAU 'has writer of preface' is now deprecated, I'd guess as part of the 3R LRM 
alignment work, so no coverage for this relator)
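
To put a rough number on the sample above, here is a minimal Python sketch that tallies it. The pairs are copied from the list; a near match (for example 'has appellant' for Libelee-appellant) is counted as no coverage, which is a judgment call rather than anything the RDA Registry states. The names SAMPLE, coverage, and crosswalk are made up for this illustration.

```python
# Sample pairs copied from the list above. None means no direct RDAU
# equivalent was found; near matches are not counted as coverage.
LOC = "http://id.loc.gov/vocabulary/relators/"
RDAU = "http://rdaregistry.info/Elements/u/"

SAMPLE = {
    "abr": "P60394",  # has abridger
    "enj": None,      # 'has enacting government' (P60096) is only a near match
    "ins": "P60460",  # has inscriber
    "let": None,      # only 'has appellant' (P60457) was found
    "mup": "P60894",  # has music programmer
    "red": None,      # no coverage seen
    "rth": None,      # only 'has research supervisor' (P61098) was found
    "stl": "P60154",  # has storyteller
    "vfx": "P60748",  # has visual effects provider
    "wpr": None,      # RDAU 'has writer of preface' is deprecated
}


def coverage(sample):
    """Return (covered, total) counts for a relator-to-RDAU mapping."""
    covered = sum(1 for target in sample.values() if target is not None)
    return covered, len(sample)


def crosswalk(sample):
    """Yield (relator IRI, RDAU IRI) pairs for the covered relators only."""
    for code, pid in sorted(sample.items()):
        if pid is not None:
            yield LOC + code, RDAU + pid


if __name__ == "__main__":
    covered, total = coverage(SAMPLE)
    print(f"{covered} of {total} sampled relators have a direct RDAU equivalent")
    for relator, prop in crosswalk(SAMPLE):
        print(relator, "->", prop)
```

With only ten relators in the sample, this is an indication of coverage rather than a measurement.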

Looking at modelling for RDAU properties--RDF/XML downloaded from RDA Registry 
at https://www.rdaregistry.info/Elements/u/ , serialized here as Turtle for 
readability:

# take for example 'has abridger'
# omitting non-English labels, definition, and scope notes here

<http://rdaregistry.info/Elements/u/P60394> a rdf:Property ;
    rdfs:label  "has abridger"@en ;
    rdakit:seeAlso <http://rdaregistry.info/Elements/u/P60434> ;
    reg:lexicalAlias <http://rdaregistry.info/Elements/u/abridger.en> ;
    reg:status <http://metadataregistry.org/uri/RegStatus/1001> ;
    rdfs:isDefinedBy <http://rdaregistry.info/Elements/u/> ;
    rdfs:subPropertyOf <http://rdaregistry.info/Elements/u/P60398> ;
    owl:inverseOf <http://rdaregistry.info/Elements/u/P60622> ;
    skos:definition "Relates a resource to an agent who contributes to a resource by shortening a resource of a related resource without changing the general meaning or manner of presentation."@en ;
    skos:scopeNote "Substantial modification that results in the creation of a new resource is excluded."@en .
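
As far as the description above shows, 'has abridger' is typed only as rdf:Property and carries no range, so nothing in the declaration itself rules out a string object. The sketch below writes the same property once with a literal and once with an IRI, as N-Triples. The work and agent identifiers under example.org are hypothetical, and the helper names (iri, literal, triple) are invented for this example rather than taken from any RDF library.

```python
RDAU_HAS_ABRIDGER = "http://rdaregistry.info/Elements/u/P60394"


def iri(value):
    """Format an IRI as an N-Triples term."""
    return f"<{value}>"


def literal(value, lang=None):
    """Format a plain string as an N-Triples literal, escaping as needed."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def triple(subject, predicate, obj):
    """Join three already-formatted terms into one N-Triples statement."""
    return f"{subject} {predicate} {obj} ."


# Hypothetical identifiers, for illustration only.
work = iri("http://example.org/work/1")
agent = iri("http://example.org/agent/42")

# Same property, "string" object versus "thing" object.
with_string = triple(work, iri(RDAU_HAS_ABRIDGER), literal("Doe, Jane"))
with_thing = triple(work, iri(RDAU_HAS_ABRIDGER), agent)

print(with_string)
print(with_thing)
```

Whether an application accepts both forms would then be a matter for its own profile, not for the property definition.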

Benjamin Riesenberg
=========
they/them
Metadata Librarian, Cataloging and Metadata Services, University of Washington 
Libraries
📧 rie...@uw.edu
☎️ 34675 / (206) 543-4675
=========
Monday on campus
Tuesday on campus
Wednesday remote
Thursday on campus and/or remote
Friday remote

-----Original Message-----
From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of CODE4LIB 
automatic digest system
Sent: Monday, October 23, 2023 8:56 AM
To: CODE4LIB@LISTS.CLIR.ORG
Subject: CODE4LIB Digest - 20 Oct 2023 to 23 Oct 2023 - Special issue 
(#2023-240)

There are 5 messages totaling 18361 lines in this issue.

Topics in this special issue:

  1. [External] [CODE4LIB] Question about multiple declarations (2)
  2. Deduping with finesse (2)
  3. Digital Initiatives Symposium 2024

----------------------------------------------------------------------

Date:    Mon, 23 Oct 2023 07:19:49 -0700
From:    Karen Coyle <li...@kcoyle.net>
Subject: Re: [External] [CODE4LIB] Question about multiple declarations

Thanks, Kevin. My question, originally, was whether the typing assigned can be 
seen as "OR" or "AND". I know that you can define SKOS entities as objects and 
as properties and these are not seen as being in conflict, but SKOS is very 
clear in defining this, making sure that it is open. In the LoC case, it is an 
OWL declaration of ObjectProperty and the class Role, a kind of punning. It 
seems to me that all of the declarations are always attached to the subject, 
and therefore using them as objects would trigger inferencing inconsistencies 
(OWL tends to be strict). Have you tried that? Or are you eschewing 
inferencing, as one often does?
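
A minimal sketch of the "AND" reading, assuming the three declarations quoted later in this thread sit inside a single rdf:Description: each child element becomes its own triple, and all three share one subject. The subject used here (relators/abr) is picked only for illustration, and the parser handles just this simple rdf:resource pattern, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The three declarations quoted in this thread, wrapped in an rdf:Description.
# The subject (relators/abr) is an assumption made for this example.
RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://id.loc.gov/vocabulary/relators/abr">
    <rdfs:subPropertyOf rdf:resource="http://purl.org/dc/elements/1.1/contributor"/>
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#ObjectProperty"/>
    <rdf:type rdf:resource="http://id.loc.gov/ontologies/bibframe/Role"/>
  </rdf:Description>
</rdf:RDF>
"""


def triples(rdf_xml):
    """Return (subject, predicate, object) IRIs for simple rdf:resource statements."""
    root = ET.fromstring(rdf_xml)
    found = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for statement in description:
            # ElementTree tags look like '{namespace}local'; join them into an IRI.
            namespace, local = statement.tag[1:].split("}")
            obj = statement.get(f"{{{RDF_NS}}}resource")
            found.append((subject, namespace + local, obj))
    return found


if __name__ == "__main__":
    for s, p, o in triples(RDF_XML):
        print(s, p, o)
```

All three statements hold for the same subject at once, which is why the typing reads as a conjunction rather than a choice.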

In any case, the big question was using the relators as properties and the 
object as a string. There are folks who need to do that, and it is a shame that 
there isn't an unconstrained version that would allow this, since the LoC list 
is the most complete of all lists we can find. 
Declaration as an rdf:Property would do that, and that would entail less "rule" 
on the property definition, while users could define their own more strict 
rules for their application. Again, this brings up how far you can go with 
punning - adding rdf:Property to the mix would probably just make things more 
confusing.

I vote for simpler and less constrained at the vocabulary level, leaving 
constraints to the application profile level, so everyone can have the usage 
they need.
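
One way to picture constraints living at the application profile level is a check that each application runs against its own data. This is only a sketch under stated assumptions: the Statement class, the profile dictionaries, and the "iri"/"literal" keywords are all hypothetical and do not correspond to any existing profile language or LoC tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    obj_is_iri: bool  # True for a "thing", False for a "string"


ABRIDGER = "http://id.loc.gov/vocabulary/relators/abr"

# Hypothetical profiles: each application lists only the properties it restricts.
STRICT_PROFILE = {ABRIDGER: "iri"}  # this application insists on IRIs
OPEN_PROFILE = {}                   # no entry means either kind of object is fine


def violations(statements, profile):
    """Return the statements whose object kind breaks the given profile."""
    broken = []
    for st in statements:
        required = profile.get(st.predicate)
        if required == "iri" and not st.obj_is_iri:
            broken.append(st)
        elif required == "literal" and st.obj_is_iri:
            broken.append(st)
    return broken


# Hypothetical data: the same relator used with a string and with a thing.
DATA = [
    Statement("http://example.org/work/1", ABRIDGER, "Doe, Jane", False),
    Statement("http://example.org/work/1", ABRIDGER, "http://example.org/agent/42", True),
]

if __name__ == "__main__":
    print(len(violations(DATA, STRICT_PROFILE)), "violation(s) under the strict profile")
    print(len(violations(DATA, OPEN_PROFILE)), "violation(s) under the open profile")
```

The same data passes or fails depending on the profile applied, while the vocabulary itself stays neutral.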

kc


On 10/20/23 11:23 AM, Ford, Kevin wrote:
> Hi Karen,
>
> Steve is not wrong, but I think you are talking about two different things.
>
> Using a string with a Relators property would not conform to how they have 
> been defined at ID.LOC.GOV.  So, the answer to your specific question is: no, 
> it is not our expectation Relator URIs would be used as properties with the 
> object of the triple being either a URI or a string.  Only URIs.
>
> But the Relators URIs have also been defined such that they can be used as a 
> Property or as an Object, which is what Steve was driving at.  We use them as 
> Objects in Bibframe, hence their (additional) typing as a bf:Role.
>
> HTH,
> Kevin
>
> --
> Kevin Ford
> Network Development and MARC Standards Office Library of Congress 
> Washington, DC
>
>
> -----Original Message-----
> From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of Karen 
> Coyle
> Sent: Friday, October 20, 2023 11:41 AM
> To: CODE4LIB@LISTS.CLIR.ORG
> Subject: Re: [CODE4LIB] [External] [CODE4LIB] Question about multiple 
> declarations
>
> Steve, the list doesn't need to hear this, but you are not correct here.
> The relators are defined as owl:ObjectProperties (not just "properties") 
> which means that they cannot take text as objects. However, I want LoC to 
> confirm that, because this is their doing.
>
> kc
>
>
> On 10/17/23 8:17 AM, McDonald, Stephen wrote:
>> It is an inherent problem when creating a vocabulary--should this set of 
>> traits be properties or types? Whichever choice you make, you face the 
>> problem that other vocabularies may choose differently. I believe this 
>> vocabulary defines relators as properties. But they also want to show how 
>> the terms are related to terms in OWL and BIBFRAME where they are defined as 
>> types.
>>
>>                                        Steve McDonald
>>                                        steve.mcdon...@tufts.edu
>>
>>
>>> -----Original Message-----
>>> From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of 
>>> Karen Coyle
>>> Sent: Tuesday, October 17, 2023 10:40 AM
>>> To: CODE4LIB@LISTS.CLIR.ORG
>>> Subject: Re: [CODE4LIB] [External] [CODE4LIB] Question about 
>>> multiple declarations
>>>
>>> tl;dr: Does LoC intend that its relator properties be used with both 
>>> "thing" and "string" objects?
>>>
>>> kc
>>>
>>>
>>> On 10/10/23 8:02 AM, McDonald, Stephen wrote:
>>>> That is not correct.  The statement
>>>>     <rdfs:subPropertyOf
>>>>     rdf:resource="http://purl.org/dc/elements/1.1/contributor"/>
>>>>
>>>> is a single predicate-object statement, enclosed within angle brackets.
>>>> The following statement
>>>> <rdf:type
>>>> rdf:resource="http://www.w3.org/2002/07/owl#ObjectProperty"/>
>>>>
>>>> is also a separate statement, enclosed within angle brackets. The OWL
>>>> statement is not part of the subPropertyOf statement. The next
>>>> statement is also a separate statement. So we have three statements:
>>>> subPropertyOf: DC contributor
>>>> type: owl ObjectProperty
>>>> type: BIBFRAME role
>>>>
>>>> The term you were looking up is the implied subject of the
>>>> statements, making these RDF triples.
>>>>                                      Steve McDonald
>>>>                                      steve.mcdon...@tufts.edu
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of 
>>>>> Karen Coyle
>>>>> Sent: Monday, October 9, 2023 5:36 PM
>>>>> To: CODE4LIB@LISTS.CLIR.ORG
>>>>> Subject: [External] [CODE4LIB] Question about multiple 
>>>>> declarations
>>>>>
>>>>> All,
>>>>>
>>>>> I am looking at the LoC relators at id.loc.gov, and am trying to 
>>>>> understand the implications of the multiple declarations for relator 
>>>>> terms.
>>>>>
>>>>> <rdfs:subPropertyOf
>>>>> rdf:resource="http://purl.org/dc/elements/1.1/contributor"/>
>>>>> <rdf:type
>>>>> rdf:resource="http://www.w3.org/2002/07/owl#ObjectProperty"/>
>>>>> <rdf:type
>>>>> rdf:resource="http://id.loc.gov/ontologies/bibframe/Role"/>
>>>>>
>>>>> dct:contributor is not an Object Property; there is no object type 
>>>>> given, so I suppose it is de facto an Annotation Property. I read 
>>>>> the next statement as narrowing, so at statement 2 we have:
>>>>>       subproperty of dct:contributor AND an owl:ObjectProperty
>>>>>
>>>>> If my reading is correct, it would be a violation of this to use 
>>>>> the relator with a string rather than a thing.
>>>>>
>>>>> (Stop me here if I'm wrong.)
>>>>>
>>>>> Then the 3rd statement appears to say that the relator is a 
>>>>> bf:Role, which is a BIBFRAME-specific class. I can't wrap my head 
>>>>> around the functionality of this statement and would love a brief 
>>>>> explanation.
>>>>> I'm undoubtedly not into BIBFRAME deep enough to grok this.
>>>>>
>>>>> Also, my reading is that each relator is ALL THREE OF THESE; this 
>>>>> is an AND, not an OR. Right?
>>>>>
>>>>> Thanks for any help,
>>>>> kc
>>>>>
>>>>> --
>>>>> Karen Coyle
>>>>> kco...@kcoyle.net
>>>>> http://kcoyle.net
>>>>> m: +1-510-435-8234
>>>>> skype: kcoylenet/+1-510-984-3600

--
Karen Coyle
kco...@kcoyle.net
http://kcoyle.net

------------------------------

Date:    Mon, 23 Oct 2023 08:05:46 -0700
From:    Karen Coyle <li...@kcoyle.net>
Subject: Re: [External] [CODE4LIB] Question about multiple declarations

Ah, forget the first paragraph. I just found the section in the (very confusing 
- OWL DL? 2? ugh) documentation where they specifically allow ObjectProperty 
and class. But I do want to continue (or at least emphasize) the question of 
constraining the relators to ObjectProperties. I 
honestly do think that such a choice should be up to the folks using the 
vocabulary, based on their needs. If BIBFRAME wants to require IRIs as objects 
that's fine. But I see the LoC vocabularies as not being limited to BIBFRAME - 
or at least, I think that would be a good approach.

YMMV.

kc


------------------------------

Date:    Mon, 23 Oct 2023 11:11:50 -0400
From:    Emily Lavins <lavi...@bc.edu>
Subject: Deduping with finesse

Hello Code4Lib,

I received a question about deduping from one of our archivists and I'm
wondering if anyone has any experience/recommendations for this sort of
thing.

In short: We received a hard drive that has massive amounts of duplicates,
and they are starting the process of deduping and arranging it. They want
somewhat finer control over which duplicates get retained (currently using
FSlint and Bitcurator), so they can ensure 'complete sets' of files are
retained. But it'd be great to not have to manually select *every* dedup
preference in FSlint.

For example:
1. There is at least one folder that contains numbered audio tracks. When
we ran fslint raw, a few of these got deduped in favor of other copies in
the filesystem. But it would have been preferred to keep these together.
2. If there is a directory in which most of the working files were
originally created together, we'd prefer to keep the copies in that directory.
3. We'd also generally prefer to keep the copies that will *not* result in,
post-dedup, folders containing only a single file scattered throughout the
directory.

Hopefully some of that makes sense. Has anyone found any helpful workflows
for streamlining the deduping/arranging process?

All I could come up with is logging all of FSlint's decisions, so that any
undesirable dedups could more easily be tracked/reversed later, but I
really just don't know enough about any of this.
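
One possible way to express the preferences above, sketched in Python rather than as an FSlint feature: given groups of identical files, keep the copy that sits in the fullest folder, and write every decision to a CSV plan instead of deleting anything. The "fullest folder" rule is an assumption made for this illustration, and the function names and example paths are hypothetical.

```python
import csv
import io
from collections import Counter
from pathlib import PurePosixPath


def parent_of(path):
    """Return the containing folder of a slash-separated path."""
    return str(PurePosixPath(path).parent)


def choose_keepers(duplicate_groups, all_files):
    """For each group of identical files, keep the copy in the fullest folder."""
    files_per_folder = Counter(parent_of(p) for p in all_files)
    plan = []
    for group in duplicate_groups:
        # Prefer the copy whose folder holds the most files; break ties by path.
        ranked = sorted(group, key=lambda p: (-files_per_folder[parent_of(p)], p))
        keeper = ranked[0]
        for path in ranked:
            plan.append((path, "keep" if path == keeper else "remove", keeper))
    return plan


def plan_as_csv(plan):
    """Render the plan as CSV text so it can be reviewed before anything is deleted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["path", "action", "kept_copy"])
    writer.writerows(plan)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical listing: a numbered audio folder plus one stray copy.
    files = ["album/01.wav", "album/02.wav", "album/03.wav", "misc/01-copy.wav"]
    groups = [["misc/01-copy.wav", "album/01.wav"]]
    print(plan_as_csv(choose_keepers(groups, files)))
```

Because the output is only a plan, an archivist could override individual rows before any file is removed, and the CSV doubles as the log of decisions.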

Thank you very much for your time and thoughts.

All the best,
Emily


-- 
Emily Lavins
Associate Systems Librarian
Boston College Libraries

------------------------------

Date:    Mon, 23 Oct 2023 15:20:23 +0000
From:    Scott Prater <scott.pra...@wisc.edu>
Subject: Re: Deduping with finesse

Hello, Emily --

As a first pass, you may want to create and record checksums for all the files 
on the hard drive, then examine which checksums are identical.  Those files 
will be bit-for-bit exact copies of each other, and can be safely deduped.

This technique won't catch the files where the content is substantially the 
same, except for insignificant changes (an embedded date stamp, for example), 
but it may get you some ways down the path.
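
The checksum pass described above could be sketched as follows, using only the Python standard library. SHA-256 is chosen here as an assumption (any strong hash would do), and the function names are made up for this example. The script only reports groups of identical files; it deletes nothing.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(root):
    """Walk root and return lists of paths whose contents are byte-for-byte identical."""
    by_checksum = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_checksum[sha256_of(path)].append(path)
    # Only checksums shared by two or more files indicate duplicates.
    return [sorted(paths) for paths in by_checksum.values() if len(paths) > 1]
```

Calling duplicate_groups on the mounted drive's top-level folder would return one list per set of exact copies, which can then be reviewed before deciding which copy to retain.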

-- Scott

-- 
Scott Prater
Digital Library Architect
UW Digital Collections Center
University of Wisconsin - Madison

------------------------------

Date:    Mon, 23 Oct 2023 09:02:26 -0700
From:    Amanda Makula <amak...@sandiego.edu>
Subject: Digital Initiatives Symposium 2024

*CALL FOR PROPOSALS: *The 2024 Digital Initiatives Symposium (DIS) at the
University of San Diego

This year’s conference – celebrating the 10-year anniversary of the DIS –
is a full two-day live event with workshops and concurrent sessions on Day
1, and keynote, featured, and invited speakers on Day 2.

The DIS is now accepting proposals for its concurrent sessions, scheduled
for the afternoon of Monday, April 29, 2024 at the University of San Diego.


We welcome proposals from a wide variety of organizations, including
colleges and universities of all sizes, community colleges, public
libraries, special libraries, museums, and other cultural memory
institutions. Concurrent sessions will be 40 minutes in length (please
allow 10-15 minutes for Q&A) and are limited to 1-2 speakers. This year we
are particularly interested in receiving proposals about: AI, data science,
diversity and digital collections, controlled digital lending, collection
audits, new OA initiatives, and relevant legislation.


*For full submission information, and to submit a proposal, please go to:*
https://digital.sandiego.edu/symposium/ and click on *Submit Proposal* on
the left side column.

Proposal Submission Deadline: Friday, Dec. 15, 2023.

Questions? Contact digi...@sandiego.edu

-- 

Cheers,


*Amanda Y. Makula *(she/ella)

Associate Professor

Digital Initiatives Librarian

University of San Diego

5998 Alcalá Park

San Diego, CA 92110-2492

Phone: (619) 260-6850

amak...@sandiego.edu

Open access publishing at Digital USD <http://digital.sandiego.edu>

------------------------------

End of CODE4LIB Digest - 20 Oct 2023 to 23 Oct 2023 - Special issue (#2023-240)
*******************************************************************************
