[dspace-community] Introducing FedHarv

Pascal Calarco Fri, 27 Mar 2026 12:48:51 -0700

Hi folks,

I am releasing a set of Python scripts I have been working on since late 
last November, called FedHarv (short for federated harvesting). It's 
available now publicly under an AGPL v.3 license for all to use, modify, and 
build upon, provided it stays free and open source software.


https://github.com/pvcalarco/FedHarv

FedHarv is a sophisticated, production-ready federated harvester for open 
access academic content, designed to automatically discover, enrich, and 
harvest scholarly articles with PDF availability from multiple sources. 

The problem we are trying to solve is, to the extent possible, to identify 
Creative Commons-licensed scholarly works (journal articles, letters to the 
editor, retractions, errata, book chapters, conference proceedings, and open 
access books) authored by the researchers, faculty, and students of an 
institution of higher education or research, and to harvest the metadata and 
associated PDFs from a variety of API services. Where we can't find a 
non-paywalled version, we use Unpaywall to identify author manuscripts and 
preprints that can be deposited.
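
For anyone curious what an Unpaywall fallback of this kind can look like, here is a minimal sketch. It is not FedHarv's actual code: the function names `fetch_unpaywall_record` and `pick_author_copy` are hypothetical, the sample record is invented for illustration, and the field names (`oa_locations`, `host_type`, `version`, `url_for_pdf`) are assumed to match the Unpaywall v2 REST API response.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint shape for the Unpaywall v2 REST API.
UNPAYWALL_URL = "https://api.unpaywall.org/v2/{doi}?email={email}"

# Unpaywall "version" values that denote author manuscripts or preprints.
AUTHOR_VERSIONS = {"acceptedVersion", "submittedVersion"}


def fetch_unpaywall_record(doi, email):
    """Fetch the Unpaywall record for a DOI (hypothetical helper, needs network)."""
    url = UNPAYWALL_URL.format(
        doi=urllib.parse.quote(doi), email=urllib.parse.quote(email)
    )
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def pick_author_copy(record):
    """Return the first repository-hosted author copy that has a PDF, or None."""
    for location in record.get("oa_locations") or []:
        if (
            location.get("host_type") == "repository"
            and location.get("version") in AUTHOR_VERSIONS
            and location.get("url_for_pdf")
        ):
            return location
    return None


# Invented sample record, shaped like an Unpaywall response.
sample_record = {
    "doi": "10.1234/example",
    "oa_status": "green",
    "oa_locations": [
        {"host_type": "publisher", "version": "publishedVersion",
         "url_for_pdf": None, "license": None},
        {"host_type": "repository", "version": "acceptedVersion",
         "url_for_pdf": "https://repository.example.org/article.pdf",
         "license": "cc-by"},
    ],
}

author_copy = pick_author_copy(sample_record)
```

In practice you would still want to check the `license` field of whatever location is returned before depositing anything.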

The script then places these metadata and PDFs in a series of folders so that 
the repository manager can quickly check them (for departmental and 
institutional affiliation and CC license correctness) before they are packaged 
into Simple Archive Format (SAF), ready for batch ingest into DSpace 
institutional repositories.
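
For those who have not built SAF packages by hand, the sketch below shows roughly what one item looks like on disk: a directory holding a `dublin_core.xml` file, a `contents` file listing the bitstreams, and the PDF itself. This is only an illustration under my own assumptions, not how FedHarv builds its packages; `write_saf_item` and the `item_000` naming are hypothetical.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def write_saf_item(archive_dir, index, dc_values, pdf_path):
    """Write one SAF item directory: dublin_core.xml, contents, and the PDF.

    dc_values is a list of (element, qualifier, value) tuples; a qualifier
    of None is written as "none", as SAF expects.
    """
    item_dir = Path(archive_dir) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # Build <dublin_core><dcvalue element=... qualifier=...>value</dcvalue>...
    root = ET.Element("dublin_core")
    for element, qualifier, value in dc_values:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the PDF in and list it in the contents file (ORIGINAL bundle).
    pdf_path = Path(pdf_path)
    shutil.copy(pdf_path, item_dir / pdf_path.name)
    (item_dir / "contents").write_text(
        f"{pdf_path.name}\tbundle:ORIGINAL\n", encoding="utf-8"
    )
    return item_dir
```

A call such as `write_saf_item("saf_out", 0, [("title", None, "An Example Article"), ("date", "issued", "2025")], "article.pdf")` would produce `saf_out/item_000/`, which can then be checked before running the DSpace batch import.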

The harvester isn't perfect, and you should still check that no closed or 
bronze OA items were harvested in error, but the author has made every 
effort to prevent this and has encountered few such errors after much 
iteration.

With this tool, you'll be able to gather as many as possible of the Open 
Access scholarly works that your community has formally written and legally 
deposit them in your organization's institutional repository. If you 
find this software useful, please drop me an email!

## 🤖 AI Assistance & Authorship Disclosure

**FedHarv** was designed, architected, and verified by **Pascal Calarco**.

During the development process, AI-augmented coding tools (Google Gemini 
and GitHub Copilot) were utilized to:
* Generate boilerplate code and initial function structures.
* Refactor logic for performance (e.g., implementing multi-threading).
* Assist with documentation, licensing (AGPL-v3), and testing suites.

All AI-generated suggestions have been manually reviewed, tested, and 
integrated by the author to ensure technical accuracy, conformance with 
scholarly metadata standards, and adherence to best practices in library 
and information science.

All best wishes,

Pascal

*Pascal Calarco*¦ Scholarly Communications Librarian and Systems Librarian

Lead, Discovery Team

Research & Publishing Services Unit
Librarian IV

University of Windsor ¦ J. Francis Leddy Library
401 Sunset Avenue ¦ Windsor, Ontario   N9B 3P4
(519)-253-3000 ¦ leddy.uwindsor.ca

 

 

*The University of Windsor is situated on the traditional territory of the 
Three Fires Confederacy of First Nations: the Ojibwa, the Odawa, and the 
Potawatomi.*

 

*Join the fight for post-secondary education at Education2025.ca.*  
