As an example of how the current GWTools behaviour of allowing
duplicates is a problem: my NYPL uploads are very large files (up to
~300MB images), and unfortunately there are instances where the
library has given multiple identities to the same scanned image.
Here are three identical duplicates of a map as uploaded by GWT; two
must be deleted at some point:
1. https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Rosso,_Assab,_Massaua,_Abissinia_ecc._NYPL1690373.tiff
2. https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_%C3%BA_ocidental_desde_52%C2%BA_de_latitud_norte_hasta_el_Equador_-_construida_de_orden_del_Rey_en_el_Deposito_Hidrografico_de_Marina_y_presentada_%C3%A1_S.M._por_mano_del_Exmo._NYPL434600.tiff
3. https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff

The example file is 97MB. To test for such a duplicate using the API
myself, I would have to download the file locally, calculate its
SHA-1 and then query the Commons API for possible matches; this also
assumes that the EXIF data had not been changed. Considering the
sizes of the files, and that this is a batch upload of more than
10,000 images, that is not practical and would in effect make the
GWT irrelevant, as I could then upload my local copy without
bothering to create an XML and set up GWT.

Other checks I run when preparing my XML, such as matching by
filename and NYPL unique ID, cannot find these duplicates. I
currently have no idea how many digitally identical duplicates the
GWT has allowed into the NYPL uploads; this is now a longer-term
post-upload housekeeping issue.
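For the post-upload housekeeping, the API can at least report a file's SHA-1 directly (prop=imageinfo with iiprop=sha1), so already-uploaded files could be grouped by hash without downloading anything. A rough sketch, assuming you already have the list of uploaded file titles:

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

API = "https://commons.wikimedia.org/w/api.php"

def fetch_sha1s(titles):
    # prop=imageinfo with iiprop=sha1 reports each file's hash
    # server-side, so nothing has to be downloaded locally.
    pairs = []
    for i in range(0, len(titles), 50):  # the API caps titles at 50 per request
        query = urllib.parse.urlencode({
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "sha1",
            "titles": "|".join(titles[i:i + 50]),
            "format": "json",
        })
        with urllib.request.urlopen(API + "?" + query) as resp:
            data = json.load(resp)
        for page in data["query"]["pages"].values():
            info = page.get("imageinfo")
            if info:
                pairs.append((page["title"], info[0]["sha1"]))
    return pairs

def duplicate_groups(pairs):
    # Group titles that share a SHA-1; any group of 2+ titles is a
    # set of digitally identical duplicates.
    by_sha1 = defaultdict(list)
    for title, sha1 in pairs:
        by_sha1[sha1].append(title)
    return [group for group in by_sha1.values() if len(group) > 1]
```

Running duplicate_groups(fetch_sha1s(titles)) over the NYPL batch would list the duplicate sets for later deletion, though ideally GWT would do this check at upload time.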

Fæ
-- 
[email protected] https://commons.wikimedia.org/wiki/User:Fae

_______________________________________________
Glamtools mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/glamtools