As an example of where the current behaviour of GWTools allowing duplicates is a problem: my NYPL uploads are very large files (up to ~300MB images), and unfortunately there are instances where the library has given multiple identities to the same scanned image. Three identical duplicates of a map as uploaded by GWT, two of which must be deleted at some point:

1. https://commons.wikimedia.org/wiki/File:Carta_dell%27_Egitto,_Sudan,_Mar_Rosso,_Assab,_Massaua,_Abissinia_ecc._NYPL1690373.tiff
2. https://commons.wikimedia.org/wiki/File:Carta_general_del_Oceano_Atlantico_%C3%BA_ocidental_desde_52%C2%BA_de_latitud_norte_hasta_el_Equador_-_construida_de_orden_del_Rey_en_el_Deposito_Hidrografico_de_Marina_y_presentada_%C3%A1_S.M._por_mano_del_Exmo._NYPL434600.tiff
3. https://commons.wikimedia.org/wiki/File:Cartagena_NYPL1505044.tiff
The example file is 97MB, and to test for this duplicate using the API myself I would have to download the file locally, calculate its SHA-1, and then query the Commons API for possible duplicates; this also assumes the EXIF data has not been changed. Given the sizes of the files, and that this is a batch upload of more than 10,000 images, that is not practical, and it would in effect make GWT irrelevant, as I could then simply upload my local copy without bothering to create an xml and set up GWT. The other checks I run when preparing my xml, such as matching on filename and NYPL unique ID, cannot find these duplicates. I currently have no idea how many digitally identical duplicates GWT has allowed into the NYPL uploads; this is now a longer-term, post-upload housekeeping issue.

Fæ
--
[email protected]
https://commons.wikimedia.org/wiki/User:Fae
_______________________________________________
Glamtools mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/glamtools
