Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum: > On 07/12/23 at 09:58 +0100, Andreas Tille wrote: > > Hi, > > > > by chance I realised that the uploaders table contains some names where > > names > > are not stripped: > > > > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from > > uploaders u where name like '% ' or name like ' %' ; > > name_with_spaces | uploader > > --------------------------+------------------------------------------- > > " Mehdi Dogguy" | Mehdi Dogguy <me...@debian.org> > > " David Paleino" | David Paleino <da...@debian.org> > > " Stéphane Glondu" | Stéphane Glondu <glo...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > "Andreas Tille " | Andreas Tille <ti...@debian.org> > > " LI Daobing" | LI Daobing <lidaob...@debian.org> > > " David Paleino" | David Paleino <da...@debian.org> > > " Stefano Zacchiroli" | Stefano Zacchiroli <z...@debian.org> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > " Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org> > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > "Colin Tuckley " | Colin Tuckley <col...@debian.org> > > (20 rows) > > ... > > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || > > email WHERE name like ' %' or name like '% ' ; > > > > Uploaders is refreshed every few hours from archive data, so a one-time > UPDATE would not help. UDD usually tries to preserve inaccuracies, so > those might be interesting for QA work.
OK. > In your case, why don't you use the email address to identify uploaders? Since this also does not work: udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader; count | uploader -------+------------------------------------ 1 | Andreas Tille <ti...@debian.org> 1 | Andreas Tille <andr...@an3as.eu> 8785 | Andreas Tille <ti...@debian.org> (3 Zeilen) > (possibly combining it with the carnivore data to identify different emails > belonging to the same person ?) I could fiddle around with carnivore but that's overkill for thst purpose and I insist that not stripping blanks from names does not make any sense, IMHO. (1 Zeile) BTW: I found udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name; count | name -------+--------------- 16524 | Andreas Tille (1 Zeile) So why do I have 8707 uploads per uploaders but 16524 per upload_history? Is my assumption wrong that both values should match (modulo some wrongly spelled names) Kind regards Andreas. -- http://fam-tille.de