Am Thu, Dec 07, 2023 at 07:59:38PM +0100 schrieb Lucas Nussbaum:
> On 07/12/23 at 09:58 +0100, Andreas Tille wrote:
> > Hi,
> > 
> > by chance I realised that the uploaders table contains some names where 
> > names
> > are not stripped:
> > 
> > udd=> select '"' || u.name || '"' as name_with_spaces, uploader from 
> > uploaders u where name like '% ' or name like ' %' ;
> >      name_with_spaces     |                 uploader                  
> > --------------------------+-------------------------------------------
> >  " Mehdi Dogguy"          |  Mehdi Dogguy <me...@debian.org>
> >  " David Paleino"         |  David Paleino <da...@debian.org>
> >  " Stéphane Glondu"      |  Stéphane Glondu <glo...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  "Andreas Tille  "        | Andreas Tille   <ti...@debian.org>
> >  " LI Daobing"            |  LI Daobing <lidaob...@debian.org>
> >  " David Paleino"         |  David Paleino <da...@debian.org>
> >  " Stefano Zacchiroli"    |  Stefano Zacchiroli <z...@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yo...@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yo...@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yo...@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yo...@debian.org>
> >  " Nikita V. Youshchenko" |  Nikita V. Youshchenko <yo...@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <col...@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <col...@debian.org>
> >  "Colin Tuckley "         | Colin Tuckley  <col...@debian.org>
> > (20 rows)
> > ...
> >    UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || 
> > email WHERE name like ' %' or name like '% ' ;
> > 
> 
> Uploaders is refreshed every few hours from archive data, so a one-time
> UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> those might be interesting for QA work.

OK.

> In your case, why don't you use the email address to identify uploaders?

Since this also does not work:

udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP 
BY uploader;
 count |              uploader              
-------+------------------------------------
     1 | Andreas Tille   <ti...@debian.org>
     1 | Andreas Tille <andr...@an3as.eu>
  8785 | Andreas Tille <ti...@debian.org>
(3 Zeilen)

> (possibly combining it with the carnivore data to identify different emails
> belonging to the same person ?)

I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO.  (1 Zeile)


BTW:  I found 

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN 
maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE 
name ilike '%tille%'  group by name;
 count |     name      
-------+---------------
 16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

Is my assumption wrong that both values should match (modulo some wrongly
spelled names)

Kind regards
    Andreas.

-- 
http://fam-tille.de

Reply via email to