The benefit to doing stuff like this in the database is that the
database can do it smarter than you can in Python. Is that true in
this case? I think not. If we were talking raw MySQL I'd suggest
running your Website column through something like REPLACE(Website,
'://www.', '') and then the count would work reasonably well.

But we're not talking raw MySQL. And if you want to use QuerySet I
assume you're still trying to stay database agnostic. That breaks it
down for me into two cases:

1. This is a query you'll be frequently and must run quickly. In that
case, I'd say to alter your model to include a normalized_website
field that is calculated from Website. Strip the 'www.' either in an
overridden save method or a pre_save signal. And don't use regexp for
this work; urlparse and urlunparse are much more appropriate (never
use a regexp to parse structured data when you can help it). Write a
data migration to handle the existing table. Then you can run your
query over normalized_website and things will work fine.

2. This is a query you'll run infrequently and doesn't need to run
quickly. In that case, just pull out all the records and do the
count/filter with Python code.

On Sun, Jan 25, 2015 at 10:58 PM,  <[email protected]> wrote:
> I posted this question to stackover flow but didn't get a good answer:
> http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex
>
> I want to find duplicates in db fields with a regex.
>
> I know I can use this to find duplicates:
> self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
>
> I have a model like this:
> class company(models.Model):
>    Website = models.URLField(blank=True, null=True )
>
> The problem is that www and non-www websites are marked as different
> websites.  I want some thing that will return duplicates where it realizes
> www and non-www are the same website.
>
> I know I can use a regex like this for www and non-www:
> Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
>
> Here is an example:
> Company.objects.create(Website='http://example.com')
> Company.objects.create(Website='http://www.example.com')
> Company.objects.create(Website='http://example.org', Name='a')
> Company.objects.create(Website='http://example.org', Name='b')
>
> When I call:
> Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
>
> It returns:
> 1.  http://example.org (from name=a) and http://example.org (from name=b)
>
> This is missing that example.com and www.example.com are the same website
> and duplicates.
>
> I want to use a regex so that I can tell django that example.com and
> www.example.com are the same websites.
>
> I want the return to be:
> 1.  http://example.org (from name=a) and http://example.org (from name=b)
> 2.  example.com www.example.com
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/CAD4ANxVaKtLC%2BEBFRibMd%3D8M1nU56RQi0Brcoiy9Vs4ErTtNjg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to