The benefit to doing stuff like this in the database is that the database can do it smarter than you can in Python. Is that true in this case? I think not. If we were talking raw MySQL I'd suggest running your Website column through something like REPLACE(Website, '://www.', '') and then the count would work reasonably well.
But we're not talking raw MySQL. And if you want to use QuerySet I assume you're still trying to stay database agnostic. That breaks it down for me into two cases: 1. This is a query you'll be frequently and must run quickly. In that case, I'd say to alter your model to include a normalized_website field that is calculated from Website. Strip the 'www.' either in an overridden save method or a pre_save signal. And don't use regexp for this work; urlparse and urlunparse are much more appropriate (never use a regexp to parse structured data when you can help it). Write a data migration to handle the existing table. Then you can run your query over normalized_website and things will work fine. 2. This is a query you'll run infrequently and doesn't need to run quickly. In that case, just pull out all the records and do the count/filter with Python code. On Sun, Jan 25, 2015 at 10:58 PM, <[email protected]> wrote: > I posted this question to stackover flow but didn't get a good answer: > http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex > > I want to find duplicates in db fields with a regex. > > I know I can use this to find duplicates: > self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1) > > I have a model like this: > class company(models.Model): > Website = models.URLField(blank=True, null=True ) > > The problem is that www and non-www websites are marked as different > websites. I want some thing that will return duplicates where it realizes > www and non-www are the same website. > > I know I can use a regex like this for www and non-www: > Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' > > Here is an example: > Company.objects.create(Website='http://example.com') > Company.objects.create(Website='http://www.example.com') > Company.objects.create(Website='http://example.org', Name='a') > Company.objects.create(Website='http://example.org', Name='b') > > When I call: > Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1) > > It returns: > 1. http://example.org (from name=a) and http://example.org (from name=b) > > This is missing that example.com and www.example.com are the same website > and duplicates. > > I want to use a regex so that I can tell django that example.com and > www.example.com are the same websites. > > I want the return to be: > 1. http://example.org (from name=a) and http://example.org (from name=b) > 2. example.com www.example.com > > > -- > You received this message because you are subscribed to the Google Groups > "Django users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/django-users. > To view this discussion on the web visit > https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com. > For more options, visit https://groups.google.com/d/optout. -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/CAD4ANxVaKtLC%2BEBFRibMd%3D8M1nU56RQi0Brcoiy9Vs4ErTtNjg%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.

