I posted this question to stackover flow but didn't get a good answer: 
http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex

I want to find duplicates in db fields with a regex. 

I know I can use this to find duplicates:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model like this: 
class company(models.Model):
   Website = models.URLField(blank=True, null=True )

The problem is that www and non-www websites are marked as different 
websites.  I want some thing that will return duplicates where it realizes 
www and non-www are the same website.

I know I can use a regex like this for www and non-www:
Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Here is an example:
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call: 
Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:
1.  http://example.org (from name=a) and http://example.org (from name=b)

This is missing that example.com and www.example.com are the same website 
and duplicates.

I want to use a regex so that I can tell django that example.com and 
www.example.com are the same websites.

I want the return to be:
1.  http://example.org (from name=a) and http://example.org (from name=b)
2.  example.com www.example.com
    

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/452fad73-1319-4954-b004-7d0604705f30%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to