I posted this question to Stack Overflow but didn't get a good answer:
http://stackoverflow.com/questions/28080545/django-find-duplicates-with-queryset-and-regex
I want to find duplicates in db fields with a regex.
I know I can use this to find duplicates:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
I have a model like this:
class Company(models.Model):
    Website = models.URLField(blank=True, null=True)
The problem is that www and non-www URLs are treated as different
websites. I want something that returns duplicates while treating the
www and non-www forms of a URL as the same website.
I know I can match both www and non-www URLs with a regex lookup like this:
Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
Here is an example:
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')
When I call:
Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
It returns:
1. http://example.org (from name=a) and http://example.org (from name=b)
This misses the fact that example.com and www.example.com are the same
website and are therefore duplicates.
I want to use a regex so that I can tell Django that example.com and
www.example.com are the same website.
I want the result to be:
1. http://example.org (from name=a) and http://example.org (from name=b)
2. http://example.com and http://www.example.com
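To make the grouping I am after concrete, here is a minimal sketch in plain Python. It does the work outside the database, so it is not the single queryset I am asking for. The names HOST_RE, site_key and find_duplicates are made up for this illustration, and the rows are hard-coded (id, Website) pairs standing in for the example data above.

```python
import re
from collections import defaultdict

# Hypothetical pattern: optional scheme, optional "www.", then the host.
HOST_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^/:?#]+)', re.IGNORECASE)


def site_key(url):
    """Return the lowercased host without a leading 'www.', or None."""
    match = HOST_RE.match(url or '')
    return match.group(1).lower() if match else None


def find_duplicates(rows):
    """Group (id, url) pairs by site_key; keep groups with more than one row."""
    groups = defaultdict(list)
    for pk, url in rows:
        key = site_key(url)
        if key is not None:
            groups[key].append((pk, url))
    return {key: items for key, items in groups.items() if len(items) > 1}


# Stand-in for the four example rows created above.
rows = [
    (1, 'http://example.com'),
    (2, 'http://www.example.com'),
    (3, 'http://example.org'),
    (4, 'http://example.org'),
]
duplicates = find_duplicates(rows)
```

In a real project the rows could presumably come from Company.objects.values_list('id', 'Website'), but that pulls every row into Python, which is why I would still prefer a way to do this in the query itself.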