Re: ORM using tons of memory and CPU

2009-12-15 Thread Jirka Vejrazka
Well, I was bound to get something wrong :)

> to_be_purged = archivedEmail.objects.filter(
>     received__lte=newest_to_delete).values('cacheID', flat=True)

The end of the line should be .values_list('cacheID', flat=True), not
.values(...).
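
In context, the corrected line reads:

to_be_purged = archivedEmail.objects.filter(
    received__lte=newest_to_delete).values_list('cacheID', flat=True)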

  Cheers

Jirka


Re: ORM using tons of memory and CPU

2009-12-15 Thread Jirka Vejrazka
Hi,

 correct me if I got it wrong, but you essentially need these 4 things:
 1) obtain the cutoff date for the newest messages to delete
 2) get the cacheID of every object to be deleted
 3) delete the files
 4) delete these objects from the database

So, you could use something like this:

# get the date
newest_to_delete = datetime.today() - timedelta(days=settings.DAYSTOKEEP)
# get cacheIDs of emails to be deleted; requires Django 1.0+
to_be_purged = archivedEmail.objects.filter(
    received__lte=newest_to_delete).values('cacheID', flat=True)
# to_be_purged now behaves like a flat list of cacheIDs
for item in to_be_purged:
    delete_file(item)  # do the os.unlink() operations
# now delete the entries from the database
archivedEmail.objects.filter(received__lte=newest_to_delete).delete()

Since you keep all files in flat directories, I'm assuming that you
are not deleting millions of emails per day (otherwise the file lookup
operations would really be the bottleneck). Hence, this "list of
cacheIDs" approach might be the fastest way. The iterator idea
suggested before is good too; you'd have to compare them yourself on
your data set to figure out which works better.

  Hope this helps

Jirka


Re: ORM using tons of memory and CPU

2009-12-15 Thread Tracy Reed
On Tue, Dec 15, 2009 at 03:35:29AM -0800, bruno desthuilliers spake thusly:
> looks like settings.DEBUG=True to me.

Nope. settings.py has DEBUG = False

> wrt/ the other mentioned problem - building whole model instances for
> each row - you can obviously save a lot of work here by using a
> values_list queryset - tuples are very cheap.

I don't understand which other problem you are referring to here... If
I mentioned something about building whole model instances for each
row I didn't realize it.

> Now for something different - here are a couple other python
> optimisation tricks:

Thanks!

> Oh and yes, one last point : how do you run this script exactly ?

I set the PYTHONPATH and DJANGO_SETTINGS_MODULE env vars and then do 
./purgemail.py

-- 
Tracy Reed
http://tracyreed.org




Re: ORM using tons of memory and CPU

2009-12-15 Thread Tracy Reed
On Tue, Dec 15, 2009 at 09:43:02AM +0200, Jani Tiainen spake thusly:
> If you have DEBUG=True setting Django also records _every_ SQL query
> made to database and depending on a case, it might use quite lot of
> memory.

My settings.py contains:

DEBUG = False

-- 
Tracy Reed
http://tracyreed.org




Re: ORM using tons of memory and CPU

2009-12-15 Thread bruno desthuilliers
On 15 Dec, 02:44, Tracy Reed wrote:
> I have code which looks basically like this:
>
> now        = datetime.today()
> beginning  = datetime.fromtimestamp(0)
> end        = now - timedelta(days=settings.DAYSTOKEEP)
>
> def purgedb():
>     """Delete archivedEmail objects from the beginning of time until
>     daystokeep days in the past."""
>     queryset   = archivedEmail.objects.all()
>     purgeset   = queryset.filter(received__range=(beginning, end))
>     for email in purgeset:
>         print email
>         try:
>             os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
>         except OSError:
>             pass
>     purgeset.delete()
>
> if __name__ == '__main__':
>     purgedb()
>
(snip)

> But when purgedb runs it deletes emails 100 at a time (which takes
> forever) and after running for a couple of hours uses a gig and a half
> of RAM. If I let it continue after a number of hours it runs the
> machine out of RAM/swap.

looks like settings.DEBUG=True to me.

> Am I doing something which is not idiomatic or misusing the ORM
> somehow? My understanding is that it should be lazy so using
> objects.all() on queryset and then narrowing it down with a
> queryset.filter() to make a purgeset should be ok, right?

No problem here as long as you don't do anything that forces
evaluation of the queryset. But it's still redundant - you may as
well build the appropriate queryset immediately.
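
For instance:

purgeset = archivedEmail.objects.filter(received__range=(beginning, end))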

> What can I
> do to make this run in reasonable time/memory?

Others already commented on checking whether you have settings.DEBUG
set to True - the usual suspect when it comes to RAM issues with
django's ORM.

wrt/ the other mentioned problem - building whole model instances for
each row - you can obviously save a lot of work here by using a
values_list queryset - tuples are very cheap.

Oh, and yes: I/O and filesystem operations are not free either. This
doesn't solve your problem with the script eating all the RAM, but it
surely impacts the overall performance.


Now for something different - here are a couple other python
optimisation tricks:

> for email in purgeset:
>     print email

Remove this. I/O is not free. Really.

> try:
>     os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
>     os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
>     os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
> except OSError:
>     pass

Move all the redundant attribute lookups (os.unlink and
settings.REAVER_CACHE) and string concatenations out of the loop:


def purgedb():
    """Delete archivedEmail objects from the beginning of time until
    DAYSTOKEEP days in the past.
    """
    text_cache = settings.REAVER_CACHE + "texts/%s"
    prob_good_cache = settings.REAVER_CACHE + "prob_good/%s"
    prob_spam_cache = settings.REAVER_CACHE + "prob_spam/%s"
    unlink = os.unlink  # hoist the attribute lookup out of the loop

    # no reason to keep these outside the function
    now = datetime.today()
    beginning = datetime.fromtimestamp(0)
    end = now - timedelta(days=settings.DAYSTOKEEP)
    qs = archivedEmail.objects.filter(received__range=(beginning, end))

    for row in qs.values_list('cacheID'):
        cacheID = row[0]
        try:
            unlink(text_cache % cacheID)
            unlink(prob_good_cache % cacheID)
            unlink(prob_spam_cache % cacheID)
        except OSError:
            pass

    qs.delete()
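
As an aside: with flat=True (also Django 1.0+), values_list yields the
bare values instead of 1-tuples, so the row[0] indexing goes away:

    for cacheID in qs.values_list('cacheID', flat=True):
        unlink(text_cache % cacheID)  # etc.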

Oh and yes, one last point : how do you run this script exactly ?

HTH


Re: ORM using tons of memory and CPU

2009-12-15 Thread rebus_
2009/12/15 Tracy Reed:
>
> I have code which looks basically like this:
>
> now        = datetime.today()
> beginning  = datetime.fromtimestamp(0)
> end        = now - timedelta(days=settings.DAYSTOKEEP)
>
> def purgedb():
>    """Delete archivedEmail objects from the beginning of time until
>    daystokeep days in the past."""
>    queryset   = archivedEmail.objects.all()
>    purgeset   = queryset.filter(received__range=(beginning, end))

You don't need both querysets (although they are lazy). You could just say:

purgeset = archivedEmail.objects.filter(received__range=(beginning, end))


>    for email in purgeset:
>        print email
>        try:
>            os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
>            os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
>            os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
>        except OSError:
>            pass
>    purgeset.delete()
>
> if __name__ == '__main__':
>    purgedb()
>
> The idea is that we are stuffing a bunch of emails in a database for
> customer service purposes. I want to clear out anything older than
> DAYSTOKEEP. The model looks like this:
>
> class archivedEmail(models.Model):
>    subject     = models.CharField(blank=True, max_length=512, null=True)
>    toAddress   = models.CharField(blank=True, max_length=128, db_index=True)
>    fromAddress = models.CharField(blank=True, max_length=128, db_index=True)
>    date        = models.DateTimeField()
>    received    = models.DateTimeField(db_index=True)
>    crmScore    = models.FloatField()
>    spamStatus  = models.CharField(max_length=6, choices=spamStatusChoices, 
> db_index=True)
>    cacheHost   = models.CharField(max_length=24)
>    cacheID     = models.CharField(max_length=31, primary_key=True)
>
>    class Meta:
>        ordering = ('-received',)
>
> But when purgedb runs it deletes emails 100 at a time (which takes
> forever) and after running for a couple of hours uses a gig and a half
> of RAM. If I let it continue after a number of hours it runs the
> machine out of RAM/swap.
>
> Am I doing something which is not idiomatic or misusing the ORM
> somehow? My understanding is that it should be lazy so using
> objects.all() on queryset and then narrowing it down with a
> queryset.filter() to make a purgeset should be ok, right? What can I
> do to make this run in reasonable time/memory?
>
> PS: I used to have ordering set to -date in the class Meta but that
> caused the db to always put an ORDER BY date on the select query which
> was unnecessary in this case causing it to take ages sorting a couple
> million rows since there is no index on date (nor did there need to
> be, so I thought, since we never select on it). Changing it to
> received makes no difference to my app but avoids creating another
> index. Django's is the first ORM I have ever used and these sneaky
> performance issues are making me wonder...
>
> --
> Tracy Reed
> http://tracyreed.org
>


When you execute a queryset that operates on millions of rows, you
should remember that the ORM will create a Python object for each row.
This can take a while, though I am not sure how it handles the memory
side of things.

You could issue a plain SQL query which returns the results as tuples
(only returning the cacheID, for example); that might be faster if you
don't really need the ORM. Then loop through the tuples and delete the
files on disk. I think this should use less memory and processing time.
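
A rough sketch of that approach; the table name is an assumption here
(Django generates <appname>_archivedemail by default), so adjust it to
your actual schema, and remove_cached_files is a hypothetical helper
wrapping the os.unlink() calls:

from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT cacheID FROM myapp_archivedemail"
               " WHERE received BETWEEN %s AND %s", [beginning, end])
for (cache_id,) in cursor.fetchall():
    remove_cached_files(cache_id)  # hypothetical helper, see above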

Also, I am no SQL guru, but I would guess BETWEEN works on most DB
servers, so you should not have portability issues (though I can't
guarantee this).

Don't use try ... except to check whether the file exists. Rather, use
an "if" statement with os.path.exists or os.path.isfile and, if the
file does exist, delete it. Raising exceptions is expensive in CPython.
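
For example (a sketch; note there is still a small race window between
the check and the unlink if files can disappear concurrently):

import os

path = settings.REAVER_CACHE + "texts/%s" % cache_id
if os.path.isfile(path):
    os.unlink(path)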

Read this topic about deleting objects in bulk (though I think you have
already seen it, it never hurts to refresh your memory):

http://docs.djangoproject.com/en/dev/topics/db/queries/#deleting-objects
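
On that note, since the deletes appear to go 100 rows at a time, one
option is to delete in explicit batches of primary keys - a sketch
only, assuming Django 1.0+ (the batch size is arbitrary; tune it for
your database):

BATCH = 1000
while True:
    # grab one batch of primary keys, forcing evaluation with list()
    pks = list(archivedEmail.objects
               .filter(received__range=(beginning, end))
               .values_list('cacheID', flat=True)[:BATCH])
    if not pks:
        break
    archivedEmail.objects.filter(pk__in=pks).delete()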

Just my 2 cents

Davor


Re: ORM using tons of memory and CPU

2009-12-14 Thread Jani Tiainen
On Mon, 2009-12-14 at 17:44 -0800, Tracy Reed wrote:
> I have code which looks basically like this:
> 
> now        = datetime.today()
> beginning  = datetime.fromtimestamp(0)
> end        = now - timedelta(days=settings.DAYSTOKEEP)
> 
> def purgedb():
>     """Delete archivedEmail objects from the beginning of time until
>     daystokeep days in the past."""
>     queryset   = archivedEmail.objects.all()
>     purgeset   = queryset.filter(received__range=(beginning, end))
>     for email in purgeset:
>         print email
>         try:
>             os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
>         except OSError:
>             pass
>     purgeset.delete()
> 
> if __name__ == '__main__':
>     purgedb()
> 
> The idea is that we are stuffing a bunch of emails in a database for
> customer service purposes. I want to clear out anything older than
> DAYSTOKEEP. The model looks like this:
> 
> class archivedEmail(models.Model):
>     subject     = models.CharField(blank=True, max_length=512, null=True)
>     toAddress   = models.CharField(blank=True, max_length=128, db_index=True)
>     fromAddress = models.CharField(blank=True, max_length=128, db_index=True)
>     date        = models.DateTimeField()
>     received    = models.DateTimeField(db_index=True)
>     crmScore    = models.FloatField()
>     spamStatus  = models.CharField(max_length=6, choices=spamStatusChoices,
>                                    db_index=True)
>     cacheHost   = models.CharField(max_length=24)
>     cacheID     = models.CharField(max_length=31, primary_key=True)
> 
>     class Meta:
>         ordering = ('-received',)
> 
> But when purgedb runs it deletes emails 100 at a time (which takes
> forever) and after running for a couple of hours uses a gig and a half
> of RAM. If I let it continue after a number of hours it runs the
> machine out of RAM/swap.
> 
> Am I doing something which is not idiomatic or misusing the ORM
> somehow? My understanding is that it should be lazy so using
> objects.all() on queryset and then narrowing it down with a
> queryset.filter() to make a purgeset should be ok, right? What can I
> do to make this run in reasonable time/memory?
> 
> PS: I used to have ordering set to -date in the class Meta but that
> caused the db to always put an ORDER BY date on the select query which
> was unnecessary in this case causing it to take ages sorting a couple
> million rows since there is no index on date (nor did there need to
> be, so I thought, since we never select on it). Changing it to
> received makes no difference to my app but avoids creating another
> index. Django's is the first ORM I have ever used and these sneaky
> performance issues are making me wonder...
> 

If you have DEBUG=True in your settings, Django also records _every_
SQL query made to the database and, depending on the case, that can
use quite a lot of memory.
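
You can see this from a long-running script - a quick sketch:

from django.db import connection, reset_queries

# with DEBUG = True, every executed query is appended to this list:
print len(connection.queries)
reset_queries()  # empties the stored query log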

-- 

Jani Tiainen



Re: ORM using tons of memory and CPU

2009-12-14 Thread fordprefect
Try using a) queryset.iterator() to iterate through the results and b)
paginating the results. Otherwise you run the risk of loading all the
results into memory at once.
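
For example, the iterator() variant is a one-line change (a sketch;
purge_files is a hypothetical helper wrapping the os.unlink() calls):

purgeset = archivedEmail.objects.filter(received__range=(beginning, end))
for email in purgeset.iterator():  # avoids caching every instance on the queryset
    purge_files(email.cacheID)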

On Dec 15, 1:44 am, Tracy Reed wrote:
> I have code which looks basically like this:
>
> now        = datetime.today()
> beginning  = datetime.fromtimestamp(0)
> end        = now - timedelta(days=settings.DAYSTOKEEP)
>
> def purgedb():
>     """Delete archivedEmail objects from the beginning of time until
>     daystokeep days in the past."""
>     queryset   = archivedEmail.objects.all()
>     purgeset   = queryset.filter(received__range=(beginning, end))
>     for email in purgeset:
>         print email
>         try:
>             os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
>             os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
>         except OSError:
>             pass
>     purgeset.delete()
>
> if __name__ == '__main__':
>     purgedb()
>
> The idea is that we are stuffing a bunch of emails in a database for
> customer service purposes. I want to clear out anything older than
> DAYSTOKEEP. The model looks like this:
>
> class archivedEmail(models.Model):
>     subject     = models.CharField(blank=True, max_length=512, null=True)
>     toAddress   = models.CharField(blank=True, max_length=128, db_index=True)
>     fromAddress = models.CharField(blank=True, max_length=128, db_index=True)
>     date        = models.DateTimeField()
>     received    = models.DateTimeField(db_index=True)
>     crmScore    = models.FloatField()
>     spamStatus  = models.CharField(max_length=6, choices=spamStatusChoices, 
> db_index=True)
>     cacheHost   = models.CharField(max_length=24)
>     cacheID     = models.CharField(max_length=31, primary_key=True)
>
>     class Meta:
>         ordering = ('-received',)
>
> But when purgedb runs it deletes emails 100 at a time (which takes
> forever) and after running for a couple of hours uses a gig and a half
> of RAM. If I let it continue after a number of hours it runs the
> machine out of RAM/swap.
>
> Am I doing something which is not idiomatic or misusing the ORM
> somehow? My understanding is that it should be lazy so using
> objects.all() on queryset and then narrowing it down with a
> queryset.filter() to make a purgeset should be ok, right? What can I
> do to make this run in reasonable time/memory?
>
> PS: I used to have ordering set to -date in the class Meta but that
> caused the db to always put an ORDER BY date on the select query which
> was unnecessary in this case causing it to take ages sorting a couple
> million rows since there is no index on date (nor did there need to
> be, so I thought, since we never select on it). Changing it to
> received makes no difference to my app but avoids creating another
> index. Django's is the first ORM I have ever used and these sneaky
> performance issues are making me wonder...
>
> --
> Tracy Reed
> http://tracyreed.org
>


ORM using tons of memory and CPU

2009-12-14 Thread Tracy Reed

I have code which looks basically like this:

now        = datetime.today()
beginning  = datetime.fromtimestamp(0)
end        = now - timedelta(days=settings.DAYSTOKEEP)

def purgedb():
    """Delete archivedEmail objects from the beginning of time until
    daystokeep days in the past."""
    queryset   = archivedEmail.objects.all()
    purgeset   = queryset.filter(received__range=(beginning, end))
    for email in purgeset:
        print email
        try:
            os.unlink(settings.REAVER_CACHE+"texts/%s"     % email.cacheID)
            os.unlink(settings.REAVER_CACHE+"prob_good/%s" % email.cacheID)
            os.unlink(settings.REAVER_CACHE+"prob_spam/%s" % email.cacheID)
        except OSError:
            pass
    purgeset.delete()

if __name__ == '__main__':
    purgedb()

The idea is that we are stuffing a bunch of emails in a database for
customer service purposes. I want to clear out anything older than
DAYSTOKEEP. The model looks like this:

class archivedEmail(models.Model):
    subject     = models.CharField(blank=True, max_length=512, null=True)
    toAddress   = models.CharField(blank=True, max_length=128, db_index=True)
    fromAddress = models.CharField(blank=True, max_length=128, db_index=True)
    date        = models.DateTimeField()
    received    = models.DateTimeField(db_index=True)
    crmScore    = models.FloatField()
    spamStatus  = models.CharField(max_length=6, choices=spamStatusChoices,
                                   db_index=True)
    cacheHost   = models.CharField(max_length=24)
    cacheID     = models.CharField(max_length=31, primary_key=True)

    class Meta:
        ordering = ('-received',)

But when purgedb runs it deletes emails 100 at a time (which takes
forever) and after running for a couple of hours uses a gig and a half
of RAM. If I let it continue after a number of hours it runs the
machine out of RAM/swap.

Am I doing something which is not idiomatic or misusing the ORM
somehow? My understanding is that it should be lazy so using
objects.all() on queryset and then narrowing it down with a
queryset.filter() to make a purgeset should be ok, right? What can I
do to make this run in reasonable time/memory?

PS: I used to have ordering set to -date in the class Meta, but that
caused the db to always put an ORDER BY date on the select query. That
was unnecessary in this case and caused it to take ages sorting a
couple million rows, since there is no index on date (nor did there
need to be, I thought, since we never select on it). Changing it to
received makes no difference to my app but avoids creating another
index. Django's is the first ORM I have ever used, and these sneaky
performance issues are making me wonder...
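
Concretely, the Meta change described above looks like this (sketch of
the relevant part only):

    class Meta:
        # before: ordering = ('-date',)  # forced an ORDER BY on an unindexed column
        ordering = ('-received',)        # received already has db_index=True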

-- 
Tracy Reed
http://tracyreed.org

