Thanks, you've definitely given me some things to think about. I'm doing XHR requests that return JSON for the scraping, but later I will probably scrape normal pages too, so I will definitely look at your ETag suggestion (I'm not familiar with it, so I'll read up on it).
Given it's XHR and JSON I presume the ETag isn't relevant, so I think your idea of setting a flag is a good one. So for each row in each table (e.g. a Supplier) that I re-scrape, I would fetch the instance from the database by its unique_id, compare each attribute with the re-scraped JSON, and set the flag and update the instance if anything differs. The data will change only a fraction of a percent of the time (it is constant most of the time), and there will be about 70k rows with 50-100 fields. The DB is Postgres (on Heroku for now).

On Friday, 6 November 2015 17:12:05 UTC, Dan Tagg wrote:
>
> If you are web scraping you really need your code to be as efficient as
> possible and to do as little as possible. Firstly, make sure you are using
> everything the servers of the websites you are scraping are giving you to
> decide whether to bother downloading the page. For example, check the ETag
> and only bother to scrape if it is different from the last time you scraped
> data. If you don't trust the server's ETag, you can hash the page when you
> download it and check that against your stored hash so you can check
> whether it changed and whether it's worth processing.
>
> Your approach of trying a 'get' with all the properties set and picking up
> the exception has costs. Assuming your tables have enough rows that
> scanning the entire table won't be efficient for every "get", you will need
> to have every column you are using in your "get" indexed in the database.
> This obviously has a storage cost as well as an additional insert/update
> cost and a larger cost to run the query than a simple select against a
> single key. Whether that is more efficient than getting the result and
> comparing the fields in Python I don't know. I imagine it will depend
> on what your RDBMS is and how it is hosted, as well as how many rows and
> columns will be in your database table.
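
A minimal sketch of the hashing idea above, applied to a JSON response rather than an HTML page. The function names `payload_fingerprint` and `has_changed` are made up for illustration, and it assumes the digest from the previous scrape is stored somewhere alongside the row:

```python
import hashlib
import json


def payload_fingerprint(payload):
    """Return a stable SHA-256 hex digest for a JSON-serialisable payload."""
    # sort_keys and fixed separators make the serialisation independent of
    # key order, so the same data always yields the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(payload, stored_fingerprint):
    """True when the payload differs from what was seen on the last scrape."""
    return payload_fingerprint(payload) != stored_fingerprint
```

If the digest matches the stored one, the row can be skipped without comparing any of its 50-100 fields; only a mismatch needs the field-by-field work.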
>
> You could initialise a flag to False and, as you process your scraped
> data, compare it to the attributes of your instance and set the flag to
> True if they have changed. Then don't bother saving if you get to the end
> of processing your scraped data and the modified flag has not been set to
> True.
>
> Dan
>
> On 6 November 2015 at 16:12, Yunti <[email protected]> wrote:
>
>> Hi Dan,
>>
>> Thanks for the suggestion. It's a web scraper (run as a Django management
>> command) which then saves the data to the database via the Django ORM.
>> Given it's a scraper rather than a form (or view), is the function
>> suggested above an OK way to proceed, or would you suggest something else
>> as more appropriate/best practice?
>>
>> On Friday, 6 November 2015 14:40:59 UTC, Dan Tagg wrote:
>>>
>>> Hi Yunti,
>>>
>>> You could go up a level in the structure of your application and apply
>>> the logic there, where there is more support.
>>>
>>> Are you using Django forms? The ModelForm class pretty much does what
>>> you want: it examines form data, validating it against its type and any
>>> validation rules you have set in the form or your model, compares it to the
>>> instance's data in the database and only saves if there has been some kind
>>> of change.
>>>
>>> Dan
>>>
>>> On 6 November 2015 at 13:47, Yunti <[email protected]> wrote:
>>>
>>>> Jani,
>>>>
>>>> Thanks for your reply - you explained it much more concisely than I
>>>> did. :)
>>>>
>>>> Good to have it confirmed that update_or_create() doesn't quite do what
>>>> I needed - I was confused as to whether it would or not.
>>>>
>>>> Thanks for taking the time to write that function, it looks ideal. I'll
>>>> test it out.
>>>>
>>>> On Friday, 6 November 2015 12:52:11 UTC, Jani Tiainen wrote:
>>>>
>>>>> Your problem lies in the way Django actually carries out create or
>>>>> update.
>>>>>
>>>>> As the name suggests, create or update does either one.
>>>>> But that's what you don't want: you want a conditional update.
>>>>>
>>>>> Only update if certain fields have changed. This can be done in a
>>>>> few ways.
>>>>>
>>>>> So you want to do
>>>>> "update_only_if_at_least_one_of_default_fields_changed_or_create".
>>>>>
>>>>> The operation is simple: if the object is not found, create a new one
>>>>> using the defaults; if found, pull its values as a dict, compare them
>>>>> against the default values and, if at least one differs, do an update.
>>>>> Otherwise don't do anything.
>>>>>
>>>>> So basically the code would look something like this:
>>>>>
>>>>> def update_if_changed_or_create(**kwargs):
>>>>>     defaults = kwargs.pop('defaults', None) or {}
>>>>>
>>>>>     qs = MyModel.objects.filter(**kwargs)
>>>>>
>>>>>     if not qs:
>>>>>         # No match: create from the lookup fields plus the defaults.
>>>>>         params = dict(kwargs, **defaults)
>>>>>         obj = MyModel.objects.create(**params)
>>>>>         return obj, True  # Created object
>>>>>     elif len(qs) == 1:
>>>>>         obj = qs[0]
>>>>>         changed = False
>>>>>         for k, v in defaults.items():
>>>>>             if getattr(obj, k) != v:
>>>>>                 changed = True
>>>>>                 setattr(obj, k, v)
>>>>>         if changed:
>>>>>             obj.save()
>>>>>             return obj, False  # Updated object
>>>>>         return obj, None  # No change
>>>>>     else:
>>>>>         # Multiple objects: the lookup is ambiguous.
>>>>>         raise MyModel.MultipleObjectsReturned()
>>>>>
>>>>> On 06.11.2015 14:08, Yunti wrote:
>>>>>
>>>>> Carsten,
>>>>>
>>>>> Thanks for your reply,
>>>>>
>>>>> A note about the last statement: If a Supplier object has the same
>>>>> unique_id, and all other fields (in `defaults`) are the same as well,
>>>>> logically there is no difference between updating and not updating:
>>>>> the result is the same.
>>>>>
>>>>> The entry in the database is the same, apart from the last_updated
>>>>> flag, which stays untouched if the row isn't rewritten over the top of
>>>>> it. This means I can check for new data often and be alerted when there
>>>>> is an actual update (i.e. a change to the data). If it rewrites the
>>>>> data every time it checks then I have no idea when data was actually
>>>>> updated.
>>>>>
>>>>> Have you checked? How?
>>>>> In your create_or_update_if_diff() you seem to try to re-invent
>>>>> update_or_create(), but have you actually examined the results of the
>>>>>
>>>>>     supplier, created = Supplier.objects.update_or_create(...)
>>>>>
>>>>> call?
>>>>>
>>>>> I checked by seeing that the last_updated field in the database was
>>>>> updated every time. (I suppose the issue could be with how that field
>>>>> gets reset the next time it's run; I didn't eliminate that possibility.)
>>>>>
>>>>> Yes, I was worried that I might be recreating (a poor version of)
>>>>> update_or_create(), but it didn't seem to have an option where it
>>>>> wouldn't write to the database if there was no change to the data.
>>>>> Can it do this? And how would I verify when an item has been updated
>>>>> or created (or neither) - could I output to the console?
>>>>>
>>>>> If it can, how do I call it so it checks against all fields (unique_id
>>>>> and defaults) and updates using the defaults if it finds a difference
>>>>> (and creates if it doesn't find a unique_id)?
>>>>>
>>>>> I'm still not sure if this is possible and how to call the function,
>>>>> particularly how to pass in the remaining defaults to check against -
>>>>> **kwargs = defaults isn't right but I'm not sure what it should be.
>>>>>
>>>>> supplier, created = Supplier.objects.update_or_create(
>>>>>     unique_id=product_detail['supplierId'],
>>>>>     **kwargs=defaults,
>>>>>     defaults={
>>>>>         'name': product_detail['supplierName'],
>>>>>         'entity_name_1': entity_name_1,
>>>>>         'entity_name_2': entity_name_1,
>>>>>         'rating': product_detail['supplierRating']})
>>>>>
>>>>> On Thursday, 5 November 2015 20:05:39 UTC, Carsten Fuchs wrote:
>>>>>>
>>>>>> Hi Yunti,
>>>>>>
>>>>>> On 05.11.2015 at 18:19, Yunti wrote:
>>>>>> > I have tried to use the update_or_create() method assuming that it
>>>>>> > would either create a new entry in the db if it found none, or
>>>>>> > update an existing one if it found one and had differences to the
>>>>>> > defaults passed in - or wouldn't update if there was no difference.
>>>>>>
>>>>>> A note about the last statement: If a Supplier object has the same
>>>>>> unique_id, and all other fields (in `defaults`) are the same as well,
>>>>>> logically there is no difference between updating and not updating:
>>>>>> the result is the same.
>>>>>>
>>>>>> > However it just seemed to recreate entries each time even if there
>>>>>> > were no changes.
>>>>>>
>>>>>> Have you checked? How? In your create_or_update_if_diff() you seem to
>>>>>> try to re-invent update_or_create(), but have you actually examined
>>>>>> the results of the
>>>>>>
>>>>>>     supplier, created = Supplier.objects.update_or_create(...)
>>>>>>
>>>>>> call?
>>>>>>
>>>>>> > I think the issue was that I wanted to:
>>>>>> > 1) get an entry if all fields were the same,
>>>>>>
>>>>>> update_or_create() updates an object with the given kwargs; the match
>>>>>> is not made against *all* fields (i.e. for the match the fields in
>>>>>> `defaults` are not accounted for).
>>>>>>
>>>>>> > 2) or create a new entry if it didn't find an existing entry with
>>>>>> > the unique_id
>>>>>> > 3) or if there was an entry with the same unique_id, update that
>>>>>> > entry with remaining fields.
>>>>>>
>>>>>> update_or_create() should achieve this.
>>>>>> It's hard to tell more without additional information, but
>>>>>> https://docs.djangoproject.com/en/1.8/ref/models/querysets/#update-or-create
>>>>>> explains the function well, including how it works. If you work through
>>>>>> this in small steps, check examples and their (intermediate) results,
>>>>>> you should be able to find what the original problem was.
>>>>>>
>>>>>> Best regards,
>>>>>> Carsten
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Django users" group. To unsubscribe from this group and stop
>>>>> receiving emails from it, send an email to
>>>>> [email protected]. To post to this group, send email
>>>>> to [email protected]. Visit this group at
>>>>> http://groups.google.com/group/django-users. To view this discussion
>>>>> on the web visit
>>>>> https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>> >>> >>> >>> -- >> You received this message because you are subscribed to the Google Groups >> "Django users" group. >> To unsubscribe from this group and stop receiving emails from it, send an >> email to [email protected] <javascript:>. >> To post to this group, send email to [email protected] >> <javascript:>. >> Visit this group at http://groups.google.com/group/django-users. >> To view this discussion on the web visit >> https://groups.google.com/d/msgid/django-users/3cea33db-f2e7-4739-a202-99a717bda092%40googlegroups.com >> >> <https://groups.google.com/d/msgid/django-users/3cea33db-f2e7-4739-a202-99a717bda092%40googlegroups.com?utm_medium=email&utm_source=footer> >> . >> >> For more options, visit https://groups.google.com/d/optout. >> > > > > -- > Wildman and Herring Limited, Registered Office: 52 Great Eastern Street, > London, EC2A 3EP, Company no: 05766374 > -- You received this message because you are subscribed to the Google Groups "Django users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/django-users. To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/9cf79d88-55f6-4b87-9cc4-7050b14a52fe%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.

