Thanks - you've definitely given me plenty to think about.  I'm making 
XHR requests that return JSON for the scraping (later I will probably 
scrape normal pages too, so I will definitely look at your ETag suggestion - 
I'm not familiar with it, so I will read up on it). 

Given it's XHR and JSON I presume ETag isn't relevant, so I think your idea 
of setting a flag is a good one.  So for each row in each table (e.g. a 
Supplier) that I rescrape, I get it from the database by its unique_id, 
compare each attribute to the re-scraped JSON, and set the flag and update 
the instance only if something differs.  
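
The compare-then-update step described above could be sketched roughly as follows. This is only an illustration under assumptions: the Supplier class is a plain-Python stand-in for the Django model, FIELD_MAP is a hypothetical mapping from model attribute to JSON key, and changed_fields() / apply_changes() are made-up helper names.

```python
class Supplier:
    """Hypothetical stand-in for a Django model instance (plain attributes only)."""

    def __init__(self, **fields):
        for name, value in fields.items():
            setattr(self, name, value)


def changed_fields(instance, scraped, field_map):
    """Return {attribute: (old, new)} for every mapped field whose value differs."""
    diff = {}
    for attr, json_key in field_map.items():
        old = getattr(instance, attr)
        new = scraped[json_key]
        if old != new:
            diff[attr] = (old, new)
    return diff


def apply_changes(instance, diff):
    """Copy the new values onto the instance; return True if anything changed."""
    for attr, (_old, new) in diff.items():
        setattr(instance, attr, new)
    return bool(diff)


# Assumed mapping from model attribute to JSON key.
FIELD_MAP = {"name": "supplierName", "rating": "supplierRating"}

row = Supplier(unique_id=7, name="Acme", rating=4)
same = changed_fields(row, {"supplierName": "Acme", "supplierRating": 4}, FIELD_MAP)
diff = changed_fields(row, {"supplierName": "Acme", "supplierRating": 5}, FIELD_MAP)
print(same)  # {}
print(diff)  # {'rating': (4, 5)}
```

With a real model instance, the returned dict doubles as the "modified" flag: if it is empty nothing is written, otherwise something like instance.save(update_fields=list(diff)) would write only the columns that changed.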

The data will only change a fraction of a percent of the time (it is 
constant most of the time), and there will be about 70k rows with 50-100 
fields.  The DB is Postgres (on Heroku for now). 

On Friday, 6 November 2015 17:12:05 UTC, Dan Tagg wrote:
>
> If you are web scraping you really need your code to be as efficient as 
> possible and to do as little as possible. Firstly, make sure you are using 
> everything the servers of the websites you are scraping are giving you to 
> decide whether to bother downloading the page. For example, check the ETag 
> and only bother to scrape if it is different from the last time you scraped 
> the data. If you don't trust the server's ETag, you can hash the page when 
> you download it and check that against your stored hash to see whether it 
> changed and whether it's worth processing. 
>
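A minimal sketch of the hash check described above, assuming the response body is already in hand as bytes. The names body_digest, needs_processing and seen_digests are made up for illustration, the URL is a placeholder, and the dict stands in for wherever the previous hash is actually stored. (An ETag is an ordinary HTTP response header, so where a server sends one it can be used for JSON responses as well as HTML pages.)

```python
import hashlib


def body_digest(body: bytes) -> str:
    """SHA-256 hex digest of the raw response body."""
    return hashlib.sha256(body).hexdigest()


def needs_processing(url: str, body: bytes, seen: dict) -> bool:
    """Return True (and remember the new digest) only if the body changed."""
    digest = body_digest(body)
    if seen.get(url) == digest:
        return False
    seen[url] = digest
    return True


seen_digests = {}  # hypothetical store; in practice a DB column or a cache
url = "https://example.com/api/suppliers/7"  # placeholder URL
first = needs_processing(url, b'{"supplierRating": 4}', seen_digests)
repeat = needs_processing(url, b'{"supplierRating": 4}', seen_digests)
changed = needs_processing(url, b'{"supplierRating": 5}', seen_digests)
print(first, repeat, changed)  # True False True
```

One caveat: the comparison is byte-for-byte, so a server that reorders JSON keys or embeds a timestamp will look "changed" on every request; in that case hashing a normalised form of the parsed data may work better.
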
> Your approach of trying a 'get' with all the properties set and picking up 
> the exception has costs. Assuming your tables have enough rows that 
> scanning the entire table for every "get" won't be efficient, you will need 
> to have every column you are using in your "get" indexed in the database. 
> This obviously has a storage cost as well as an additional insert/update 
> cost, and the query costs more to run than a simple select against a 
> single key. Whether that is more efficient than getting the result and 
> comparing the fields in Python I don't know. I imagine it will depend 
> on what your RDBMS is and how it is hosted, as well as how many rows and 
> columns are in your database table.
>
> You could initialise a flag to False and as you process your scraped data 
> you could compare it to the attributes of your instance and set the flag to 
> True if they have changed and then not bother saving if you get to the end 
> of processing your scraped data and the modified flag has not been set to 
> True.
>
> Dan 
>
> On 6 November 2015 at 16:12, Yunti <[email protected]> wrote:
>
>> Hi Dan,
>>
>> Thanks for the suggestion, it's a web scraper (run as a django management 
>> command) which then saves the data to the database via the Django ORM.  
>> Given it's a scraper rather than a form (or view) is the above suggested 
>> function an ok way to proceed or would you suggest something else is more 
>> appropriate/best practice?
>>
>>
>>
>> On Friday, 6 November 2015 14:40:59 UTC, Dan Tagg wrote:
>>>
>>> Hi Yunti,
>>>
>>>
>>> You could go up a level in the structure of your application and apply 
>>> the logic there, where there is more support.
>>>
>>> Are you using Django forms? The ModelForm class does most of what 
>>> you want: it examines form data, validates it against its type and any 
>>> validation rules you have set in the form or your model, and compares it 
>>> to the instance's data, so you can check has_changed() and only save if 
>>> there has been some kind of change. 
>>>
>>> Dan
>>>
>>> On 6 November 2015 at 13:47, Yunti <[email protected]> wrote:
>>>
>>>> Jani,
>>>>
>>>> Thanks for your reply - you explained it much more concisely than I 
>>>> did. :)
>>>>
>>>> Good to have it confirmed that update_or_create() doesn't quite do what 
>>>> I needed - I was confused as to whether it would or not.
>>>>
>>>> Thanks for taking the time to do that function, that looks ideal. I'll 
>>>> test it out.
>>>>
>>>>
>>>> On Friday, 6 November 2015 12:52:11 UTC, Jani Tiainen wrote:
>>>>
>>>>> Your problem lies in the way Django actually carries out create or 
>>>>> update.
>>>>>
>>>>> As the name suggests, update_or_create() does one or the other. But 
>>>>> that's not what you want - you want a conditional update: only update 
>>>>> if certain fields have changed. This can be done a few ways.
>>>>>
>>>>> So you want to do 
>>>>> "update_only_if_at_least_one_of_default_fields_changed_or_create"
>>>>>
>>>>> The operation is simple: if the object is not found, create a new one 
>>>>> using the defaults. If it is found, pull its values, compare them 
>>>>> against the default values, and if at least one differs do an update. 
>>>>> Otherwise don't do anything.
>>>>>
>>>>> So basically the code would look something like this:
>>>>>
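>>>>> def update_if_changed_or_create(**kwargs):
>>>>>     # Lookup fields arrive as kwargs; the fields to compare and
>>>>>     # write arrive in the 'defaults' dict.
>>>>>     defaults = kwargs.pop('defaults', None) or {}
>>>>>
>>>>>     qs = MyModel.objects.filter(**kwargs)
>>>>>     count = len(qs)
>>>>>
>>>>>     if count == 0:
>>>>>         # Not found: create from the lookup fields plus the defaults.
>>>>>         obj = MyModel(**kwargs)
>>>>>         for k, v in defaults.items():
>>>>>             setattr(obj, k, v)
>>>>>         obj.save()  # save() returns None, so keep obj separately
>>>>>         return obj, True  # Created object
>>>>>     elif count == 1:
>>>>>         obj = qs[0]
>>>>>         changed = False
>>>>>         for k, v in defaults.items():
>>>>>             if getattr(obj, k) != v:
>>>>>                 changed = True
>>>>>                 setattr(obj, k, v)
>>>>>         if changed:
>>>>>             obj.save()
>>>>>             return obj, False  # Updated object
>>>>>         return obj, None  # No change.
>>>>>     else:
>>>>>         # Multiple objects matched the lookup.
>>>>>         raise MyModel.MultipleObjectsReturned(
>>>>>             'lookup matched %d objects' % count)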
>>>>>
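To see the three outcomes (created, updated, no change) without a database, the same logic can be sketched against a plain dict. Everything here is a stand-in for illustration: the dict plays the role of the table, the key plays the role of unique_id, and LABELS is a made-up mapping used only to print a readable status to the console.

```python
def update_if_changed_or_create(store, key, defaults):
    """Return (record, status): True = created, False = updated, None = unchanged."""
    record = store.get(key)
    if record is None:
        store[key] = dict(defaults)
        return store[key], True
    changed = {k: v for k, v in defaults.items() if record.get(k) != v}
    if not changed:
        return record, None
    record.update(changed)
    return record, False


LABELS = {True: "created", False: "updated", None: "unchanged"}

table = {}  # hypothetical in-memory "table", keyed by unique_id
log = []
for payload in ({"name": "Acme", "rating": 4},
                {"name": "Acme", "rating": 4},
                {"name": "Acme", "rating": 5}):
    _, status = update_if_changed_or_create(table, 7, payload)
    log.append(LABELS[status])
    print("supplier 7:", LABELS[status])
```

The three passes print created, unchanged and updated in that order; only the first and last would touch the database in the real version.
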
>>>>>
>>>>> On 06.11.2015 14:08, Yunti wrote:
>>>>>
>>>>> Carsten,
>>>>>
>>>>> Thanks for your reply,
>>>>>
>>>>> > A note about the last statement: If a Supplier object has the same 
>>>>> > unique_id, and all other fields (in `defaults`) are the same as well, 
>>>>> > logically there is no difference between updating and not updating – 
>>>>> > the result is the same. 
>>>>>
>>>>> The entry in the database is the same - apart from the last_updated 
>>>>> field, which stays untouched only if the row is not rewritten.  Leaving 
>>>>> it untouched means I can check for new data often and be alerted when 
>>>>> there is an actual update (i.e. a change to the data).  If it rewrites 
>>>>> the data every time it checks then I have no idea when the data was 
>>>>> actually updated.
>>>>>
>>>>> > Have you checked? How? 
>>>>> > In your create_or_update_if_diff() you seem to try to re-invent 
>>>>> > update_or_create(), but have you actually examined the results of the 
>>>>> >
>>>>> >      supplier, created = Supplier.objects.update_or_create(...) 
>>>>> >
>>>>> > call? 
>>>>>
>>>>> I checked by seeing that the last_updated field in the database was 
>>>>> updated every time.  (I suppose the issue could be with how that field 
>>>>> gets reset the next time it's run - I didn't eliminate that 
>>>>> possibility.)
>>>>>
>>>>> Yes, I was worried that I might be recreating (a poor version of) 
>>>>> update_or_create(), but it didn't seem to have an option to skip the 
>>>>> write to the database when there was no change to the data.   
>>>>> Can it do this? And how would I verify when an item has been updated 
>>>>> or created (or neither) - could I output to the console? 
>>>>>
>>>>> If it can, how do I call it so it checks against all fields (unique_id 
>>>>> and defaults), updates using the defaults if it finds a difference, 
>>>>> and creates if it doesn't find a unique_id?
>>>>>
>>>>> I'm still not sure if this is possible and how to call the function, 
>>>>> particularly how to pass in the remaining defaults to check against - 
>>>>> **kwargs = defaults isn't right but I'm not sure what it should be.
>>>>>
>>>>> supplier, created = Supplier.objects.update_or_create(
>>>>>     unique_id=product_detail['supplierId'],
>>>>>     **kwargs=defaults,
>>>>>     defaults={
>>>>>         'name': product_detail['supplierName'],
>>>>>         'entity_name_1': entity_name_1,
>>>>>         'entity_name_2': entity_name_1,
>>>>>         'rating': product_detail['supplierRating']})
>>>>>
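On the calling question: `**kwargs=defaults` is not valid Python. Django documents the signature as update_or_create(defaults=None, **kwargs), where the keyword arguments are the lookup and `defaults` holds the values to write, so the extra line can simply be dropped. The sketch below uses a stand-in function with that same signature (it is not Django's implementation and touches no database) purely to show how the arguments are split; the sample values are invented.

```python
def update_or_create(defaults=None, **kwargs):
    """Stand-in that only reports how the arguments are split (no database)."""
    return {"lookup": kwargs, "defaults": defaults or {}}


# Invented sample data in the shape used in the thread.
product_detail = {"supplierId": 7, "supplierName": "Acme", "supplierRating": 4}
entity_name_1 = "Acme Ltd"

call = update_or_create(
    unique_id=product_detail["supplierId"],  # lookup: used to find the row
    defaults={                               # values written on create/update
        "name": product_detail["supplierName"],
        "entity_name_1": entity_name_1,
        "rating": product_detail["supplierRating"],
    },
)
print(call["lookup"])    # {'unique_id': 7}
print(call["defaults"])
```

Note that the fields in `defaults` are never part of the match, so this call on its own will still write on every run; skipping the write when nothing changed needs a comparison step on top.
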
>>>>> On Thursday, 5 November 2015 20:05:39 UTC, Carsten Fuchs wrote:
>>>>>>
>>>>>> Hi Yunti,
>>>>>>
>>>>>> On 05.11.2015 at 18:19, Yunti wrote:
>>>>>> > I have tried to use the update_or_create() method assuming that it 
>>>>>> > would either, create a new entry in the db if it found none or 
>>>>>> > update an existing one if it found one and had differences to the 
>>>>>> > defaults passed in  - or wouldn't update if there was no difference.
>>>>>>
>>>>>> A note about the last statement: If a Supplier object has the same 
>>>>>> unique_id, and all other fields (in `defaults`) are the same as well, 
>>>>>> logically there is no difference between updating and not updating – 
>>>>>> the result is the same.
>>>>>>
>>>>>> > However it just seemed to recreate entries each time even if there 
>>>>>> > were no changes.
>>>>>>
>>>>>> Have you checked? How?
>>>>>> In your create_or_update_if_diff() you seem to try to re-invent 
>>>>>> update_or_create(), but have you actually examined the results of the
>>>>>>
>>>>>>     supplier, created = Supplier.objects.update_or_create(...)
>>>>>>
>>>>>> call?
>>>>>>
>>>>>> > I think the issue was that I wanted to:
>>>>>> > 1)  get an entry if all fields were the same,
>>>>>>
>>>>>> update_or_create() updates an object with the given kwargs, the match 
>>>>>> is not made against *all* fields (i.e. for the match the fields in 
>>>>>> `defaults` are not accounted for).
>>>>>>
>>>>>> > 2) or create a new entry if it didn't find an existing entry with 
>>>>>> > the unique_id
>>>>>> > 3) or if there was an entry with the same unique_id, update that 
>>>>>> > entry with remaining fields.
>>>>>>
>>>>>> update_or_create() should achieve this.
>>>>>>
>>>>>> It's hard to tell more without additional information, but
>>>>>> https://docs.djangoproject.com/en/1.8/ref/models/querysets/#update-or-create
>>>>>> explains the function well, including how it works. If you work through 
>>>>>> this in small steps, check examples and their (intermediate) results, 
>>>>>> you should be able to find what the original problem was.
>>>>>>
>>>>>> Best regards,
>>>>>> Carsten
>>>>>
>>>>> -- You received this message because you are subscribed to the Google 
>>>>> Groups "Django users" group. To unsubscribe from this group and stop 
>>>>> receiving emails from it, send an email to 
>>>>> [email protected]. To post to this group, send email 
>>>>> to [email protected]. Visit this group at 
>>>>> http://groups.google.com/group/django-users. To view this discussion 
>>>>> on the web visit 
>>>>> https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>  
>>>>> For more options, visit https://groups.google.com/d/optout. 
>>>>>
>
>
>
> -- 
> Wildman and Herring Limited, Registered Office: 52 Great Eastern Street, 
> London, EC2A 3EP, Company no: 05766374
>
