Hi Malik,

Apologies if it came off a bit rough.  I didn't mean to direct that at you
in particular -- I have no idea who you are, or your history, and it would
be presumptuous of me to assume.

I meant it more in a general sense.  Scraping is very rewarding and can be
quite lucrative, but it really requires a lot of patience.  Essentially
you're reverse engineering a system that might not want to be reverse
engineered.  And with all the JS frontend technologies so often misapplied,
breaking the web's semantics, sometimes you have to reverse engineer
something quite ugly.

So my opinion is that perseverance is probably the most important mindset a
scraper / data engineer can have.  You also need a lot of tricks in your
toolbelt -- countering blocking, rate limiting so you don't impact sites,
proxying requests, scraping JS-rendered pages, reverse engineering AJAX
calls, plus a billion others.
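The rate-limiting point, for instance, takes only a few lines.  Here's a
minimal sketch (my own toy class, not a scrapy feature -- the 0.1s interval
is just for the demo; real crawls want seconds, not fractions):

```python
import time

class RateLimiter:
    """Minimal fixed-interval rate limiter: guarantees at least
    `min_interval` seconds between successive acquire() calls."""
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def acquire(self):
        now = time.monotonic()
        if self._last is not None:
            # Sleep off whatever remains of the interval since the last call.
            wait = self.min_interval - (now - self._last)
            if wait > 0:
                time.sleep(wait)
        self._last = time.monotonic()

# Demo: three "requests" spaced at least 0.1s apart.
limiter = RateLimiter(0.1)
start = time.monotonic()
for _ in range(3):
    limiter.acquire()  # in a real crawler this would wrap the HTTP fetch
elapsed = time.monotonic() - start
```

A real crawler goes further -- jittered delays, per-domain limits, honoring
robots.txt -- but the principle is the same: be a polite guest.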

It's why I love the topic -- there's so much to know, and every situation
requires a slightly different application of techniques.  But once you have
the right combination, it sure is fun when the data starts to fly in!
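P.S. Since the whitespace question below is so concrete: the fix is a
one-line cleanup after extraction.  A minimal sketch (the helper name and
sample strings are mine, mimicking the tutorial output):

```python
def clean(fragments):
    """Join the list of text fragments that xpath extraction returns,
    then collapse all runs of whitespace (\r, \n, \t, spaces)."""
    return ' '.join(' '.join(fragments).split())

# Sample mimicking the raw 'desc' output shown below.
raw = [u'\r\n\t\r\n    ',
       u' - By David Mertz; Addison Wesley. Book in progress.\r\n    ',
       u'\r\n    ']
print(clean(raw))  # -> - By David Mertz; Addison Wesley. Book in progress.
```

str.split() with no argument eats every kind of whitespace at once, so you
don't need separate replace() calls for \r\n and \t.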

On Sat, Aug 8, 2015 at 2:33 PM, Malik Rumi <malik.a.r...@gmail.com> wrote:

> Because it is so rare, I always try to take the time to point out when
> someone actually answers the question I ask, and thank them for it. So
> thank you for not only answering but providing reasonable explanations for
> the difficulties I encountered. PLUS it was really fast!  ☺
>
> I do have a minor quibble with your description of my resilience, because
> I am self-taught in coding and have come quite a long way through a lot of
> challenges, but there's no way for you to know that, and anyway it's not a
> huge deal. Then why bring it up? To honor the fact that I *have* been
> sticking this out. In other words, for me, not you or anyone else reading
> this.
>
> You pose an interesting challenge to get involved, and that's both good
> and valid.  More people should issue such challenges as well as take them
> up.
>
> On Sat, Aug 8, 2015 at 1:59 PM, Travis Leleu <m...@travisleleu.com> wrote:
>
>> It's possible DMOZ updated their HTML layout since the tutorial was
>> written.  It's also possible the underlying libraries changed how they
>> process or strip text.  Most likely, the tutorial was written for scrapy
>> v0.2x, and you're using a more recent version.
>>
>> Really, though, you're going to need to adopt a more resilient attitude
>> in order to succeed at data scraping.  Trust your judgement -- you have
>> strings with whitespace?  Call str.strip().  You have \r\n to remove?
>> Call str.replace('\r\n', '').
>>
>> The scrapy docs are of varying quality.  The development group has been
>> pushing to get 1.0 out the door, so some things have changed and they
>> haven't gotten the opportunity to update the documentation.
>>
>> I completely agree that it's very frustrating when you're trying to learn
>> something and the documentation doesn't match what you see.  A big part of
>> learning is the feedback between trying something, and comparing to an
>> established / documented result.
>>
>> This is a great opportunity for you to contribute back to a project that
>> you derive value from.  I see many posts on this list asking how to get
>> involved -- perhaps this can be a call to action for anyone interested.
>> Updating the tutorial is probably the single most important thing for the
>> entire project because that's where most users dip their toe in the water
>> to test it out.
>>
>> A bad first experience likely discourages many first time users.
>>
>> On Sat, Aug 8, 2015 at 11:36 AM, Malik Rumi <malik.a.r...@gmail.com>
>> wrote:
>>
>>> Here is my code:
>>>
>>> import scrapy
>>>
>>> from tutorial.items import DmozItem
>>>
>>> class DmozSpider(scrapy.Spider):
>>>     name = "dmoz"
>>>     allowed_domains = ["dmoz.org"]
>>>     start_urls = [
>>>         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
>>>         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//ul/li'):
>>>             item = DmozItem()
>>>             item['title'] = sel.xpath('a/text()').extract()
>>>             item['link'] = sel.xpath('a/@href').extract()
>>>             item['desc'] = sel.xpath('text()').extract()
>>>             yield item
>>>
>>> Here is the code from the tutorial:
>>>
>>> import scrapy
>>> from tutorial.items import DmozItem
>>> class DmozSpider(scrapy.Spider):
>>>     name = "dmoz"
>>>     allowed_domains = ["dmoz.org"]
>>>     start_urls = [
>>>         "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
>>>         "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for sel in response.xpath('//ul/li'):
>>>             item = DmozItem()
>>>             item['title'] = sel.xpath('a/text()').extract()
>>>             item['link'] = sel.xpath('a/@href').extract()
>>>             item['desc'] = sel.xpath('text()').extract()
>>>             yield item
>>>
>>>
>>> I can't see any difference here, but the result shown in the tutorial is:
>>>
>>> [scrapy] DEBUG: Scraped from <200 
>>> http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
>>>      {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full 
>>> text, ASCII format. Asks for feedback. [author website, Gnosis Software, 
>>> Inc.]\n'],
>>>       'link': [u'http://gnosis.cx/TPiP/'],
>>>       'title': [u'Text Processing in Python']}
>>> [scrapy] DEBUG: Scraped from <200 
>>> http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
>>>      {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 
>>> 0130211192, has CD-ROM. Methods to build XML applications fast, Python 
>>> tutorial, DOM and SAX, new Pyxie open source XML processing library. 
>>> [Prentice Hall PTR]\n'],
>>>       'link': 
>>> [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
>>>       'title': [u'XML Processing with Python']}
>>>
>>>
>>> But my result looks like this:
>>>
>>> 2015-08-08 13:14:55 [scrapy] DEBUG: Scraped from <200
>>> http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
>>> {'desc': [u'\r\n\t\r\n                                ',
>>>           u' \r\n\t\t\t\r\n                                - By David
>>> Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for
>>> feedback. [author website, Gnosis Software,
>>> Inc.]\r\n
>>> \r\n                                ',
>>>           u'\r\n                                '],
>>>  'link': [u'http://gnosis.cx/TPiP/'],
>>>  'title': [u'Text Processing in Python']}
>>> 2015-08-08 13:14:55 [scrapy] DEBUG: Scraped from <200
>>> http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
>>> {'desc': [u'\r\n\t\r\n                                ',
>>>           u' \r\n\t\t\t\r\n                                - By Sean
>>> McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to
>>> build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open
>>> source XML processing library. [Prentice Hall
>>> PTR]\r\n                                \r\n
>>> ',
>>>           u'\r\n                                '],
>>>  'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'
>>> ],
>>>  'title': [u'XML Processing with Python']}
>>>
>>>
>>> Actually, my result is *worse* than this. I just gave you a snippet to
>>> match what is in the tutorial. But actually, *I have the whole dmoz
>>> page with LOTS and LOTS of newlines, whitespace, and so on.*
>>>
>>> The tutorial does not say anything about running strip() or something
>>> like it, so how did they get this result and I got what I got? Further, the
>>> tutorial says:
>>>
>>> After inspecting the page source, you’ll find that the web site’s
>>>> information is inside a <ul> element, in fact the *second* <ul>
>>>> element.
>>>>
>>>
>>> When I look at the source, the information is in the *fourth* <ul>
>>> element. Maybe I can't count, maybe the writers of the tutorial can't
>>> count, or maybe the page has changed, but I can't see how the change from
>>> 2nd to 4th alone would account for all this whitespace.
>>>
>>> I tried indexing to see if that would narrow the result:
>>>
>>>         for sel in response.xpath('//ul[4]/li'):
>>>
>>> but that and [3] got me no data. [2] got me the same data as no index
>>> reference at all.
>>>
>>> So if someone can help me understand why I got all this whitespace, \t,
>>> \n, and \r, and how to eliminate them, I would be very happy.
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to scrapy-users+unsubscr...@googlegroups.com.
>>> To post to this group, send email to scrapy-users@googlegroups.com.
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>
>
