Re: CrawlSpider fails to follow rule for some websites

Aru Sahni Wed, 05 Nov 2014 06:55:57 -0800

You can just invoke BeautifulSoup as one normally would and not use
Scrapy's built-in functionality.
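
For instance, here is a minimal sketch of that approach, assuming the bs4
and html5lib packages are installed. The helper name extract_links_bs4 and
the example.com URLs are made up for illustration; they are not part of
Scrapy.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links_bs4(html, base_url, parser="html5lib"):
    """Return absolute URLs for every <a href=...> found in html.

    Hypothetical helper, not a Scrapy API. html5lib is the default parser
    here because it repairs broken markup the way browsers do.
    """
    soup = BeautifulSoup(html, parser)
    # Resolve relative hrefs against the URL the page was fetched from.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # Made-up page: two <html> documents glued together, similar in spirit
    # to the malformed page discussed in this thread.
    page = (
        '<html><body><a href="/a">A</a></body></html>'
        '<html><body><a href="http://example.com/b">B</a></body></html>'
    )
    for url in extract_links_bs4(page, "http://example.com/dir/"):
        print(url)
```

In a spider callback you would pass response.body and response.url to a
helper like this and yield a Request for each returned URL, rather than
relying on the Rule's link extractor.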


~A

On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia <[email protected]>
wrote:

> Bingo, that's it, you are great.
> So the problem is what comes out of Selector(response), because
> response itself contains the entire malformed HTML (as it should).
>
> I tried a little test, feeding the malformed HTML to BeautifulSoup: the
> lxml parser still fails, while html5lib parses it correctly. So, the question is:
> how do I use html5lib's parser instead of lxml in Scrapy? The
> documentation
> <http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml>
> tells me that "you can easily use BeautifulSoup
> <http://www.crummy.com/software/BeautifulSoup/> (or lxml <http://lxml.de/>)
> instead", but it doesn't say how :-)
>
> Finally: I'd dare to say that this is a bug and it should be reported as
> such. If any browser and html5lib can parse the page, then so should
> Scrapy. Do you think I should submit it on the GitHub page?
>
> Thanks, you have already been very helpful!
> Michele C
>
>
>
>
> On Wednesday, 5 November 2014 06:20:26 UTC-5, Rocío Aramberri
> wrote:
>>
>> Hi Michele,
>>
>> I've investigated your problem further, and it looks like the HTML
>> at http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see
>> here what part of the HTML is really reaching extract_links:
>> http://pastebin.com/6kTT5Amt (there is an </html> at the end of it). This
>> page has 4 html definitions.
>>
>> Hope this helps,
>> Kind Regards,
>> Rocio
>>
>> On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]>
>> wrote:
>>
>>>
>>> By doing some debugging in ipdb I found out that the extract_links
>>> function in the class LxmlLinkExtractor is not getting the same data I
>>> see in the scrapy shell. In the scrapy shell I see the correct data
>>> inside the <body> tag, but when I look at the html variable in
>>> extract_links I see:
>>>
>>> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>>>
>>> I *know* that both the scrapy shell and my script are getting the very
>>> same data from the server (checked with wireshark). So somewhere in between
>>> the fetching of the data and the extract_links function, the content of the
>>> body disappears.
>>>
>>> Can someone who knows the source code tell me which function
>>> calls LxmlLinkExtractor's extract_links?
>>>
>>> Thanks!
>>> Michele C
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "scrapy-users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/scrapy-users.
>>> For more options, visit https://groups.google.com/d/optout.
>>>

