Scrapy crawls depth-first by default because it uses a LIFO queue. I don't believe it's decided on a per-site basis; rather, it's the order in which links are discovered. So if Scrapy grabs a page that has the following links:

    site1.com/link
    site1.com/link_2
    site2.com/
    site1.com/link_3

I believe that is the order in which it will make the requests.
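As an aside: if you ever want breadth-first order instead, the Scrapy FAQ covers swapping the scheduler's LIFO queues for FIFO ones via settings. A minimal sketch, with the values as documented for recent Scrapy releases (older versions spell the module path differently):

    # settings.py -- crawl breadth-first instead of the default depth-first
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'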
> Does it happen sequentially or in parallel ?

Neither, although I think it's more correct to say "in parallel". Scrapy uses the Twisted asynchronous library. Scrapy is single-threaded, so it's only ever running one piece of code at a time. But because HTTP requests are extremely I/O-bound, Python can go and do other things in your code as long as you don't block on I/O.

Put another way: Scrapy hands a request to Twisted, which makes the request to the remote server. Instead of doing nothing while waiting for the response (usually hundreds of milliseconds, or more -- an eternity), your code can do other things, like make more HTTP requests, or process the responses to previous ones! When a request returns from the remote server, the response is sent back to the callback function. Your code then has control again and can process the response.
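To make that callback flow concrete, here's a minimal sketch of a spider (hypothetical names and URLs, written against the current Scrapy API): parse() yields Requests without waiting on any of them, and Twisted invokes the named callback as each response arrives.

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["http://site1.com/"]

        def parse(self, response):
            # Yielding a Request never blocks: it just hands the URL to the
            # scheduler. Twisted fetches the pages concurrently and calls
            # parse_item as each response comes back, in whatever order.
            for href in response.css("a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_item)

        def parse_item(self, response):
            # Runs later, once this particular response has arrived.
            yield {"url": response.url,
                   "title": response.css("title::text").extract_first()}

While any one of those requests is in flight, Scrapy is free to run parse() or parse_item() on whatever has already returned -- single-threaded, but never idle.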
Asynchronous calls to I/O-bound tasks (anything involving the network) are hugely efficient compared to typical blocking Python techniques. Async is so important to Python -- because of the GIL (https://wiki.python.org/moin/GlobalInterpreterLock) -- that Python 3 now includes a low-level async library (asyncio) as part of the standard library.

This is correct:

> For each phrase, all links will be extracted, and for SearchResults the
> spider will only follow such links until it has reached them all.
> If a link matching the RecordDetails pattern is seen on the website, the
> spider will apply the 'found_items' method for further processing.

Anyway, it sounds like you have a pretty good grasp of Scrapy, especially for a beginner. Best of luck to you!

On Fri, Oct 17, 2014 at 5:45 AM, Szymon Roziewski <szymon.roziew...@gmail.com> wrote:

> Hi scrapy people,
>
> I am quite new to Scrapy. I have written one script which works, and I am
> developing it further.
>
> Could you please explain one thing to me?
>
> If I have code like this:
>
>     rules = [
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/SearchResults",)),
>              follow=True),
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/RecordDetails",)),
>              callback='found_items'),
>     ]
>
> what actually happens?
>
> For each phrase, all links will be extracted, and for SearchResults the
> spider will only follow such links until it has reached them all.
>
> If a link matching the RecordDetails pattern is seen on the website, the
> spider will apply the 'found_items' method for further processing.
>
> The thing here is about task scheduling.
>
> Does it happen sequentially or in parallel?
>
> I mean, does the spider scrape some data from a site matching the
> RecordDetails pattern, and only after all items are scraped switch to
> following another link and scraping that?
>
> This is somewhat automagical. How does Scrapy know what to do first --
> scrape or follow?
>
> Is it a sequential job:
>
>     following one site -> scraping all content
>     following second site -> scraping all content
>
> Or is there some parallelization, like:
>
>     following one site -> scraping all content & following second site ->
>     scraping all content
>
> I would like to make it work the latter way if it doesn't already.
>
> The question is: how could I do it?
>
> Regards,
> Szymon Roziewski