Hi Travis,

Thanks for the reply.

I am thinking about how to use Scrapy efficiently.

You've written that Scrapy uses Twisted. Is that the case by default, or do 
I need to write a special chunk of code, like here 
<http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script>, 
to run Scrapy inside the Twisted reactor?
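
To make it concrete, here is a minimal sketch of the script variant I have 
in mind, using CrawlerProcess from recent Scrapy versions (the linked page 
also shows a lower-level variant that drives the reactor directly; 'ecolex' 
is assumed to be my spider's name):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# CrawlerProcess starts and stops the Twisted reactor by itself,
# so the script needs no explicit reactor handling.
process = CrawlerProcess(get_project_settings())
process.crawl('ecolex')  # look up the spider by name in the project
process.start()          # starts the reactor; blocks until the crawl ends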

And there is one more thing I am concerned about.

How do I handle PDF downloads while scraping HTML content?

I need to scrape information from the pages; mostly it is text, but 
sometimes there is also a PDF/CSV/TXT file.

Can I do this with special pipeline methods inside MyPipeline.py?

Right now I just write the text data to a JSON file:

import codecs
import json

from scrapy.exceptions import DropItem


class EcolexPipeline(object):

    def __init__(self):
        self.file = codecs.open('ecolex.json', 'w', encoding='utf-8')
        self.ids_seen = set()  # ids of items written so far

    def process_item(self, item, spider):
        # Check for duplicates first, so they never reach the file.
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() on pipelines when the spider
        # finishes, so no manual signal wiring is needed.
        self.file.close()
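
For completeness, the pipeline is enabled in settings.py like this (a 
sketch; 'ecolex' is assumed to be the project module name):

# settings.py
ITEM_PIPELINES = {
    'ecolex.pipelines.EcolexPipeline': 300,  # lower numbers run earlier
}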


And in the meantime, how can I save the PDF files to a dedicated directory?
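
From the media-pipelines documentation it looks as though the built-in 
FilesPipeline could already do this. Here is a minimal, untested sketch of 
what I have in mind (the file_urls/files field names are that pipeline's 
convention; the import path may differ between Scrapy versions):

# settings.py
ITEM_PIPELINES = {
    'ecolex.pipelines.EcolexPipeline': 300,
    'scrapy.contrib.pipeline.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/pdf/store'  # target directory for downloaded files

# in the spider callback, in addition to the text fields
# (the Item class must declare file_urls and files fields):
item['file_urls'] = [pdf_url]  # FilesPipeline downloads each URL listed here
# after downloading, the pipeline fills item['files'] with file metadata

Is that the intended way, or would a custom method in my own pipeline be 
better here?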

Best wishes,
Szymon Roziewski



On Friday, 17 October 2014 14:45:32 UTC+2, Szymon Roziewski wrote:
>
> Hi scrapy people,
>
> I am quite new to Scrapy. I have written one script that works, and I am 
> developing it further.
>
> Could you please explain one thing to me?
>
> If I have code such as this:
>     rules = [
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/SearchResults",)),
>              follow=True),
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/RecordDetails",)),
>              callback='found_items'),
>     ]
>
> what actually happens?
>
> My understanding is that all links are extracted from each page, and for 
> links matching SearchResults the spider simply follows them until it has 
> visited them all.
>
> If a link matching the RecordDetails pattern is found, the spider applies 
> the 'found_items' callback for further processing.
>
> What I am asking about is the task scheduling here.
>
> Does it happen sequentially or in parallel?
>
> I mean: does the spider scrape data from a page matching RecordDetails 
> and, only after all items there are scraped, switch to following another 
> link?
>
> This seems automagical. How does Scrapy know what to do first, scrape or 
> follow?
>
> Is it a sequential job:
>
> following one site -> scraping all content
> following second site -> scraping all content
>
> Or is there some parallelization, like: 
> following one site -> scraping all content & following second site -> 
> scraping all content
>
> If it does not already work the latter way, I would like to make it do so.
>
> The question is: how could I do that?
>
> Regards,
> Szymon Roziewski
>
>
>
