Hi Travis,

Thanks for the reply.
I am thinking about how to use scrapy efficiently. You wrote that scrapy uses Twisted. Is that the case by default, or do I need to write a special chunk of code like the one at <http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script> to run scrapy inside the Twisted reactor?

There is one more thing I am concerned with: how to deal with PDF downloads while scraping HTML content. I need to scrape some knowledge from each page; among it there is text and sometimes a pdf/csv/txt file. Can I do it with special pipeline methods inside MyPipeline.py?

For now I just write the text data to a JSON file:

    import codecs
    import json

    from scrapy.exceptions import DropItem

    class EcolexPipeline(object):

        def __init__(self):
            self.file = codecs.open('ecolex.json', 'w', encoding='utf-8')
            self.ids_seen = set()

        def process_item(self, item, spider):
            # drop duplicates before writing, so they never reach the file
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            self.ids_seen.add(item['id'])
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.file.write(line)
            return item

        def close_spider(self, spider):
            # scrapy calls close_spider on pipelines automatically
            self.file.close()

How can I, at the same time, put the PDF files into a dedicated directory?

Best wishes,
Szymon Roziewski

On Friday, 17 October 2014 14:45:32 UTC+2, Szymon Roziewski wrote:
>
> Hi scrapy people,
>
> I am quite new to scrapy. I have one script which works, and I am
> developing it further.
>
> Could you please explain one thing to me?
>
> If I have code such as:
>
>     rules = [
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/SearchResults", )),
>              follow=True),
>         Rule(LxmlLinkExtractor(allow=("ecolex/ledge/view/RecordDetails", )),
>              callback='found_items'),
>     ]
>
> what actually happens?
>
> For each page, all links are extracted, and for SearchResults the spider
> would only follow such links until it reaches all of them.
>
> If a link matching the RecordDetails pattern is seen on the website, the
> spider would apply the method 'found_items' for further processing.
>
> My question is about task scheduling here.
>
> Does it happen sequentially or in parallel?
> I mean, does the spider scrape some data from a page matching
> RecordDetails and, after all items are scraped, switch to following
> another link and scrape again?
>
> This is something automagical. How does scrapy know what to do first,
> to scrape or to follow?
>
> Is it a sequential job:
>
>     following one site -> scraping all content
>     following second site -> scraping all content
>
> or is there some parallelization, like:
>
>     following one site -> scraping all content & following second site ->
>     scraping all content
>
> I would like to make it the latter style, if it is not like this already.
>
> The question is: how could I do it?
>
> Regards,
> Szymon Roziewski

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
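[Editor's note on the Twisted question above: when a spider is launched with the normal `scrapy crawl` command, scrapy starts and manages the Twisted reactor itself; the snippet on the practices page is only needed when embedding a crawl inside your own Python script. A minimal sketch of that embedded style, using `CrawlerProcess`, which also handles the reactor for you. `DemoSpider`, `run`, and the URL are placeholders, not names from the thread:]

    # Running a spider from a plain Python script. CrawlerProcess
    # starts and stops the Twisted reactor itself, so no extra
    # reactor-handling code is needed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.spiders import Spider

    class DemoSpider(Spider):
        # placeholder spider; substitute your own spider class
        name = "demo"
        start_urls = ["http://example.com"]

        def parse(self, response):
            yield {"title": response.css("title::text").get()}

    def run():
        process = CrawlerProcess({"LOG_LEVEL": "INFO"})
        process.crawl(DemoSpider)
        process.start()  # blocks until the crawl finishes

[Calling `run()` starts the crawl; everything before it only defines the spider.]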
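[Editor's note on the PDF/CSV question: rather than downloading files inside a hand-written pipeline, scrapy ships a built-in `FilesPipeline` that fetches every URL placed in an item's `file_urls` field and stores the result under the `FILES_STORE` directory, filling in a `files` field with the outcome. A sketch of the wiring; the item class name, field set, and directory are examples, and the `scrapy.pipelines.files` module path is the one used by recent scrapy versions:]

    # Item with the two fields the built-in FilesPipeline expects:
    # put PDF/CSV/TXT URLs into file_urls; the pipeline downloads
    # them and records the results in files.
    import scrapy

    class EcolexItem(scrapy.Item):
        id = scrapy.Field()
        text = scrapy.Field()
        file_urls = scrapy.Field()  # URLs for FilesPipeline to fetch
        files = scrapy.Field()      # filled in by FilesPipeline

    # settings.py fragment enabling the pipeline (example values)
    FILES_SETTINGS = {
        "ITEM_PIPELINES": {
            "scrapy.pipelines.files.FilesPipeline": 1,
        },
        "FILES_STORE": "downloads/ecolex",  # target directory
    }

[This runs alongside a JSON-writing pipeline: the text data still goes through `process_item`, while the files land in `FILES_STORE` without any extra code.]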
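[Editor's note on the quoted scheduling question: scrapy does not finish one page before starting the next. Requests from both rules go into one queue and are downloaded concurrently by the Twisted reactor, so "following" and "scraping" are interleaved, i.e. the latter style the author asks for is already the default. The degree of parallelism is bounded by settings; a sketch with illustrative values (16 and 8 are the documented scrapy defaults):]

    # Settings that bound scrapy's request-level parallelism.
    CONCURRENT_SETTINGS = {
        "CONCURRENT_REQUESTS": 16,             # global cap (default)
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,   # per-domain cap (default)
        "DOWNLOAD_DELAY": 0.25,                # example politeness delay
    }
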