Re: Roundabout way of scraping dynamic content.

Nikolaos-Digenis Karagiannis Tue, 29 Apr 2014 03:42:13 -0700

Bill implies that you will have to yield those Ajax requests yourself
(though that misses the point of "dynamic"). Nothing stands in your way to
do this, provided you know the headers and body of the request.
Regarding the "hidden" information, the top-level scrapy package cannot
reach it by itself. XPath selectors (scrapy.selector.Selector) can:
Selector(response).xpath("//comment()")
When constructing XPath expressions that describe elements, open the files
with a plain text editor and not a browser, which may alter the HTML to
comply with the standard and/or evaluate any leftover JavaScript.



On Monday, 28 April 2014 20:35:48 UTC+3, tom wrote:
>
> Hey Bill. 
>
> I found what I think are the articles discussing the nasa image/scrapy.
> Yeah, it's not really running a headless browser at all. It's
> "simulating" a piece of what the javascript returns from that given
> page. But for a complex dynamic site, it still doesn't act as a "real"
> headless browser.
>
> thanks 
>
>
> On Mon, Apr 28, 2014 at 1:30 PM, bruce <bado...@gmail.com> wrote:
> > bill... 
> > 
> > not sure that's the same... ie, I don't think scrapy has a way to 
> > "wait" for an element to show up on a given page, based on the 
> > underlying ajax functions... 
> > 
> > I had talked to Pablo about this a while ago and he was saying scrapy
> > couldn't handle this. Are you saying it now can??
> > 
> > This would be cool if it really can. 
> > 
> > 
> > On Mon, Apr 28, 2014 at 1:13 PM, Bill Ebeling <bille...@gmail.com> wrote:
> >> Scrapy sends a request to the ajax address just like it does for the
> >> normal webpage. You maintain data from one request to the other with
> >> the meta dict.
> >>
> >> There was a tutorial on it a while back about scraping the nasa website
> >> for its pic of the day. Can't seem to find it now, though. If you take a
> >> look at the link above, you can read all about it.
> >> 
> >> 
> >> On Mon, Apr 28, 2014 at 1:01 PM, bruce <bado...@gmail.com> wrote:
> >>> 
> >>> I didn't think scrapy had the ability to run remote ajax, similar to
> >>> casperjs/phantom/nodejs...
> >>>
> >>> Does scrapy run a headless browser process to accomplish this??
> >>> 
> >>> thanks 
> >>> 
> >>> 
> >>> On Mon, Apr 28, 2014 at 10:17 AM, Bill Ebeling <bille...@gmail.com> wrote:
> >>> > Hey Mitch, 
> >>> > 
> >>> > At the risk of stating the obvious, Scrapy handles dynamic content
> >>> > quite well. The general approach is to scrape the page, submit
> >>> > requests for the ajax, stitch the item together, and submit it to the
> >>> > pipeline.
> >>> > 
> >>> > That said, it's not complicated, but not trivial, either. 
> >>> > 
> >>> > To your specific point, the solution is either to regex it out, or to
> >>> > start fiddling with the underlying html. I would not personally
> >>> > download someone else's page and then put it on a server, since the js
> >>> > is still going to be running and logging things and all that.
> >>> > 
> >>> > If you want to look into writing a crawler that gets the dynamic
> >>> > content, start here:
> >>> > http://doc.scrapy.org/en/latest/topics/request-response.html
> >>> > and pay special attention to the meta dict.
> >>> > 
> >>> > If you want more help with the specific site, provide a link so we can
> >>> > see it.
> >>> > 
> >>> > Hope that helps. 
> >>> > 
> >>> > -- 
> >>> > You received this message because you are subscribed to the Google
> >>> > Groups "scrapy-users" group.
> >>> > To unsubscribe from this group and stop receiving emails from it, send
> >>> > an email to scrapy-users...@googlegroups.com.
> >>> > To post to this group, send email to scrapy...@googlegroups.com.
> >>> > Visit this group at http://groups.google.com/group/scrapy-users.
> >>> > For more options, visit https://groups.google.com/d/optout.
>

