I have been struggling with this for quite some time, so I am finally giving
up and asking here.
I have a page that uses an iframe which is created entirely by JS (it uses
SAPUI5, and there is no URL to create it from).
When I first request the page, the body is:
<body class="sapUiBody" role="application">
  <div id="ctrRoot"></div>
</body>
First, JS executes and creates:
<body class="sapUiBody" role="application" style="margin: 0px;">
  <div id="ctrRoot" data-sap-ui-area="ctrRoot">
    <div id="__shell0" data-sap-ui="__shell0" class="sapDkShell sapUiUx3Shell sapUiUx3ShellDesignStandard sapUiUx3ShellFullHeightContent sapUiUx3ShellHeadStandard sapUiUx3ShellNoContentPadding">
      ... Lots of crap here ...
    </div>
  </div>
</body>
Eventually, the following gets added inside the "... Lots of crap here ..."
section, nested within many <div> tags:
<div id="demokitSplitter_secondPane" class="sapUiVSplitterSecondPane" style="overflow: hidden; width: 79.7396%;">
  <iframe id="content" name="content" src="about:blank" frameborder="0" onload="sap.ui.demokit.DemokitApp.getInstance().onContentLoaded();" data-sap-ui-preserve="content">
  </iframe>
</div>
This is the part that has the iframe.
Eventually, the iframe is filled in with this content:
<div id="demokitSplitter_secondPane" class="sapUiVSplitterSecondPane" style="overflow: hidden; width: 79.7396%;">
  <iframe id="content" name="content" src="about:blank" frameborder="0" onload="sap.ui.demokit.DemokitApp.getInstance().onContentLoaded();" data-sap-ui-preserve="content">
    <html xml:lang="en" lang="en" data-highlight-query-terms="pending">
      <body>
        <div id="main">
          <div id="content">
            <div class="full-description">
            </div>
            <div class="summary section">
              <div class="sectionItems">
                <div class="sectionItem itemName namespace static">
                  <b class="icon" title="Analysis Path Framework">
                    <a href="test.html">test</a>
                  </b>
                  <span class="description">Analysis Path Framework</span>
                </div>
                <div class="sectionItem itemName namespace static">
                  <b class="icon" title="Test2">
                    <a href="test.html">test2</a>
                  </b>
                  <span class="description">Test2</span>
                </div>
              </div>
            </div>
          </div>
        </div>
      </body>
    </html>
  </iframe>
</div>
What I need to get access to:
<div class="sectionItems">
And cycle through all these:
<div class="sectionItem itemName namespace static">
<div class="sectionItem itemName namespace static">
I can't seem to get my PhantomJS downloader to work.
I have tried all of the following to wait for that content to appear:
import codecs
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def _response(self, _, driver, spider):
    # Dump the page as first received; the iframe may still be empty here.
    print('PhantomJSDownloadHandler _response writing first.html, possibly empty html (due to AJAX) %s' % (time.asctime(time.localtime(time.time()))))
    target = codecs.open('first.html', 'w', "utf-8")
    target.truncate()
    target.write(driver.page_source)
    target.close()
    try:
        print('PhantomJSDownloadHandler waiting for sectionTitles %s' % (time.asctime(time.localtime(time.time()))))
        max_time_to_wait_sec = 20
        time_between_polls_milli = 2
        #element = WebDriverWait(driver, max_time_to_wait_sec, time_between_polls_milli).until(EC.presence_of_element_located((By.CLASS_NAME, "sectionItems")))
        #element = WebDriverWait(driver, max_time_to_wait_sec).until(EC.presence_of_element_located((By.CLASS_NAME, "sapUiVSplitterSecondPane")))
        #element = self.driver.find_elements_by_xpath('//div[@class="sectionItems"]')
        #element = self.driver.find_elements_by_xpath('//iframe')
        #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.visibility_of(element))
        #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.frame_to_be_available_and_switch_to_it(By.id("content")))
        # The Selenium constant is By.ID (uppercase); By.id raises AttributeError.
        WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "content")))
        #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.visibility_of_element_located(By.CLASS_NAME, "sectionItems"))
Some posts on Stack Overflow suggest this approach:
http://stackoverflow.com/questions/25057174/scrapy-crawl-in-order
def parse(self, response):
    for link in response.xpath("//article/a/@href").extract():
        yield Request(link, callback=self.parse_page, meta={'link': link})

def parse_page(self, response):
    for frame in response.xpath("//iframe").extract():
        item = MyItem()
        item['link'] = response.meta['link']
        item['frame'] = frame
        yield item
But this looks like it follows a link (URL), whereas my iframe has
src="about:blank" and is filled in by a JS function, not loaded from a URL.
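For reference, this is roughly what I think the downloader has to do, since
there is no URL to follow: enter the iframe, wait until an element with class
sectionItems exists inside it, and only then read the HTML. It is an untested
sketch, not working code. The helper name wait_for_iframe_html and its
parameters are made up for illustration, it polls by hand instead of using
WebDriverWait so that it needs nothing beyond the standard library, it treats
any exception as "frame not ready yet", and it assumes that page_source
returns the source of the currently selected frame.

```python
import time


def wait_for_iframe_html(driver, frame_id="content", marker_class="sectionItems",
                         timeout=20.0, poll=0.2):
    """Return the iframe's HTML once it contains an element with marker_class.

    Hypothetical helper: `driver` is any Selenium WebDriver (PhantomJS here).
    """
    deadline = time.time() + timeout
    while True:
        try:
            # Always restart from the top-level document, then enter the frame.
            driver.switch_to.default_content()
            driver.switch_to.frame(frame_id)  # accepts the frame's id or name
            # "class name" is the string value of Selenium's By.CLASS_NAME.
            if driver.find_elements("class name", marker_class):
                # Assumption: page_source reflects the selected frame.
                html = driver.page_source
                driver.switch_to.default_content()
                return html
        except Exception:
            # Assumption: any error here means the frame is not ready yet.
            pass
        if time.time() >= deadline:
            raise RuntimeError("timed out waiting for %r inside iframe %r"
                               % (marker_class, frame_id))
        time.sleep(poll)
```

With Selenium's own helpers, the same two steps would be
WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "content")))
followed by
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.CLASS_NAME, "sectionItems"))).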
Now, assuming someone can help me get the downloader to wait until the
sectionItems div is available, I then need to iterate through those results
in Scrapy. I have this code written:
# Working, finds first SectionsItems
print('checking for <div class="sectionItems">')
sectionItems = namespace.xpath(".//div[@class='summary section']/div[@class='sectionItems']")
#sections = hxs.xpath("//div[@class='sectionItem']")
#sections = hxs.xpath("//div[contains(@class, 'sectionItem itemName namespace static')]")
#sections = hxs.xpath("//<div class="sectionTitle">Namespaces & Classes</div>/div[@class='sectionItems']")
print('xpath SectionItems:%s' % sectionItems)
for sectionItem in sectionItems:
    print('Found SectionItem:')
    #sections = sectionItem.xpath("div[@class='sectionItem']")
    sections = sectionItem.xpath("div[re:test(@class, 'sectionItem')]")
    #sections = sectionItem.xpath("div[re:test(@class, 'sectionItem itemName namespace static')]")
    for section in sections:
        print('Found Section:%s' % (section.extract()))
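To show what I am after, here is a small sketch that runs on a trimmed copy of
the iframe markup above. It uses xml.etree.ElementTree from the standard
library as a stand-in for Scrapy selectors, which only works because this
snippet happens to be well-formed XML. The names has_class and
extract_section_items are made up for illustration. It also shows that an
exact comparison such as @class='sectionItem' matches nothing, because the
attribute value is the whole string 'sectionItem itemName namespace static'.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the iframe markup from the post (well-formed, so ET can parse it).
SNIPPET = """
<div class="summary section">
  <div class="sectionItems">
    <div class="sectionItem itemName namespace static">
      <b class="icon" title="Analysis Path Framework"><a href="test.html">test</a></b>
      <span class="description">Analysis Path Framework</span>
    </div>
    <div class="sectionItem itemName namespace static">
      <b class="icon" title="Test2"><a href="test.html">test2</a></b>
      <span class="description">Test2</span>
    </div>
  </div>
</div>
"""


def has_class(element, name):
    """True if `name` is one of the space-separated tokens in the class attribute."""
    return name in element.get("class", "").split()


def extract_section_items(markup):
    """Return one dict per sectionItem found under a sectionItems container."""
    root = ET.fromstring(markup)
    items = []
    for container in root.iter("div"):
        if not has_class(container, "sectionItems"):
            continue
        for child in container.findall("div"):
            if has_class(child, "sectionItem"):
                link = child.find(".//a")
                desc = child.find("span[@class='description']")
                items.append({
                    "name": link.text,
                    "href": link.get("href"),
                    "description": desc.text,
                })
    return items


# Exact string comparison on @class finds nothing: prints 0.
print(len(ET.fromstring(SNIPPET).findall(".//div[@class='sectionItem']")))
for item in extract_section_items(SNIPPET):
    print(item)
```

In a Scrapy XPath, the usual way to match a single class token is
div[contains(concat(' ', normalize-space(@class), ' '), ' sectionItem ')].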
Any help is greatly appreciated.
David