I have been struggling with this for quite some time, so I am finally asking
here.

I have a page that contains an iframe created entirely by JavaScript (there
is no URL behind it; the page is built with SAPUI5).

When I first request the page, the body is:
 
<body class="sapUiBody" role="application">
    <div id="ctrRoot"></div>
</body>


First, JS executes and creates:

<body class="sapUiBody" role="application" style="margin: 0px;">
  <div id="ctrRoot" data-sap-ui-area="ctrRoot">
    <div id="__shell0" data-sap-ui="__shell0" class="sapDkShell sapUiUx3Shell sapUiUx3ShellDesignStandard sapUiUx3ShellFullHeightContent sapUiUx3ShellHeadStandard sapUiUx3ShellNoContentPadding">
... Lots of crap here ...
    </div>
  </div>
</body>

 

Eventually, the following gets added inside the "... Lots of crap here ..."
section, nested within many <div> tags:

        <div id="demokitSplitter_secondPane" class="sapUiVSplitterSecondPane" style="overflow: hidden; width: 79.7396%;">
          <iframe id="content" name="content" src="about:blank" frameborder="0" onload="sap.ui.demokit.DemokitApp.getInstance().onContentLoaded();" data-sap-ui-preserve="content">
          </iframe>
        </div>



This is the part that has the iframe.

Eventually, the iframe is populated with a document:

        <div id="demokitSplitter_secondPane" class="sapUiVSplitterSecondPane" style="overflow: hidden; width: 79.7396%;">
          <iframe id="content" name="content" src="about:blank" frameborder="0" onload="sap.ui.demokit.DemokitApp.getInstance().onContentLoaded();" data-sap-ui-preserve="content">
<html xml:lang="en" lang="en" data-highlight-query-terms="pending">
    <body>
        <div id="main">
            <div id="content">
                <div class="full-description">
                </div>
                <div class="summary section">
                    <div class="sectionItems">
                        <div class="sectionItem itemName namespace static">
                            <b class="icon" title="Analysis Path Framework">
                                <a href="test.html">test</a>
                            </b>
                            <span class="description">Analysis Path Framework</span>
                        </div>
                        <div class="sectionItem itemName namespace static">
                            <b class="icon" title="Test2">
                                <a href="test.html">test2</a>
                            </b>
                            <span class="description">Test2</span>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </body>
</html>




          </iframe>
        </div>


What I need to get access to:
                    <div class="sectionItems">


And cycle through all these:
                        <div class="sectionItem itemName namespace static">
                        <div class="sectionItem itemName namespace static">


I can't seem to get my PhantomJS downloader to work.

I have tried all of the following to wait for that content to appear:


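# Imports needed by the handler method below.
import codecs
import time

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


    # Method of my PhantomJSDownloadHandler class.
    def _response(self, _, driver, spider):
        # Dump whatever has rendered so far; may be nearly empty because of AJAX.
        print('PhantomJSDownloadHandler _response writing first.html, '
              'possibly empty html (due to AJAX) %s' % time.asctime())
        with codecs.open('first.html', 'w', 'utf-8') as target:
            target.write(driver.page_source)

        try:
            print('PhantomJSDownloadHandler waiting for sectionTitles %s'
                  % time.asctime())
            max_time_to_wait_sec = 20
            poll_interval_sec = 0.2  # WebDriverWait takes the poll frequency in seconds

            # Earlier attempts, none of which worked:
            #element = WebDriverWait(driver, max_time_to_wait_sec, poll_interval_sec).until(EC.presence_of_element_located((By.CLASS_NAME, "sectionItems")))
            #element = WebDriverWait(driver, max_time_to_wait_sec).until(EC.presence_of_element_located((By.CLASS_NAME, "sapUiVSplitterSecondPane")))
            #element = self.driver.find_elements_by_xpath('//div[@class="sectionItems"]')
            #element = self.driver.find_elements_by_xpath('//iframe')
            #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.visibility_of(element))
            #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.frame_to_be_available_and_switch_to_it(By.id("content")))
            #WebDriverWait(self.driver,20,poll_frequency=.2).until(EC.visibility_of_element_located(By.CLASS_NAME, "sectionItems"))

            # Current attempt. By.ID is the constant in the Python bindings
            # (By.id does not exist), and the driver passed in is used directly.
            WebDriverWait(driver, max_time_to_wait_sec,
                          poll_frequency=poll_interval_sec).until(
                EC.frame_to_be_available_and_switch_to_it((By.ID, "content")))
        except TimeoutException:
            print('PhantomJSDownloadHandler timed out waiting for the iframe %s'
                  % time.asctime())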
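
One likely reason a wait on the iframe alone returns too early: the iframe is
already in the DOM with src="about:blank" before SAPUI5 fills it, so
frame_to_be_available_and_switch_to_it can succeed while the frame is still
empty. Below is a minimal sketch of a two-step wait: enter the frame, then poll
inside it for the sectionItems div. The helper name wait_for_frame_content and
its parameters are hypothetical. It relies only on driver.switch_to,
driver.find_elements and driver.page_source, and it assumes that page_source
reflects the frame's document once the driver has switched into it.

```python
import time


def wait_for_frame_content(driver, frame_name="content",
                           class_name="sectionItems",
                           timeout_sec=20.0, poll_sec=0.2):
    """Poll inside the iframe until an element with class_name exists.

    Returns the page source seen from inside the frame, or None on timeout.
    Hypothetical helper, not part of Selenium or Scrapy.
    """
    deadline = time.time() + timeout_sec
    while True:
        try:
            # Always restart from the top-level document, then enter the frame.
            driver.switch_to.default_content()
            driver.switch_to.frame(frame_name)
            # "class name" is the string value of selenium's By.CLASS_NAME.
            if driver.find_elements("class name", class_name):
                # Assumption: while switched into the frame, page_source
                # is the frame's HTML rather than the outer page's.
                return driver.page_source
        except Exception:
            # Assumption for this sketch: any error means "not ready, retry".
            pass
        if time.time() >= deadline:
            # Leave the driver pointing at the top-level document again.
            driver.switch_to.default_content()
            return None
        time.sleep(poll_sec)
```

In the handler this could be called as html = wait_for_frame_content(driver),
with the returned HTML used to build the response body. Whether PhantomJS
behaves this way for this particular page is untested.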


Some posts on Stack Overflow suggest an approach like this one:

http://stackoverflow.com/questions/25057174/scrapy-crawl-in-order



def parse(self, response):
    for link in response.xpath("//article/a/@href").extract():
        yield Request(link, callback=self.parse_page, meta={'link':link})


def parse_page(self, response):
    for frame in response.xpath("//iframe").extract():
        item = MyItem()
        item['link'] = response.meta['link']
        item['frame'] = frame
        yield item





But that code follows a link (URL) to fetch each frame, whereas my iframe is
filled in by a JS function, not loaded from a URL.



Assuming someone can help me get the downloader to wait until the
sectionItems div is available, I then need to iterate through those results
in Scrapy. I have written this code:


# Working, finds first sectionItems

print('checking for <div class="sectionItems">')
sectionItems = namespace.xpath(".//div[@class='summary section']/div[@class='sectionItems']")
# Earlier attempts:
#sections = hxs.xpath("//div[@class='sectionItem']")
#sections = hxs.xpath("//div[contains(@class, 'sectionItem itemName namespace static')]")
#sections = hxs.xpath("//<div class="sectionTitle">Namespaces &amp; Classes</div>/div[@class='sectionItems']")

print('xpath SectionItems:%s' % sectionItems)

for sectionItem in sectionItems:
    print('Found SectionItem:')
    #sections = sectionItem.xpath("div[@class='sectionItem']")
    sections = sectionItem.xpath("div[re:test(@class, 'sectionItem')]")
    #sections = sectionItem.xpath("div[re:test(@class, 'sectionItem itemName namespace static')]")

    for section in sections:
        print('Found Section:%s' % section.extract())
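
To make the intended iteration concrete, here is a small sketch that runs
against the frame markup quoted earlier in this post. It uses the standard
library's xml.etree.ElementTree as a stand-in for Scrapy selectors so it can
run without a crawler; ElementTree supports only a subset of XPath (no
re:test, no contains), and it only works here because the sample happens to be
well-formed XML. The function name extract_section_items and the dictionary
keys are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the frame document shown earlier in this post.
FRAME_HTML = """
<html xml:lang="en" lang="en" data-highlight-query-terms="pending">
  <body>
    <div id="main">
      <div id="content">
        <div class="summary section">
          <div class="sectionItems">
            <div class="sectionItem itemName namespace static">
              <b class="icon" title="Analysis Path Framework"><a href="test.html">test</a></b>
              <span class="description">Analysis Path Framework</span>
            </div>
            <div class="sectionItem itemName namespace static">
              <b class="icon" title="Test2"><a href="test.html">test2</a></b>
              <span class="description">Test2</span>
            </div>
          </div>
        </div>
      </div>
    </div>
  </body>
</html>
"""


def extract_section_items(html_text):
    """Return one dict per <div class="sectionItem ..."> in the document."""
    root = ET.fromstring(html_text.strip())
    items = []
    # Same container path as the Scrapy XPath above.
    path = ".//div[@class='summary section']/div[@class='sectionItems']"
    for container in root.findall(path):
        for div in container.findall("div"):
            # Match on the class token, so extra classes do not matter.
            if "sectionItem" not in (div.get("class") or "").split():
                continue
            link = div.find("b/a")
            desc = div.find("span[@class='description']")
            items.append({
                "name": link.text if link is not None else None,
                "href": link.get("href") if link is not None else None,
                "description": desc.text if desc is not None else None,
            })
    return items


for item in extract_section_items(FRAME_HTML):
    print(item)
```

With Scrapy selectors the same idea would be a relative .xpath() call per
sectionItem div; the token check is there because @class='sectionItem' alone
does not match an element whose class attribute lists several classes.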




Any help is greatly appreciated.
David
