On Thu, Jun 18, 2009 at 9:04 PM, Vincent Massol <vinc...@massol.net> wrote:
> Hi Niels,
>
> You could easily call $xwiki.getExternalURL(), which returns the
> content at a URL. Then you can use our XHTML parser to generate an
> XDOM and then do whatever you want with it.
>
> The only little issue: the renderer is not available in the xwiki
> content right now. But if you're doing Groovy it should be easy.
>
> For large documents we could easily add a method to the Parser
> interface: parse(Reader, Listener). All you'd need to do is implement
> Listener in a Groovy script, for example, and you'd get called for
> each element in the page.
>
> Thanks
> -Vincent

I agree with Vincent: Groovy is the easiest solution.

In the past I tried another "weird" solution, integrating a server-side
JavaScript engine such as Rhino. Manipulating a DOM in JavaScript was
quite natural and I could use great APIs such as Prototype. It worked
quite well, though I'm not sure about the performance and memory cost;
I just found the idea of JavaScript on the server side amusing. It may
sound a bit "heretical" to say, but there are products on the market
that propose building websites with JavaScript on both the client and
the server side.

> On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
>
> > Is there anything like the Xwiki-feed-plugin except that, instead of
> > fetching a feed, it would fetch an HTML document via HTTP, returning
> > a DOM structure that can be scanned or filtered by API calls, e.g.:
> >
> > $fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
> > $images = $fetchedDom.getImgList()
> > $media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".mv4", ".mp4"])
> > $content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
> >
> > Since this would happen on the server, you'd probably need to "fake"
> > being a real browser (or just capture the user's browser
> > configuration and pass it via the call to the hypothetical
> > "getDocumentDOM()") in order to capture an accurate scraped
> > representation of a modern site.
> >
> > The existing examples I've seen store an Xwiki document in the
> > database first. I was hoping there was an "in memory" option that
> > would keep the document in the app's context just long enough to
> > process the remaining stream of plugin calls such as
> > "getDivListById()" or "getAnchorHREFsByExtension()", and then
> > dispose of the DOM via garbage collection once it is no longer
> > referenced. Then again, compared to the implementation headaches --
> > retrieving a potentially large document into memory incrementally,
> > parsing it into a DOM incrementally, making it available in the
> > context, and so on -- maybe I should just write the damn document
> > into the database, scrape it, and delete it.
> >
> > Since I would use Xwiki to store a JSON "scrape" of the document in
> > the DB (as an Xwiki doc), I could store it in the
> > XWiki.JavaScriptExtension[0] of the retrieved document and then just
> > delete the wiki contents after scraping. So if anybody has
> > suggestions for "scraping" a retrieved document stored as an Xwiki
> > doc, please suggest those as well! This seems like an area
> > potentially fraught with peril that many people have already dealt
> > with, so I would appreciate advice.
> >
> > Thanks,
> >
> > Niels
> > http://nielsmayer.com
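
To make the Groovy route a bit more concrete, here is a rough, untested
sketch of what Vincent describes (and roughly what Niels's hypothetical
getImgList() / getAnchorHREFsByExtension() calls would boil down to):
fetch the page over HTTP, parse it with the XHTML parser into an XDOM,
and walk the block tree. The component lookup, the "xhtml/1.0" hint, the
getChildrenByType() helper and the block classes are written from memory
and may differ between XWiki versions, and real-world HTML may need to go
through an HTML cleaner before the XHTML parser accepts it, so treat this
as an illustration only:

// Untested sketch -- meant to run from a {{groovy}} macro in a wiki page.
import com.xpn.xwiki.web.Utils
import org.xwiki.rendering.parser.Parser
import org.xwiki.rendering.block.ImageBlock
import org.xwiki.rendering.block.LinkBlock

// 1. Fetch the remote page. Sending a browser-like User-Agent is Niels's
//    "fake being a real browser" idea; a plain new URL(url).text also works.
def fetchHtml(String url) {
    def conn = new URL(url).openConnection()
    conn.setRequestProperty('User-Agent', 'Mozilla/5.0 (compatible; XWikiScraper)')
    return conn.inputStream.getText('UTF-8')
}

def html = fetchHtml('http://nielsmayer.com')

// 2. Parse the XHTML into an XDOM. Older releases may want the role as a
//    String (Parser.class.getName()) instead of the Class, and may require
//    cleaning the HTML first -- adjust to the API of your XWiki version.
def parser = Utils.getComponent(Parser.class, 'xhtml/1.0')
def xdom = parser.parse(new StringReader(html))

// 3. Walk the block tree. How you pull the target URL out of a LinkBlock
//    or ImageBlock depends on the rendering API version, so only the block
//    collection itself is shown here.
def images = xdom.getChildrenByType(ImageBlock.class, true)
def links = xdom.getChildrenByType(LinkBlock.class, true)

println "Found ${images.size()} images and ${links.size()} links"

For really large pages, the parse(Reader, Listener) variant Vincent
proposes would avoid holding the whole XDOM in memory: a Groovy
implementation of Listener would simply get called for each element as
the page streams by.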
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs