Hi Niels,

You could simply call $xwiki.getURLContent(), which returns the
content at a URL.
Then you can use our XHTML parser to generate an XDOM and do
whatever you want with it.

Only one small issue: the renderer isn't available from wiki content
right now, but if you're doing it in Groovy it should be easy.
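
For example, something like this should work from a Groovy script
(an untested sketch: the component lookup and the "xhtml/1.0" hint
are assumptions that may need adjusting to your version of the
rendering module):

  import org.xwiki.rendering.block.ImageBlock
  import org.xwiki.rendering.parser.Parser

  // Fetch the remote page as a String.
  def html = xwiki.getURLContent("http://nielsmayer.com")

  // Look up the XHTML parser component and parse the content into
  // an XDOM tree.
  def parser = com.xpn.xwiki.web.Utils.getComponent(Parser.class, "xhtml/1.0")
  def xdom = parser.parse(new StringReader(html))

  // Walk the tree, e.g. collect all the images on the page.
  def images = xdom.getChildrenByType(ImageBlock.class, true)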

For large documents we could easily add a method to the Parser
interface: parse(Reader, Listener). All you'd need to do is
implement Listener, in a Groovy script for example, and you'd get
called back for each element in the page.
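
If we add it, the Groovy side could look something like this (again
an untested sketch: Listener defines a few dozen events, so
everything except onWord would need an empty stub, and exact
signatures may differ in your version; it reuses the parser and html
from the sketch above):

  import org.xwiki.rendering.listener.Listener

  // Collects the words of the page as the parser streams through it.
  class WordCollector implements Listener {
      def words = []

      void onWord(String word) {
          words << word
      }

      // ... empty stubs for the remaining Listener events, omitted
      // here for brevity.
  }

  def collector = new WordCollector()
  // The streaming entry point proposed above (hypothetical for now):
  parser.parse(new StringReader(html), collector)
  println "${collector.words.size()} words on the page"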

Thanks
-Vincent

On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:

> Is there anything like the XWiki feed plugin, except that instead of
> fetching a feed, it would fetch an HTML document via HTTP, returning
> a DOM structure that can be scanned or filtered by API calls, e.g.:
>
> $fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
> $images = $fetchedDom.getImgList()
> $media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
> $content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
>
> Since this would happen on the server, you'd probably need to "fake"
> being a real browser (or just capture the user's browser
> configuration and pass it via the call to the hypothetical
> getDocumentDOM()) in order to capture an accurate scraped
> representation of a modern site.
>
> The existing examples I've seen store an XWiki document in the
> database first. I was hoping there was an "in memory" option that
> would keep the document in the app's context just long enough to
> process the remaining stream of plugin calls such as
> getDivListById() or getAnchorHREFsByExtension(), and then dispose of
> the DOM via garbage collection once it's no longer referenced. Then
> again, compared to the implementation headaches of retrieving a
> potentially large document into memory incrementally, parsing it
> into a DOM incrementally, making that available in the context,
> etc., maybe I should just write the damn document into the database,
> scrape it, and delete it.
>
> Since I would use XWiki to store a JSON "scrape" of the document in
> the DB (as an XWiki doc), I could store it in
> XWiki.JavaScriptExtension[0] of the retrieved document and then just
> delete the wiki contents after scraping... So if anybody has
> suggestions for "scraping" a retrieved document stored as an XWiki
> doc, please suggest those as well! This seems like an area
> potentially fraught with peril that many people have already dealt
> with, so I would appreciate advice.
>
> Thanks,
>
> Niels
> http://nielsmayer.com