On Thu, Jun 18, 2009 at 9:04 PM, Vincent Massol <vinc...@massol.net> wrote:

> Hi Niels,
>
> You could easily call $xwiki.getExternalURL(), which returns the
> content at a URL. Then you can use our XHTML parser to generate an
> XDOM and then do whatever you want with it.
>
> The only small issue: the renderer is not available in the xwiki
> content right now, but if you're doing Groovy it should be easy.
>
> For large documents we could easily add a method to the Parser
> interface: parse(Reader, Listener). All you'd need to do is implement
> Listener in a Groovy script, for example, and you'd get called for
> each element in the page.
>
> Thanks
> -Vincent
>

I agree with Vincent... Groovy is the easiest solution (there's a rough
sketch of that approach at the bottom of this mail).
In the past, I tried another "weird" solution consisting of integrating a
JavaScript engine such as Rhino on the server side. Manipulating a DOM in
JavaScript was quite natural, and I could use great APIs such as
Prototype. It worked quite well, though I'm not sure about the performance
and memory impact, but I found the idea fun: JavaScript on the server
side. It might sound a bit "heretical", but there are some products on
the market proposing to build websites with JavaScript on both the client
and the server side.
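
For the record, embedding Rhino itself only takes a few lines of Groovy, as
in the minimal sketch below (it assumes the Rhino jar is on the classpath);
getting a browser-like DOM and libraries such as Prototype running on top of
it needs an extra DOM emulation layer, which is the part I'm glossing over:

import org.mozilla.javascript.Context
import org.mozilla.javascript.Scriptable

// Enter a Rhino context and evaluate a trivial script on the server side.
def cx = Context.enter()
try {
    Scriptable scope = cx.initStandardObjects()
    // Any JavaScript could run here; a scraping setup would first load a
    // DOM emulation script into this scope, then Prototype, then the page.
    def result = cx.evaluateString(scope, "['a','b','c'].join('-')",
                                   "inline", 1, null)
    println Context.toString(result)   // prints: a-b-c
} finally {
    Context.exit()
}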



>
> On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
>
> > Is there anything like the Xwiki-feed-plugin, except that instead of
> > fetching a feed, it would fetch an HTML document via HTTP, returning
> > a DOM structure that can be scanned or filtered by API calls, e.g.:
> > $fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
> > $images = $fetchedDom.getImgList()
> > $media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".mv4", ".mp4"])
> > $content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
> >
> > Since this would happen on the server, you'd probably need to "fake"
> > being a real browser (or just capture the user's browser configuration
> > and pass it via the call to the hypothetical "getDocumentDOM()") in
> > order to capture an accurate scraped representation of a modern site.
> >
> > The existing examples I've seen store an Xwiki document in the database
> > first. I was hoping there was an "in memory" option that would allow the
> > document to be kept in the app's context just long enough to process the
> > remaining stream of plugin calls such as "getDivListById()" or
> > "getAnchorHREFsByExtension()", and then appropriately dispose of the DOM
> > via garbage collection once it is no longer referenced. Compared to the
> > implementation headaches -- retrieving a potentially large document into
> > memory incrementally, parsing it into a DOM incrementally, making that
> > available in the context, etc. -- maybe I should just write the damn
> > document into the database, scrape it, and delete it.
> >
> > Since I would use Xwiki to store a JSON "scrape" of the document in the
> > DB (as an xwiki doc), I could store it in XWiki.JavaScriptExtension[0]
> > of the retrieved document, and then just delete the wiki contents after
> > scraping.... So if anybody has any suggestions for "scraping" a
> > retrieved document stored as an Xwiki doc, please share them as well!
> > This seems like an area potentially fraught with peril that many people
> > have already dealt with, so I would appreciate advice.
> >
> > Thanks,
> >
> > Niels
> > http://nielsmayer.com
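
To make the Groovy route a bit more concrete, here is a rough, untested
sketch of what Vincent describes: fetch the page, parse it with the XHTML
parser into an XDOM entirely in memory, then walk the blocks. The component
lookup via Utils.getComponent, the "xhtml/1.0" parser hint and the
getChildrenByType() traversal are assumptions from memory and may differ in
your XWiki version; also, real-world pages are rarely valid XHTML, so some
cleanup step would probably be needed before parsing:

import com.xpn.xwiki.web.Utils
import org.xwiki.rendering.parser.Parser
import org.xwiki.rendering.block.ImageBlock
import org.xwiki.rendering.block.LinkBlock

// Fetch the page with plain Groovy; a real scraper would also set a
// User-Agent header to look like a browser, as Niels suggests.
def html = new URL("http://nielsmayer.com").text

// Look up the XHTML parser component (assumed hint "xhtml/1.0") and build
// an XDOM in memory -- nothing is saved to the database at this point.
Parser parser = Utils.getComponent(Parser.class, "xhtml/1.0")
def xdom = parser.parse(new StringReader(html))

// Walk the XDOM for image and link blocks (getChildrenByType is assumed
// here; other releases may expose a different traversal call).
def images = xdom.getChildrenByType(ImageBlock.class, true)
def links = xdom.getChildrenByType(LinkBlock.class, true)

images.each { println "image: ${it}" }
links.each { println "link: ${it}" }

Filtering the links by extension (the getAnchorHREFsByExtension() idea
above) would then just be a matter of inspecting each LinkBlock's reference.
And if you do end up keeping a JSON "scrape" in the wiki as you describe,
something along these lines should work from Groovy (newObject/set/save are
the usual document API calls; "code" is the JavaScriptExtension property I
assume you'd use, and doc/jsonScrape are placeholders):

// Store the scrape result in a XWiki.JavaScriptExtension object on doc,
// where doc is a com.xpn.xwiki.api.Document and jsonScrape the JSON string.
def obj = doc.newObject("XWiki.JavaScriptExtension")
obj.set("code", jsonScrape)
doc.save()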
_______________________________________________
devs mailing list
devs@xwiki.org
http://lists.xwiki.org/mailman/listinfo/devs
