Here I will try to explain my idea better:
- In my day-to-day work as a webmaster, I have many repetitive "click actions" to do.
Hmm, a little boring, so I started playing with:
-- Ruby (http://en.wikipedia.org/wiki/Ruby_(programming_language))
-- Mechanize (http://mechanize.rubyforge.org/mechanize/)
-- Hpricot (XML parser)
A few lines of code later... and I'm a happy webmaster.
But not really, in fact. Now I would like to write less code and more "just
instructions". Passing instructions as XML could be very nice.
Consider this use case:
- I have the "enterprise web yellow pages" (nearly an LDAP directory) and my
enterprise CMS (no "dev" solution possible, JUST clicks), and I have to
pass some information from the yellow pages to the CMS.
- So in a cool "droids world", I would like to do something like this:
- write a droid-configuration.xml: set which worker to use, configure the link
depth to follow, set the DelayTimer in seconds, ...
- write a droids-job.xml: go to this page, fill in this form, select the links in
{xpath}, follow this link, extract the {xpath} and save it, go to this page
and fill in the form with the saved information.
... With that, a really happy webmaster! :)
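
To make the idea a bit more concrete, here is a minimal sketch of how such a job file could be read into a list of steps. Everything in it is an assumption for illustration: the element and attribute names (`job`, `goto`, `fill`, `extract`, `save-as`, `from-saved`), the URLs, and the `load_job` helper are hypothetical and are not part of droids.

```python
import xml.etree.ElementTree as ET

# Hypothetical job file: none of these element or attribute names exist in droids.
JOB_XML = """
<job delay-seconds="2" link-depth="1">
  <goto url="http://example.com/yellow-page"/>
  <fill form="search" field="name" value="Smith"/>
  <extract xpath=".//div[@id='result']/p" save-as="phone"/>
  <goto url="http://example.com/cms/edit"/>
  <fill form="edit" field="phone" from-saved="phone"/>
</job>
"""


def load_job(xml_text):
    """Return (settings, steps).

    settings: the attributes of the root element, as a dict of strings.
    steps: one (action, params) tuple per child element, in document order.
    """
    root = ET.fromstring(xml_text)
    settings = dict(root.attrib)
    steps = [(step.tag, dict(step.attrib)) for step in root]
    return settings, steps


settings, steps = load_job(JOB_XML)
for action, params in steps:
    # A real runner would dispatch each action to a worker here.
    print(action, params)
```

The point is only that the "instructions" stay in the XML, and the code is reduced to a generic loop over the steps.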
What do you think about that?
See you later
On Tue, 14 Jul 2009 09:56:33 +0200, Thorsten Scherler wrote:
> On Mon, 2009-07-13 at 16:49 +0200, Florent André wrote:
>> Hi Droids list!
>>
>> After a talk during the Lenya meeting with Señor Thorsten (Olé! :) ), I
>> would like to have more information about droids.
>
> Hello Mr. Florent, welcome to Droids. ;)
>
>> I know that droids is not only a web crawler (and I would like to use it
>> for other things), but my immediate need is about crawling...
>
> I will try to put what now comes as an XML document in terms of droids.
> I guess putting it on our wiki, http://cwiki.apache.org/DROIDS/, will be
> helpful for future reference.
>
>> So let's go :
>>
>> I would like to pass to droids an xml like (just an example) :
>> <article>
>> <droids:url>http://example.com/test.html</droids:url>
>
> In droids crawling, the URL is the entry point of the processing. What
> happens then is highly configurable, and Ming Fai has recently suggested
> some changes for the future. I will describe the possibilities that
> droids currently offers for the presented use case.
>
> As I said, we start with the queue, where you inject the starting URLs.
> Then this queue will call a worker (which basically is the part of the
> code where the real work is done). This worker may call a linkExtractor
> and/or a Parser to extract links and any other information about the
> incoming page.
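
A rough sketch of the queue, worker, and link extractor flow described above might look like this. It is not droids code and does not use the droids API: the `PAGES` dictionary is a hypothetical stand-in for the web so that the example runs offline, and `LinkExtractor`, `worker`, and `crawl` are illustrative names only.

```python
from collections import deque
from html.parser import HTMLParser

# Stand-in for the web so the sketch runs offline (hypothetical pages).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": "<p>no links here</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def worker(url):
    """Do the 'real work' for one URL: load the page and return its links."""
    extractor = LinkExtractor()
    extractor.feed(PAGES.get(url, ""))
    return extractor.links


def crawl(start_urls, max_depth=1):
    """Inject start URLs into a queue, hand each one to the worker,
    and enqueue newly found links until max_depth is reached."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links found on pages at the depth limit
        for link in worker(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


visited = crawl(["http://example.com/"], max_depth=1)
print(visited)
```

A parser for other page information would plug in at the same place as the link extractor, inside the worker.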
>
>> <title>
>> <droids:xpath>html/body/d...@id='content']/d...@id='title']/h1</droids:xpath>
>> </title>
>> <firstparagraph>
>> <droids:xpath>html/body/d...@id='content']/d...@id='article']/p[position()=1]</droids:xpath>
>> </firstparagraph>
>> <othertext>
>> <droids:xpath>html/body/d...@id='content']/d...@id='article']/p[position()>1]</droids:xpath>
>> </othertext>
>> </article>
>>
>> and droids would give me something like:
>> <article>
>> <title> this is the article title </title>
>> <firstparagraph> This article is about the....</firstparagraph>
>> <othertext>bla bla bla bla bla...</othertext>
>> </article>
>
> You could use a simple XSL transformation for that. You can develop the
> XSL stylesheet (basically the xpaths) to extract the info with Lenya as
> usual. Just use a generator to get the source and then add the
> transformer, which will return the above doc. You would then copy this
> stylesheet to your droids plugin and use it to generate a result output
> stream. You would pass this stream to the save handler of droids, which
> then saves the stream to the location you want.
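
As an illustration of the extraction step only, here is a small sketch that builds the result document from a page. Note that it swaps the XSL transformation for plain XPath lookups with the limited XPath subset of Python's `xml.etree.ElementTree`, because the standard library has no XSLT processor. The sample `PAGE`, the paths (modelled on the request, not copied from it), and `extract_article` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed XML and would need an HTML parser or a cleanup step first.
PAGE = """
<html><body><div id="content">
  <div id="title"><h1>This is the article title</h1></div>
  <div id="article">
    <p>This article is about the...</p><p>bla bla</p><p>more bla</p>
  </div>
</div></body></html>
"""


def extract_article(page_text):
    """Build an <article> element from the title and paragraphs of a page."""
    page = ET.fromstring(page_text)
    title = page.findtext("./body/div[@id='content']/div[@id='title']/h1")
    paragraphs = page.findall("./body/div[@id='content']/div[@id='article']/p")

    article = ET.Element("article")
    ET.SubElement(article, "title").text = title
    # ElementTree has no position()>1 predicate, so slice the list instead.
    first = paragraphs[0].text if paragraphs else ""
    rest = " ".join(p.text or "" for p in paragraphs[1:])
    ET.SubElement(article, "firstparagraph").text = first
    ET.SubElement(article, "othertext").text = rest
    return article


article = extract_article(PAGE)
print(ET.tostring(article, encoding="unicode"))
```

With a real stylesheet, the same three lookups would live in the XSL templates, and the serialized result would be the stream handed to the save handler.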
>
>> So my questions are:
>>
>> 1) Is it possible?
>
> Yes certainly.
>
>>
>> 2) If yes, will I have to (keep in mind that I'm not a Java superstar):
>> a) install droids, type 2 command lines, and off I go (1 hour of work)
>
> No, droids is a very loose framework and we do not have the specific use
> case you ask for in our code base (maybe afterwards). ;)
>
>> b) install droids, really understand how droids works, code
>> some classes (3 weeks of work)
>
> Hehe, that is the most valuable option, but for your use case it should
> not be necessary.
>
>> c) install droids, create a class from an existing one, do some trial
>> and error (4-5 days of work)
>
> Yeah, I guess that is realistic with testing and so on.
>
>> d) ...
>>
>> 3) Is it difficult to plug droids into a Lenya (Cocoon-based) app?
>
> Actually, not at all. I recommend first coding your bot in droids, then
> generating the jar and copying it to your Lenya module. Do not forget the
> dependencies that your droid may have, and add them to the lib dir of
> your module.
>
> I hope this helps to give you the general idea.
>
> Regards
>
>>
>> Thanks for your answer,
>>
>> Regards