Hi:

How can we write a Nutch crawler plugin that uses the native Nutch
plugins and imitates the Nutch crawler in Droids? Is Nutch interested
in such a plugin?

Please test and report feedback to [EMAIL PROTECTED]. I will happily
answer all mails there.


I am no expert; I am still on the Nutch/Solr learning curve. However,
I will try to summarize my need and how I think Droids can help (I am
not sure it can). Maybe this will spark something, maybe not. I must
also add that my Java knowledge is limited.

As I mentioned, I have been using a Python-based crawler/aggregator. I
crawl XML files, convert them to Solr XML (fields, boosts, etc.), and
use Solr for indexing, searching, etc.
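
To make the conversion step concrete, here is a minimal sketch of
turning crawled records into a Solr XML update message. It is not the
actual crawler code: the function name, the record layout (plain
dicts), and the per-field boost mapping are assumptions made for
illustration, and the boost attribute only applies to Solr versions
that still support index-time boosts.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records, boosts=None):
    """Build a Solr <add> update message from a list of dicts.

    `records` and `boosts` are hypothetical inputs: each record maps
    field names to values, and `boosts` maps field names to a boost.
    """
    boosts = boosts or {}
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # A list value becomes a repeated (multi-valued) field.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc, "field", name=name)
                if name in boosts:
                    field.set("boost", str(boosts[name]))
                field.text = str(item)
    # ElementTree escapes &, < and > in the field text for us.
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_add_xml([{"id": "doc-1", "title": "Hello"}],
                          boosts={"title": 2.0}))
```

Building the message with an XML library rather than string
concatenation means special characters in crawled text are escaped
automatically.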

Now the Python crawler is not scaling well, which is no surprise since
it was never meant to. Nutch, however, scales very well. So after
seeing the Solr/Nutch integration article I got very excited, but the
problem is as follows.

If I use those patches and fixes, I let Nutch do both the crawling and
the indexing! This means I will not be able to use Solr for indexing,
i.e. I will miss quite a lot of Solr features that Nutch doesn't have.
This was not clear to me before, so I am desperately looking for a
crawler, and I have found an interesting project (I am not fond of
XPath, but I can live with it):

http://web-harvest.sourceforge.net/

Interesting, but slow. I do believe a scraping tool like this would be
a nice extension to Droids. Anyway, when Thorsten announced Droids
this morning I was again very excited and gave it a go.
So I am interested to see how Droids can help me. My primary need is
a scalable crawler that saves the crawled data as Solr XML and, at
the end, maybe even executes post.jar to post it to the Solr index.
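
For the posting step, what post.jar does is roughly an HTTP POST of
the XML to Solr's update handler followed by a commit. A rough Python
sketch of the same idea is below; the URL is an assumption (a local
Solr with the classic XML update handler), and both function names are
made up for illustration.

```python
import urllib.request

# Assumption: a local Solr instance exposing the classic XML update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) an HTTP POST carrying a Solr XML message."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_to_solr(xml_text, url=SOLR_UPDATE_URL):
    """Send the update message, then a <commit/> so the docs become searchable."""
    for message in (xml_text, "<commit/>"):
        with urllib.request.urlopen(build_update_request(message, url)) as response:
            response.read()  # drain the response; Solr answers with a status document
```

Keeping the request construction separate from the sending makes it
possible to check what would be posted without a running Solr.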

I think one can hack Nutch to get what I want, but I am nowhere near
able to hack Nutch myself. My project would also benefit from Nutch's
parsers: parse-rss and parse-xml.

Not sure if the above says much, but that is what I need, and any
help or pointer is very much appreciated.

Regards

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML                consulting, training and solutions


