We used Nutch once upon a time. It seems to be a pretty good project; it
just didn't fit our needs after a while. This was largely because our
implementation was writing out new Apache Lucene indexes on every crawl,
and we then had to write our own code to front-end those indexes.
We moved to using Apache Solr, which is pretty awesome. ElasticSearch
would be another good option. I know Nutch can interface with Solr; I'm
not sure about ElasticSearch.
We use Apache Droids (incubating), which provides a large part of the
framework you need to write your own crawler. This gives us total
control over how everything works: we can handle the parsing of content
however we'd like, and we can record extra information such as permanent
redirects and the web of links between all of our pages. Knowing all of
the links between everything, we can report on broken links and on links
behind redirects. I don't know that you can get that out of Nutch. I am
a member of the Droids project.
If you're only after the content of this site in its own context
(ignoring links to or from pages outside of it), and the content you're
really after is in the database, just get the content out of the
database. Crawling is so much messier than direct retrieval.
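The "direct retrieval" route amounts to reading rows straight out of the site's database and turning them into search documents, skipping the crawler entirely. Here's a minimal Python sketch of that shape; the table, field names, and Solr URL in the comment are all invented for the example, and a real setup would POST the payload to Solr's /update handler.

```python
# Sketch of "just get the content out of the database": read rows directly
# and build Solr-style JSON documents. The schema here is hypothetical.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO pages (title, body) VALUES (?, ?)",
                 [("Home", "Welcome to the site."),
                  ("About", "Who we are.")])

docs = [{"id": str(pid), "title": title, "content": body}
        for pid, title, body in conn.execute("SELECT id, title, body FROM pages")]

payload = json.dumps(docs)
# payload is the body you'd POST to something like
# http://localhost:8983/solr/<core>/update?commit=true
```

No HTML parsing, no session handling, no politeness delays — which is exactly why direct retrieval tends to give cleaner search results than crawling the rendered pages.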
If you're looking at using Nutch, I'd go ask over on their list. They
are bound to have some good information for you on how you might be able
to do it. I know such a thing would be possible with HttpClient:
authenticate against CAS, pick up the application session, and crawl
from there. I just don't know how you'd integrate that with Nutch.
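The HttpClient-against-CAS flow roughly amounts to: GET the CAS login form, pull out its hidden fields (typically "lt" and "execution"), POST them back along with the credentials, and carry the resulting session cookies on every later request. The form parsing is the fiddly part, so here's a stdlib-only Python sketch of just that step (standing in for the Java HttpClient approach); the form HTML below is a simplified stand-in, not a real CAS page, and the credentials are placeholders.

```python
# Extract the hidden fields a CAS login form expects back, the first step
# of an authenticated crawl. The HTML and credentials are hypothetical.
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields from a login form."""
    def __init__(self):
        super().__init__()
        self.fields = {}
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

login_form = """
<form action="/cas/login" method="post">
  <input type="hidden" name="lt" value="LT-1234-example"/>
  <input type="hidden" name="execution" value="e1s1"/>
  <input type="text" name="username"/>
  <input type="password" name="password"/>
</form>
"""
parser = HiddenFieldParser()
parser.feed(login_form)
post_data = dict(parser.fields, username="someuser", password="secret")
# post_data now holds everything to POST back to /cas/login; the cookies
# returned on success are what the crawler would carry on each request.
```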
On 03/20/2014 05:13 PM, Laura McCord wrote:
There was an idea about using Apache Nutch, though I've never used it before.
I'm brainstorming here, but if I can create a little app that asks for
credentials and, once entered, will crawl a given website using
Nutch... wondering if that would work.
Thanks,
Laura
On Mar 20, 2014, at 5:01 PM, Richard Frovarp <[email protected]> wrote:
On 03/20/2014 04:52 PM, Laura McCord wrote:
Hi,
This might be a shot in the dark, but I was wondering if anyone has any
experience with web-crawling a website that is "CASified", where by entering
your credentials it will proceed to crawl and obtain the content? If so, did
you use any specific technologies to perform the task?
Thanks,
Laura
It kind of depends on what you're after here. Are you looking at letting Google
through, or your own crawler?
If it's your own, does it even need to be a web crawler? My experience with
search is around Apache Solr. In that case, I'd just get the data directly out
of the database and put it in Solr. Generally you get better search results if
you don't have to mess with those pesky things we call web pages.
--
You are currently subscribed to [email protected] as:
[email protected]
To unsubscribe, change settings or access archives, see
http://www.ja-sig.org/wiki/display/JSG/cas-user