We used Nutch once upon a time. It seemed like a pretty good project; it just didn't fit our needs after a while. That was largely because our implementation was writing out a new Apache Lucene index on every crawl, and we then had to write our own code to front those indexes.

We moved to using Apache Solr, which is pretty awesome. ElasticSearch would be another good option. I know Nutch can interface with Solr; I'm not sure about ElasticSearch.

We use Apache Droids (incubating), which provides a large part of the framework necessary to write your own crawler. That gives us total control over how everything works: we handle the parsing of content how we'd like, and we record extra information such as permanent redirects and the web of links between all of our pages. Knowing all of the links between everything, we can report on broken links and on links that sit behind redirects. I don't know that you can get that out of Nutch. I am a member of the Droids project.
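To make the idea concrete, here's a hypothetical in-memory sketch of that bookkeeping (this is not Droids' actual API, just the shape of the data our crawler records):

```java
import java.util.*;

// Hypothetical link-graph recorder; Apache Droids' real API differs.
public class LinkGraph {
    private final Map<String, Set<String>> links = new HashMap<>();   // page -> outgoing links
    private final Map<String, Integer> statuses = new HashMap<>();    // url -> HTTP status
    private final Map<String, String> redirects = new HashMap<>();    // old url -> target of 301

    public void recordLink(String from, String to) {
        links.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    public void recordFetch(String url, int status) {
        statuses.put(url, status);
        }

    public void recordRedirect(String from, String to) {
        redirects.put(from, to);
    }

    // Links whose target errored (or was never fetched successfully).
    public List<String> brokenLinks() {
        List<String> out = new ArrayList<>();
        links.forEach((from, tos) -> tos.forEach(to -> {
            if (statuses.getOrDefault(to, 404) >= 400) {
                out.add(from + " -> " + to);
            }
        }));
        return out;
    }

    // Links that still point at the old side of a permanent redirect.
    public List<String> linksBehindRedirects() {
        List<String> out = new ArrayList<>();
        links.forEach((from, tos) -> tos.forEach(to -> {
            if (redirects.containsKey(to)) {
                out.add(from + " -> " + to + " (now " + redirects.get(to) + ")");
            }
        }));
        return out;
    }

    public static void main(String[] args) {
        LinkGraph g = new LinkGraph();
        g.recordFetch("/a", 200);
        g.recordFetch("/gone", 404);
        g.recordRedirect("/old", "/b");
        g.recordFetch("/old", 301);
        g.recordLink("/a", "/gone");
        g.recordLink("/a", "/old");
        System.out.println(g.brokenLinks());           // [/a -> /gone]
        System.out.println(g.linksBehindRedirects());  // [/a -> /old (now /b)]
    }
}
```

The reports fall out of the recorded graph for free once every fetch and redirect is logged during the crawl.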

If you're only after the content of this site in its own context (ignoring links to or from pages outside of it), and the content you're really after is in the database, just get the content out of the database. Crawling is much messier than direct retrieval.
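As a rough sketch of what "straight from the database" looks like (the `pages` table and its columns are made up for illustration; the flat field map is the shape you'd feed a SolrInputDocument):

```java
import java.util.*;

// Hypothetical: turn rows pulled straight from the CMS database into flat
// field maps ready to hand to a search engine. No HTML parsing required.
public class DbToSearchDocs {
    public static Map<String, Object> toDoc(String id, String title, String body) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("title", title);
        doc.put("content", body);   // clean text straight from the DB
        return doc;
    }

    public static void main(String[] args) {
        // In real code these rows would come from JDBC, e.g.:
        //   SELECT id, title, body FROM pages
        List<String[]> rows = List.of(
            new String[]{"1", "Home", "Welcome to the site."},
            new String[]{"2", "About", "Who we are."});
        for (String[] r : rows) {
            System.out.println(toDoc(r[0], r[1], r[2]));
        }
    }
}
```

Each map would then become one document sent to Solr; nothing in the pipeline ever touches a rendered web page.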

If you're looking at using Nutch, I'd ask over on their list. They're bound to have good information for you on how you might be able to do it. I know such a thing would be possible with HttpClient: authenticate against CAS, then pick up the application session. I just don't know how you'd integrate that with Nutch.
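For the HttpClient side, the flow would look roughly like this. This is a hypothetical sketch using the JDK's built-in HTTP client for brevity (Apache HttpClient works the same way); the field names ("username", "password", "lt", "execution", "_eventId") follow the standard CAS login form, but your deployment may differ:

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical CAS form-login sketch: GET the login page, pull out the
// hidden tickets, POST credentials, and let the cookie handler keep the
// CAS and application session cookies for subsequent crawl requests.
public class CasLogin {
    // Extract a hidden form field (e.g. the CAS "lt" or "execution" token).
    static String hiddenField(String html, String name) {
        Matcher m = Pattern.compile(
            "name=\"" + Pattern.quote(name) + "\"\\s+value=\"([^\"]*)\"").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Offline demo of the token extraction:
        String sample = "<input type=\"hidden\" name=\"lt\" value=\"LT-42-abc\" />";
        System.out.println(hiddenField(sample, "lt"));   // LT-42-abc

        if (args.length < 3) return;   // real login only when url/user/pass are supplied
        String loginUrl = args[0], user = args[1], pass = args[2];

        HttpClient client = HttpClient.newBuilder()
            .cookieHandler(new CookieManager())          // keeps TGC + app session cookies
            .followRedirects(HttpClient.Redirect.NORMAL) // follows the service-ticket redirect
            .build();

        // 1. GET the login form to obtain the login tickets.
        String page = client.send(HttpRequest.newBuilder(URI.create(loginUrl)).build(),
            HttpResponse.BodyHandlers.ofString()).body();
        String lt = hiddenField(page, "lt");
        String execution = hiddenField(page, "execution");

        // 2. POST the credentials back; the cookie handler picks up the session.
        String form = "username=" + enc(user) + "&password=" + enc(pass)
            + "&lt=" + enc(lt) + "&execution=" + enc(execution) + "&_eventId=submit";
        HttpResponse<String> resp = client.send(
            HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form)).build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println("login status: " + resp.statusCode());
    }
}
```

The open question is exactly the one above: how you'd wire that session-holding client into Nutch's fetcher rather than a hand-rolled crawler.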

On 03/20/2014 05:13 PM, Laura McCord wrote:
There was an idea about using Apache Nutch though I've never used it before. 
I'm brainstorming here, but if I can create a little app that asks for 
credentials and once entered will crawl using Nutch a given website... wondering 
if that would work.

Thanks,
Laura


On Mar 20, 2014, at 5:01 PM, Richard Frovarp <[email protected]> wrote:

On 03/20/2014 04:52 PM, Laura McCord wrote:
Hi,

This might be a shot in the dark but, I was wondering if anyone has any 
experience with web-crawling a website that is "CASified" but by entering your 
credentials it will proceed to crawl and obtain the content? If so, did you use 
any specific technologies to perform the task?

Thanks,
  Laura



It kind of depends on what you're after here. Are you looking at letting Google 
through, or your own crawler?

If it's your own, does it even need to be a web crawler? My experience with 
search is around Apache Solr. In that case, I'd just get the data directly out 
of the database and put it in Solr. Generally you get better search results if 
you don't have to mess with those pesky things we call web pages.

--
You are currently subscribed to [email protected] as: 
[email protected]
To unsubscribe, change settings or access archives, see 
http://www.ja-sig.org/wiki/display/JSG/cas-user


