We used Nutch once upon a time. It seems to be a pretty good project; it
just didn't fit our needs after a while. This was largely because our
implementation was writing out new Apache Lucene indexes on every crawl,
and we then had to write our own code to front-end those indexes.
We moved to using Apache Solr, which is pretty awesome. ElasticSearch
would be another good option. I know Nutch can interface with Solr; I'm
not sure about ElasticSearch.
We use Apache Droids (incubating), which provides a large part of the
framework you need to write your own crawler. This gives us total
control over how everything works: we can handle the parsing of content
however we'd like, and we can record extra information such as permanent
redirects and the web of links between all of our pages. Knowing all of
the links between everything, we can report on broken links and on links
behind redirects. I don't know that you can get that out of Nutch. I am
a member of the Droids project.
If you're only after the content of this site in its own context
(ignoring links to or from pages outside of it), and the content you're
really after is in the database, just get the content out of the
database. Crawling is so much messier than direct retrieval.
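The "direct retrieval" route amounts to reading rows straight out of the site's database and turning them into search documents, skipping the crawler entirely. Here's a minimal Python sketch of that shape; the table, field names, and Solr URL in the comment are all invented for the example, and a real setup would POST the payload to Solr's /update handler.

```python
# Sketch of "just get the content out of the database": read rows directly
# and build Solr-style JSON documents. The schema here is hypothetical.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany("INSERT INTO pages (title, body) VALUES (?, ?)",
                 [("Home", "Welcome to the site."),
                  ("About", "Who we are.")])

docs = [{"id": str(pid), "title": title, "content": body}
        for pid, title, body in conn.execute("SELECT id, title, body FROM pages")]

payload = json.dumps(docs)
# payload is the body you'd POST to something like
# http://localhost:8983/solr/<core>/update?commit=true
```

No HTML parsing, no session handling, no politeness delays — which is exactly why direct retrieval tends to give cleaner search results than crawling the rendered pages.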
If you're looking at using Nutch, I'd go ask over on their list. They
are bound to have some good information for you on how you might be able
to do it. I know such a thing would be possible with HttpClient:
authenticate against CAS, pick up the application session, and crawl
from there. I just don't know how you'd integrate that with Nutch.
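The HttpClient-against-CAS flow roughly amounts to: GET the CAS login form, pull out its hidden fields (typically "lt" and "execution"), POST them back along with the credentials, and carry the resulting session cookies on every later request. The form parsing is the fiddly part, so here's a stdlib-only Python sketch of just that step (standing in for the Java HttpClient approach); the form HTML below is a simplified stand-in, not a real CAS page, and the credentials are placeholders.

```python
# Extract the hidden fields a CAS login form expects back, the first step
# of an authenticated crawl. The HTML and credentials are hypothetical.
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect hidden <input> fields from a login form."""
    def __init__(self):
        super().__init__()
        self.fields = {}
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

login_form = """
<form action="/cas/login" method="post">
  <input type="hidden" name="lt" value="LT-1234-example"/>
  <input type="hidden" name="execution" value="e1s1"/>
  <input type="text" name="username"/>
  <input type="password" name="password"/>
</form>
"""
parser = HiddenFieldParser()
parser.feed(login_form)
post_data = dict(parser.fields, username="someuser", password="secret")
# post_data now holds everything to POST back to /cas/login; the cookies
# returned on success are what the crawler would carry on each request.
```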
On 03/20/2014 05:13 PM, Laura McCord wrote:
There was an idea about using Apache Nutch, though I've never used it before.
I'm brainstorming here, but if I can create a little app that asks for
credentials and, once entered, will crawl a given website using
Nutch... wondering if that would work.
Thanks,
Laura
On Mar 20, 2014, at 5:01 PM, Richard Frovarp <[email protected]> wrote:
On 03/20/2014 04:52 PM, Laura McCord wrote:
Hi,
This might be a shot in the dark, but I was wondering if anyone has any
experience with web-crawling a website that is "CASified", where by entering
your credentials it will proceed to crawl and obtain the content? If so, did
you use any specific technologies to perform the task?
Thanks,
Laura
It kind of depends on what you're after here. Are you looking at letting Google
through, or your own crawler?
If it's your own, does it even need to be a web crawler? My experience with
search is around Apache Solr. In that case, I'd just get the data directly out
of the database and put it in Solr. Generally you get better search results if
you don't have to mess with those pesky things we call web pages.
--
You are currently subscribed to [email protected] as:
[email protected]
To unsubscribe, change settings or access archives, see
http://www.ja-sig.org/wiki/display/JSG/cas-user