Thanks for all the great ideas; this should help me get started.

Laura
On Mar 20, 2014, at 5:32 PM, Richard Frovarp <[email protected]> wrote:

> We used Nutch once upon a time. It seems to be a pretty good project; it just
> didn't fit our needs after a while. This was largely because our
> implementation was writing out new Apache Lucene indexes on every crawl.
> We then had to write our own code to serve as a front end for those indexes.
>
> We moved to using Apache Solr, which is pretty awesome. ElasticSearch would
> be another good option. I know Nutch can interface with Solr; I'm not sure
> about ElasticSearch.
>
> We use Apache Droids (incubating), which provides a large part of the
> framework needed to write your own crawler. This gives us total control over
> how everything works: we can handle content parsing however we like, and we
> can record extra information such as permanent redirects and the web of links
> between all of our pages. Knowing all of the links between everything, we can
> report on broken links and on links behind redirects. I don't know that you
> can get that out of Nutch. I am a member of the Droids project.
>
> If you're only after the content of this site in its own context (ignoring
> links to or from pages outside of it), and the content you're really after is
> in the database, just get the content out of the database. Crawling is so
> much messier than direct retrieval.
>
> If you're looking at using Nutch, I'd ask over on their list. They are bound
> to have good information for you on how you might be able to do it. I know
> such a thing would be possible with HttpClient, authenticating against CAS
> and picking up the application session; I just don't know how you'd integrate
> that with Nutch.
>
> On 03/20/2014 05:13 PM, Laura McCord wrote:
>> There was an idea about using Apache Nutch, though I've never used it
>> before.
>> I'm brainstorming here, but I wonder whether it would work to create a
>> little app that asks for credentials and, once they're entered, crawls a
>> given website using Nutch.
>>
>> Thanks,
>> Laura
>>
>> On Mar 20, 2014, at 5:01 PM, Richard Frovarp <[email protected]>
>> wrote:
>>
>>> On 03/20/2014 04:52 PM, Laura McCord wrote:
>>>> Hi,
>>>>
>>>> This might be a shot in the dark, but I was wondering if anyone has
>>>> experience web-crawling a website that is "CASified," where entering
>>>> your credentials lets the crawl proceed and obtain the content. If so,
>>>> did you use any specific technologies to perform the task?
>>>>
>>>> Thanks,
>>>> Laura
>>>>
>>> It kind of depends on what you're after here. Are you looking at letting
>>> Google through, or your own crawler?
>>>
>>> If it's your own, does it even need to be a web crawler? My experience
>>> with search is around Apache Solr. In that case, I'd just get the data
>>> directly out of the database and put it in Solr. Generally you get better
>>> search results if you don't have to mess with those pesky things we call
>>> web pages.
>>>
>>> --
>>> You are currently subscribed to [email protected] as:
>>> [email protected]
>>> To unsubscribe, change settings or access archives, see
>>> http://www.ja-sig.org/wiki/display/JSG/cas-user
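For reference, the approach Richard mentions (authenticating against CAS with an HTTP client and picking up the application session) looks roughly like the sketch below. It is written with Python's standard library purely for illustration rather than Java's HttpClient, and it is a sketch under assumptions: the CAS URL, service URL, credentials, and the hidden form fields (`lt` and `execution`, typical of CAS 3.x login pages) are all placeholders that vary by deployment.

```python
# Hedged sketch of a CAS form login: fetch the login page, pull out the
# hidden form fields (the login ticket "lt" and "execution" on a typical
# CAS 3.x form), POST the credentials, and let the redirect back to the
# service set the application's session cookie. All URLs, field names,
# and credentials below are assumptions, not a specific deployment.
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def extract_hidden_fields(html: str) -> dict:
    """Collect name/value pairs from <input type="hidden"> tags."""
    fields = {}
    for tag in re.findall(r'<input[^>]+type="hidden"[^>]*>', html):
        name = re.search(r'name="([^"]+)"', tag)
        value = re.search(r'value="([^"]*)"', tag)
        if name:
            fields[name.group(1)] = value.group(1) if value else ""
    return fields

def cas_login(cas_login_url: str, service_url: str,
              username: str, password: str):
    """Return an opener whose cookie jar holds the authenticated session."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    login_page = (cas_login_url + "?service="
                  + urllib.parse.quote(service_url, safe=""))
    html = opener.open(login_page).read().decode("utf-8")
    form = extract_hidden_fields(html)  # picks up "lt" and "execution"
    form.update({"username": username,
                 "password": password,
                 "_eventId": "submit"})
    data = urllib.parse.urlencode(form).encode("ascii")
    # CAS redirects to service_url?ticket=ST-..., the application validates
    # the ticket and drops its own session cookie into our jar.
    opener.open(login_page, data)
    return opener  # reuse this opener for every crawled URL
```

A crawler would then call `opener.open(page_url)` for each protected page; as long as the jar still holds the application's session cookie, no further CAS round-trips are needed. Wiring this into Nutch's fetcher, as noted above, is the unresolved part.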
