Thanks for all the great ideas; this should help me get started.

Laura
On Mar 20, 2014, at 5:32 PM, Richard Frovarp <[email protected]> wrote:

> We used Nutch once upon a time. It seems to be a pretty good project; it just
> didn't fit our needs after a while. This was largely because our
> implementation was writing out new Apache Lucene indexes on every crawl.
> We then had to write our own code to serve as a front end for those indexes.
>
> We moved to using Apache Solr, which is pretty awesome. ElasticSearch would
> be another good option. I know Nutch can interface with Solr; I'm not sure
> about ElasticSearch.
>
> We use Apache Droids (incubating), which provides a large part of the
> framework needed to write your own crawler. This gives us total control over
> how everything works: we can handle content parsing however we like, and we
> can record extra information such as permanent redirects and the web of links
> between all of our pages. Knowing all of the links between everything, we can
> report on broken links and on links behind redirects. I don't know that you
> can get that out of Nutch. I am a member of the Droids project.
>
> If you're only after the content of this site in its own context (ignoring
> links to or from pages outside of it), and the content you're really after is
> in the database, just get the content out of the database. Crawling is so
> much messier than direct retrieval.
>
> If you're looking at using Nutch, I'd ask over on their list. They are bound
> to have good information for you on how you might be able to do it. I know
> such a thing would be possible with HttpClient, authenticating against CAS
> and picking up the application session; I just don't know how you'd integrate
> that with Nutch.
>
> On 03/20/2014 05:13 PM, Laura McCord wrote:
>> There was an idea about using Apache Nutch, though I've never used it
>> before.
>> I'm brainstorming here, but I wonder whether it would work to create a
>> little app that asks for credentials and, once they're entered, crawls a
>> given website using Nutch.
>>
>> Thanks,
>> Laura
>>
>> On Mar 20, 2014, at 5:01 PM, Richard Frovarp <[email protected]>
>> wrote:
>>
>>> On 03/20/2014 04:52 PM, Laura McCord wrote:
>>>> Hi,
>>>>
>>>> This might be a shot in the dark, but I was wondering if anyone has
>>>> experience web-crawling a website that is "CASified," where entering
>>>> your credentials lets the crawl proceed and obtain the content. If so,
>>>> did you use any specific technologies to perform the task?
>>>>
>>>> Thanks,
>>>> Laura
>>>>
>>> It kind of depends on what you're after here. Are you looking at letting
>>> Google through, or your own crawler?
>>>
>>> If it's your own, does it even need to be a web crawler? My experience
>>> with search is around Apache Solr. In that case, I'd just get the data
>>> directly out of the database and put it in Solr. Generally you get better
>>> search results if you don't have to mess with those pesky things we call
>>> web pages.
>>>
>>> --
>>> You are currently subscribed to [email protected] as:
>>> [email protected]
>>> To unsubscribe, change settings or access archives, see
>>> http://www.ja-sig.org/wiki/display/JSG/cas-user
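For reference, the approach Richard mentions (authenticating against CAS with an HTTP client and picking up the application session) looks roughly like the sketch below. It is written with Python's standard library purely for illustration rather than Java's HttpClient, and it is a sketch under assumptions: the CAS URL, service URL, credentials, and the hidden form fields (`lt` and `execution`, typical of CAS 3.x login pages) are all placeholders that vary by deployment.

```python
# Hedged sketch of a CAS form login: fetch the login page, pull out the
# hidden form fields (the login ticket "lt" and "execution" on a typical
# CAS 3.x form), POST the credentials, and let the redirect back to the
# service set the application's session cookie. All URLs, field names,
# and credentials below are assumptions, not a specific deployment.
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def extract_hidden_fields(html: str) -> dict:
    """Collect name/value pairs from <input type="hidden"> tags."""
    fields = {}
    for tag in re.findall(r'<input[^>]+type="hidden"[^>]*>', html):
        name = re.search(r'name="([^"]+)"', tag)
        value = re.search(r'value="([^"]*)"', tag)
        if name:
            fields[name.group(1)] = value.group(1) if value else ""
    return fields

def cas_login(cas_login_url: str, service_url: str,
              username: str, password: str):
    """Return an opener whose cookie jar holds the authenticated session."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    login_page = (cas_login_url + "?service="
                  + urllib.parse.quote(service_url, safe=""))
    html = opener.open(login_page).read().decode("utf-8")
    form = extract_hidden_fields(html)  # picks up "lt" and "execution"
    form.update({"username": username,
                 "password": password,
                 "_eventId": "submit"})
    data = urllib.parse.urlencode(form).encode("ascii")
    # CAS redirects to service_url?ticket=ST-..., the application validates
    # the ticket and drops its own session cookie into our jar.
    opener.open(login_page, data)
    return opener  # reuse this opener for every crawled URL
```

A crawler would then call `opener.open(page_url)` for each protected page; as long as the jar still holds the application's session cookie, no further CAS round-trips are needed. Wiring this into Nutch's fetcher, as noted above, is the unresolved part.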
