Hi Eddie, I've added you to the AdminGroup for our wiki, so you will be able to edit whichever areas you are interested in, or which you think can/should be improved.
Your introduction sounds really interesting, and as Markus & Julien have said, there are a lot of issues which merit some input. It's great that you are able to contribute. Just a quick side-note: as Julien said, we also maintain a Nutchgora branch, which has some unique characteristics which you might find interesting.

Best for now

Lewis

On Tue, Jan 17, 2012 at 9:31 PM, Eddie Drapkin <[email protected]> wrote:

> Alrighty!
>
> I checked out the JIRA and sort of attacked an issue I think I can
> contribute to... I'll look and try to find more as well.
>
> I can certainly write documentation if that's a need (when isn't it?),
> just someone point me at the areas that need better documentation and I'll
> do what I can. You mentioned distributed mode, which is something I
> actually can't really document because it's not something we use - our
> crawler exists as a single intranet server and probably will for the
> foreseeable future. Do I need any special account privileges to edit wiki
> pages (username is EdwardDrapkin)?
>
> We use Nutch here to crawl our various intranet sites to build Lucene
> indexes for a few search applications that we have (search.wolfram.com,
> mathworld, etc.). I've written a rather hefty plugin for it to accommodate
> some of the custom functionality we need (I'd guess it's ~20,000 lines of
> code). We have our search broken down by our sites (e.g.
> reference.wolfram.com is one index and mathworld is another), which are
> crawled separately, so a lot of our custom functionality is written in
> light of that, particularly scoring. Because it's custom code for a single
> purpose, a lot of the code is also there to curate the data going into the
> index (custom parsers for a particular site to remove navigation elements,
> for instance). The most (only, really) interesting thing that I've done
> with it is tracking wiki changes outside of the primary crawl database (I
> keep my own database of page modification times) and creating custom fetch
> lists, so that our wiki can be crawled nightly, as it's rather massive and
> hosted on a shared machine that can't support an intensive crawl every
> night. I've also re-created the lucene index plugin as part of our plugin,
> as we don't use Solr, but our own search application.
>
> I'm working now on creating a comprehensive link-graph of all links for a
> particular crawl configuration, while still only crawling the correct URLs,
> so that we can experiment with using various page scoring algorithms. This
> is why I wanted to not filter the links in the parse stage, so now I can
> have a crawldb with entries from anywhere on the internet while still only
> crawling a particular subdomain.
>
> I'm not sure what the standard use case is for Nutch, but I think we're
> probably a bit outside of it, but only a bit.
>
> Thanks,
> Eddie
>
> On 1/17/2012 1:22 PM, Julien Nioche wrote:
>
> Hi Eddie,
>
> Great to hear that! Just to add to what Markus said there are also quite a
> few tasks to do on the NutchGora branch if that's something you'd be
> interested in. Or outside the tasks on JIRA, there is always a fair bit to
> do on the Wiki e.g. how to run in distributed mode etc...
>
> Just out of curiosity, could you tell us a bit about what you've been
> using Nutch for at Wolfram Research?
>
> Thanks for volunteering
>
> Julien
>
> On 17 January 2012 19:15, Markus Jelsma <[email protected]> wrote:
>
>> Hi!
>>
>> Excellent! You may want to check the list of issues for 1.5. There are several
>> issues being worked on from time to time and a number of open issues and even
>> a few hairy problems. Contribution as patch or comment on any issue is always
>> appreciated. You can also create issues to solve problems yourself as you did
>> with the parser filters issue.
>>
>> Anything is welcome!
>>
>> Cheers,
>>
>> > Hello all,
>> >
>> > I've got a bunch of spare time coming up in the next several
>> > weeks/months and would like to volunteer to help the project out. I'm
>> > already extremely familiar with the internals of Nutch, as I've been
>> > hacking at it for our internal use here (at Wolfram Research) for the
>> > last ~1.5 years or so. While there's probably a fair amount of code
>> > that I haven't read, I've at least visited and read some of all of the
>> > areas of Nutch's core and most of the plugins.
>> >
>> > I think I should put that knowledge to good use and contribute back
>> > (I've already sent some patches in, but nothing major or really even
>> > that significant), but I'm not sure what needs to be done or where my
>> > time would be best spent. I just subscribed to this list, so if there's
>> > a thread discussing priorities that's current and whatnot, can someone
>> > point me to it in the archives? Barring that, can someone point me in
>> > the direction where I should be looking to contribute? My best guess is
>> > to just start attacking JIRA tickets...
>> >
>> > Thanks,
>> > Eddie
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
*Lewis*

