Alrighty!

I looked through JIRA and picked up an issue I think I can contribute to. I'll keep looking and try to find more as well.

I can certainly write documentation if that's a need (when isn't it?); if someone can point me at the areas that need better documentation, I'll do what I can. You mentioned distributed mode, which is something I can't really document because we don't use it: our crawler runs on a single intranet server and probably will for the foreseeable future. Do I need any special account privileges to edit wiki pages (username is EdwardDrapkin)?

We use Nutch here to crawl our various intranet sites to build Lucene indexes for a few search applications that we have (search.wolfram.com, mathworld, etc.). I've written a rather hefty plugin for it to accommodate some of the custom functionality we need (I'd guess it's ~20,000 lines of code). We have our search broken down by site (e.g. reference.wolfram.com is one index and mathworld is another), and the sites are crawled separately, so a lot of our custom functionality is written in light of that, particularly scoring. Because it's custom code for a single purpose, a lot of the code is also there to curate the data going into the index (custom parsers for a particular site to remove navigation elements, for instance). The most (only, really) interesting thing that I've done with it is tracking wiki changes outside of the primary crawl database (I keep my own database of page modification times) and creating custom fetch lists, so that our wiki can still be crawled nightly; it's rather massive and hosted on a shared machine that can't support an intensive crawl every night. I've also re-created the Lucene index plugin as part of our plugin, as we use our own search application rather than Solr.
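To illustrate the fetch-list idea above, here is a minimal sketch. It is not the actual plugin code and uses no Nutch API; the function name, the two dictionaries, and the integer timestamps are all assumptions made for illustration. It picks only the pages whose recorded modification time is newer than the last fetch, or that have never been fetched.

```python
from typing import Dict, List


def build_fetch_list(modified_at: Dict[str, int],
                     last_fetched: Dict[str, int]) -> List[str]:
    """Return the URLs that are due for a fetch.

    modified_at:  hypothetical side database, URL -> last modification time
    last_fetched: hypothetical record, URL -> time of our last fetch
    A URL is due if it was never fetched or changed after the last fetch.
    """
    due = [
        url
        for url, mtime in modified_at.items()
        if url not in last_fetched or mtime > last_fetched[url]
    ]
    # Sort so the resulting list is stable from run to run.
    return sorted(due)


if __name__ == "__main__":
    modified = {"http://wiki/a": 100, "http://wiki/b": 200, "http://wiki/c": 300}
    fetched = {"http://wiki/a": 150, "http://wiki/b": 150}
    for url in build_fetch_list(modified, fetched):
        print(url)
```

The point of the design is that the nightly run touches only changed pages, so the shared wiki host never sees a full crawl.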

I'm now working on creating a comprehensive link graph of all links for a particular crawl configuration, while still only crawling the correct URLs, so that we can experiment with various page scoring algorithms. This is why I didn't want to filter the links in the parse stage: now I can have a crawldb with entries from anywhere on the internet while still only crawling a particular subdomain.
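A rough sketch of that separation, again with hypothetical names and no Nutch API: every outlink is recorded in the link graph, and the host check is applied only when deciding what to fetch. The subdomain rule shown (exact host or any host below it) is an assumption, not a description of the real configuration.

```python
from typing import Dict, Iterable, List, Set
from urllib.parse import urlsplit


def in_scope(url: str, allowed_host: str) -> bool:
    """True if the URL's host is allowed_host or a subdomain of it."""
    host = urlsplit(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def record_outlinks(link_graph: Dict[str, Set[str]], source: str,
                    outlinks: Iterable[str], allowed_host: str) -> List[str]:
    """Store all outlinks unfiltered, return only the ones to fetch."""
    links = list(outlinks)
    # The graph keeps every link, including links to other sites.
    link_graph.setdefault(source, set()).update(links)
    # Scope is enforced here, at selection time, not at parse time.
    return sorted(u for u in links if in_scope(u, allowed_host))


if __name__ == "__main__":
    graph: Dict[str, Set[str]] = {}
    to_fetch = record_outlinks(
        graph,
        "http://reference.wolfram.com/",
        ["http://reference.wolfram.com/guide", "http://example.org/x"],
        "reference.wolfram.com",
    )
    print(to_fetch)
    print(graph)
```

Keeping the out-of-scope links in the graph is what makes scoring experiments possible later, since link-based scores need the full set of edges.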

I'm not sure what the standard use case for Nutch is, but I think we're probably a bit outside of it, though only a bit.

Thanks,
Eddie



On 1/17/2012 1:22 PM, Julien Nioche wrote:
Hi Eddie,

Great to hear that! Just to add to what Markus said, there are also quite a few tasks to do on the NutchGora branch if that's something you'd be interested in. Or, outside the tasks on JIRA, there is always a fair bit to do on the Wiki, e.g. how to run in distributed mode, etc.

Just out of curiosity, could you tell us a bit about what you've been using Nutch for at Wolfram Research?

Thanks for volunteering

Julien

On 17 January 2012 19:15, Markus Jelsma <[email protected]> wrote:

    Hi!

    Excellent! You may want to check the list of issues for 1.5. There are several
    issues being worked on from time to time and a number of open issues and even
    a few hairy problems. Contribution as patch or comment on any issue is always
    appreciated. You can also create issues to solve problems yourself as you did
    with the parser filters issue.

    Anything is welcome!

    Cheers,

    > Hello all,
    >
    > I've got a bunch of spare time coming up in the next several
    > weeks/months and would like to volunteer to help the project out.  I'm
    > already extremely familiar with the internals of Nutch, as I've been
    > hacking at it for our internal use here (at Wolfram Research) for the
    > last ~1.5 years or so.  While there's probably a fair amount of code
    > that I haven't read, I've at least visited and read some of all of the
    > areas of Nutch's core and most of the plugins.
    >
    > I think I should put that knowledge to good use and contribute back
    > (I've already sent some patches in, but nothing major or really even
    > that significant), but I'm not sure what needs to be done or where my
    > time would be best spent.  I just subscribed to this list, so if there's
    > a thread discussing priorities that's current and whatnot, can someone
    > point me to it in the archives?  Barring that, can someone point me in
    > the direction where I should be looking to contribute?  My best guess is
    > to just start attacking JIRA tickets...
    >
    > Thanks,
    > Eddie




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
