Re: I want to volunteer some time

Eddie Drapkin Tue, 17 Jan 2012 13:31:56 -0800

Alrighty!

I checked out the JIRA and sort of attacked an issue I think I can contribute to... I'll look and try to find more as well.

I can certainly write documentation if that's a need (when isn't it?), just someone point me at the areas that need better documentation and I'll do what I can. You mentioned distributed mode, which is something I actually can't really document because it's not something we use - our crawler exists as a single intranet server and probably will for the foreseeable future. Do I need any special account privileges to edit wiki pages (username is EdwardDrapkin)?

We use Nutch here to crawl our various intranet sites to build Lucene indexes for a few search applications that we have (search.wolfram.com, mathworld, etc.). I've written a rather hefty plugin for it to accommodate some of the custom functionality we need (I'd guess it's ~20,000 lines of code). We have our search broken down by our sites (e.g. reference.wolfram.com is one index and mathworld is another), which are crawled separately, so a lot of our custom functionality is written in light of that, particularly scoring. Because it's custom code for a single purpose, a lot of the code is also there to curate the data going into the index (custom parsers for a particular site to remove navigation elements, for instance). The most (only, really) interesting thing that I've done with it is tracking wiki changes outside of the primary crawl database (I keep my own database of page modification times) and creating custom fetch lists, so that our wiki can be crawled nightly, as it's rather massive and hosted on a shared machine that can't support an intensive crawl every night. I've also re-created the Lucene index plugin as part of our plugin, as we don't use Solr, but our own search application.
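
A minimal sketch of the custom fetch-list idea described above, not the actual plugin code: it assumes two hypothetical mappings, one from URL to the page's recorded modification time and one from URL to the time it was last crawled, and it selects only the pages that changed since their last crawl (or were never crawled). The function and variable names are illustrative and are not part of Nutch.

```python
from datetime import datetime


def build_fetch_list(last_modified, last_crawled):
    """Return the URLs that need re-fetching, in sorted order.

    last_modified: dict of URL -> datetime the page was last changed
                   (hypothetical side database, kept outside the crawldb).
    last_crawled:  dict of URL -> datetime the page was last fetched.
    """
    due = []
    for url, modified in last_modified.items():
        crawled = last_crawled.get(url)
        # Fetch pages never crawled, or changed since the last crawl.
        if crawled is None or modified > crawled:
            due.append(url)
    return sorted(due)


if __name__ == "__main__":
    modified = {
        "wiki/A": datetime(2012, 1, 17),
        "wiki/B": datetime(2012, 1, 10),
        "wiki/C": datetime(2012, 1, 16),
    }
    crawled = {
        "wiki/A": datetime(2012, 1, 16),
        "wiki/B": datetime(2012, 1, 16),
    }
    print(build_fetch_list(modified, crawled))
```

Limiting the nightly fetch list to changed pages is what keeps the load on a shared host small; unchanged pages are simply left out of that night's crawl.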

I'm working now on creating a comprehensive link-graph of all links for a particular crawl configuration, while still only crawling the correct URLs, so that we can experiment with using various page scoring algorithms. This is why I wanted to not filter the links in the parse stage, so now I can have a crawldb with entries from anywhere on the internet while still only crawling a particular subdomain.
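
The separation described above, recording every outlink while fetching only in-scope URLs, can be sketched roughly as follows. This is an illustration under stated assumptions, not Nutch code: `in_scope`, `record_outlinks`, the `graph` dict, and the `frontier` set are all hypothetical names, and the scope test is a plain host comparison.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_host):
    """True if the URL's host is allowed_host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def record_outlinks(graph, frontier, source, outlinks, allowed_host):
    """Add every outlink to the link graph; queue only in-scope ones.

    graph:    dict of source URL -> set of target URLs (all links kept).
    frontier: set of URLs eligible to be fetched (scope-filtered).
    """
    for target in outlinks:
        # The graph keeps links to anywhere, so scoring can see them.
        graph.setdefault(source, set()).add(target)
        # Only URLs on the allowed host are ever scheduled for fetching.
        if in_scope(target, allowed_host):
            frontier.add(target)


if __name__ == "__main__":
    graph, frontier = {}, set()
    record_outlinks(
        graph,
        frontier,
        "http://reference.wolfram.com/a",
        ["http://reference.wolfram.com/b", "http://example.org/x"],
        "reference.wolfram.com",
    )
    print(graph)
    print(frontier)
```

Applying the scope check at scheduling time rather than at parse time is the design choice that lets the graph stay complete while the crawl itself stays confined.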

I'm not sure what the standard use case is for Nutch, but I think we're probably outside of it, though only a bit.


Thanks,
Eddie



On 1/17/2012 1:22 PM, Julien Nioche wrote:

Hi Eddie,

Great to hear that! Just to add to what Markus said, there are also quite a few tasks to do on the NutchGora branch if that's something you'd be interested in. Or outside the tasks on JIRA, there is always a fair bit to do on the Wiki e.g. how to run in distributed mode etc...

Just out of curiosity, could you tell us a bit about what you've been using Nutch for at Wolfram Research?


Thanks for volunteering

Julien

On 17 January 2012 19:15, Markus Jelsma wrote:


    Hi!

    Excellent! You may want to check the list of issues for 1.5. There are
    several issues being worked on from time to time and a number of open
    issues and even a few hairy problems. Contribution as patch or comment
    on any issue is always appreciated. You can also create issues to solve
    problems yourself as you did with the parser filters issue.

    Anything is welcome!

    Cheers,

    > Hello all,
    >
    > I've got a bunch of spare time coming up in the next several
    > weeks/months and would like to volunteer to help the project out.  I'm
    > already extremely familiar with the internals of Nutch, as I've been
    > hacking at it for our internal use here (at Wolfram Research) for the
    > last ~1.5 years or so.  While there's probably a fair amount of code
    > that I haven't read, I've at least visited and read some of all of the
    > areas of Nutch's core and most of the plugins.
    >
    > I think I should put that knowledge to good use and contribute back
    > (I've already sent some patches in, but nothing major or really even
    > that significant), but I'm not sure what needs to be done or where my
    > time would be best spent.  I just subscribed to this list, so if there's
    > a thread discussing priorities that's current and whatnot, can someone
    > point me to it in the archives?  Barring that, can someone point me in
    > the direction where I should be looking to contribute?  My best guess is
    > to just start attacking JIRA tickets...
    >
    > Thanks,
    > Eddie




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
