Hi Eddie,

I've added you to the AdminGroup for our wiki, so you will be able to edit
whichever areas you are interested in, or which you think can/should be
improved.

Your introduction sounds really interesting and, as Markus & Julien have
said, there are a lot of issues which merit some input, so it's great that
you are able to contribute. Just a quick side note: as Julien said, we also
maintain a NutchGora branch, which has some unique characteristics which you
might find interesting.

Best for now

Lewis

On Tue, Jan 17, 2012 at 9:31 PM, Eddie Drapkin <[email protected]> wrote:

>  Alrighty!
>
> I checked out the JIRA and sort of attacked an issue I think I can
> contribute to... I'll look and try to find more as well.
>
> I can certainly write documentation if that's a need (when isn't it?),
> just someone point me at the areas that need better documentation and I'll
> do what I can.  You mentioned distributed mode, which is something I
> actually can't really document because it's not something we use - our
> crawler exists as a single intranet server and probably will for the
> foreseeable future.  Do I need any special account privileges to edit wiki
> pages (username is EdwardDrapkin)?
>
> We use Nutch here to crawl our various intranet sites to build Lucene
> indexes for a few search applications that we have (search.wolfram.com,
> mathworld, etc.).  I've written a rather hefty plugin for it to accommodate
> some of the custom functionality we need (I'd guess it's ~20,000 lines of
> code).  We have our search broken down by our sites (e.g.
> reference.wolfram.com is one index and mathworld is another), which are
> crawled separately, so a lot of our custom functionality is written in
> light of that, particularly scoring.  Because it's custom code for a single
> purpose, a lot of the code is also there to curate the data going into the
> index (custom parsers for a particular site to remove navigation elements,
> for instance).  The most (only, really) interesting thing that I've done
> with it is tracking wiki changes outside of the primary crawl database (I
> keep my own database of page modification times) and creating custom fetch
> lists, so that our wiki can be crawled nightly, as it's rather massive and
> hosted on a shared machine that can't support an intensive crawl every
> night.  I've also re-created the Lucene index plugin as part of our plugin,
> as we don't use Solr, but our own search application.
>
> I'm working now on creating a comprehensive link-graph of all links for a
> particular crawl configuration, while still only crawling the correct URLs,
> so that we can experiment with using various page scoring algorithms.  This
> is why I wanted to not filter the links in the parse stage, so now I can
> have a crawldb with entries from anywhere on the internet while still only
> crawling a particular subdomain.
>
> I'm not sure what the standard use case is for Nutch, but I think we're
> probably a bit outside of it, but only a bit.
>
> Thanks,
> Eddie
>
>
>
>
> On 1/17/2012 1:22 PM, Julien Nioche wrote:
>
> Hi Eddie,
>
> Great to hear that! Just to add to what Markus said there are also quite a
> few tasks to do on the NutchGora branch if that's something you'd be
> interested in. Or outside the tasks on JIRA, there is always a fair bit to
> do on the Wiki e.g. how to run in distributed mode etc...
>
> Just out of curiosity, could you tell us a bit about what you've been
> using Nutch for at Wolfram Research?
>
> Thanks for volunteering
>
> Julien
>
> On 17 January 2012 19:15, Markus Jelsma <[email protected]> wrote:
>
>> Hi!
>>
>> Excellent! You may want to check the list of issues for 1.5. There are several
>> issues being worked on from time to time and a number of open issues and even
>> a few hairy problems. Contribution as patch or comment on any issue is always
>> appreciated. You can also create issues to solve problems yourself as you did
>> with the parser filters issue.
>>
>> Anything is welcome!
>>
>> Cheers,
>>
>> > Hello all,
>> >
>> > I've got a bunch of spare time coming up in the next several
>> > weeks/months and would like to volunteer to help the project out.  I'm
>> > already extremely familiar with the internals of Nutch, as I've been
>> > hacking at it for our internal use here (at Wolfram Research) for the
>> > last ~1.5 years or so.  While there's probably a fair amount of code
>> > that I haven't read, I've at least visited and read some of all of the
>> > areas of Nutch's core and most of the plugins.
>> >
>> > I think I should put that knowledge to good use and contribute back
>> > (I've already sent some patches in, but nothing major or really even
>> > that significant), but I'm not sure what needs to be done or where my
>> > time would be best spent.  I just subscribed to this list, so if there's
>> > a thread discussing priorities that's current and whatnot, can someone
>> > point me to it in the archives?  Barring that, can someone point me in
>> > the direction where I should be looking to contribute?  My best guess is
>> > to just start attacking JIRA tickets...
>> >
>> > Thanks,
>> > Eddie
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
>
>


-- 
*Lewis*
