Alrighty!
I checked out the JIRA and picked up an issue I think I can
contribute to. I'll keep looking for more as well.
I can certainly write documentation if that's a need (when isn't it?),
just someone point me at the areas that need better documentation and
I'll do what I can. You mentioned distributed mode, which is something
I actually can't really document because it's not something we use - our
crawler exists as a single intranet server and probably will for the
foreseeable future. Do I need any special account privileges to edit
wiki pages (username is EdwardDrapkin)?
We use Nutch here to crawl our various intranet sites to build Lucene
indexes for a few search applications that we have (search.wolfram.com,
mathworld, etc.). I've written a rather hefty plugin for it to
accommodate some of the custom functionality we need (I'd guess it's
~20,000 lines of code). We have our search broken down by our sites
(e.g. reference.wolfram.com is one index and mathworld is another),
which are crawled separately, so a lot of our custom functionality is
written in light of that, particularly scoring. Because it's custom
code for a single purpose, a lot of the code is also there to curate the
data going into the index (custom parsers for a particular site to
remove navigation elements, for instance). The most (only, really)
interesting thing that I've done with it is tracking wiki changes
outside of the primary crawl database (I keep my own database of page
modification times) and creating custom fetch lists, so that our wiki
can be crawled nightly, as it's rather massive and hosted on a shared
machine that can't support an intensive crawl every night. I've also
re-created the Lucene index plugin as part of our plugin, since we use
our own search application rather than Solr.
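
As a rough illustration of the custom fetch-list idea (not the actual
plugin code, which is Java inside Nutch), the selection step could look
something like this minimal Python sketch. The two dictionaries and the
function name are hypothetical stand-ins for the separate
modification-time database and the crawl records.

```python
from datetime import datetime


def build_fetch_list(modified_at, crawled_at, limit=None):
    """Return URLs whose recorded change time is newer than their last crawl.

    modified_at: dict of URL -> datetime of the last known page change
                 (hypothetical stand-in for the side database)
    crawled_at:  dict of URL -> datetime of the last successful fetch
    """
    stale = [
        url for url, changed in modified_at.items()
        if url not in crawled_at or changed > crawled_at[url]
    ]
    # Oldest changes first, so pages that have waited longest go first.
    stale.sort(key=lambda url: modified_at[url])
    return stale if limit is None else stale[:limit]


# Made-up example data, for illustration only.
modified_at = {
    "http://wiki.example/A": datetime(2012, 1, 16, 9, 0),
    "http://wiki.example/B": datetime(2012, 1, 10, 9, 0),
    "http://wiki.example/C": datetime(2012, 1, 17, 8, 0),
}
crawled_at = {
    "http://wiki.example/A": datetime(2012, 1, 15, 23, 0),
    "http://wiki.example/B": datetime(2012, 1, 15, 23, 0),
}

fetch_list = build_fetch_list(modified_at, crawled_at)
print(fetch_list)
```

Only pages that changed since their last crawl (or were never crawled)
end up in the list, which keeps the nightly load on a shared host small.
The optional limit is an assumed knob for capping a single night's work.
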
I'm working now on creating a comprehensive link-graph of all links for
a particular crawl configuration, while still only crawling the correct
URLs, so that we can experiment with using various page scoring
algorithms. This is why I wanted to not filter the links in the parse
stage, so now I can have a crawldb with entries from anywhere on the
internet while still only crawling a particular subdomain.
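
The split between "record every link" and "only crawl in-scope URLs"
can be sketched roughly as follows. This is a minimal Python
illustration under my own assumptions, not Nutch code: the function
names, the in-memory graph, and the example hosts are all hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_host):
    """True if the URL's host is allowed_host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def record_links(page_url, outlinks, allowed_host, graph, frontier, seen):
    """Add every outlink to the link graph, but queue only in-scope URLs."""
    for target in outlinks:
        # The graph keeps all edges, including links that leave the site.
        graph.setdefault(page_url, set()).add(target)
        # The crawl frontier only grows with URLs on the allowed host.
        if in_scope(target, allowed_host) and target not in seen:
            seen.add(target)
            frontier.append(target)


# Made-up example data, for illustration only.
graph, frontier, seen = {}, [], set()
record_links(
    "http://reference.example.com/a",
    [
        "http://reference.example.com/b",
        "http://other.example.org/x",
        "http://reference.example.com/b",
    ],
    "reference.example.com",
    graph,
    frontier,
    seen,
)
print(graph)
print(frontier)
```

Here the external link still shows up as an edge for scoring
experiments, while the list of URLs to fetch stays inside one host.
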
I'm not sure what the standard use case is for Nutch; I think we're
probably a bit outside of it, but only a bit.
Thanks,
Eddie
On 1/17/2012 1:22 PM, Julien Nioche wrote:
Hi Eddie,
Great to hear that! Just to add to what Markus said there are also
quite a few tasks to do on the NutchGora branch if that's something
you'd be interested in. Or outside the tasks on JIRA, there is always
a fair bit to do on the Wiki e.g. how to run in distributed mode etc...
Just out of curiosity, could you tell us a bit about what you've been
using Nutch for at Wolfram Research?
Thanks for volunteering
Julien
On 17 January 2012 19:15, Markus Jelsma <[email protected]> wrote:
Hi!
Excellent! You may want to check the list of issues for 1.5. There are
several issues being worked on from time to time and a number of open
issues and even a few hairy problems. Contribution as patch or comment
on any issue is always appreciated. You can also create issues to solve
problems yourself as you did with the parser filters issue.
Anything is welcome!
Cheers,
> Hello all,
>
> I've got a bunch of spare time coming up in the next several
> weeks/months and would like to volunteer to help the project out. I'm
> already extremely familiar with the internals of Nutch, as I've been
> hacking at it for our internal use here (at Wolfram Research) for the
> last ~1.5 years or so. While there's probably a fair amount of code
> that I haven't read, I've at least visited and read some of all of the
> areas of Nutch's core and most of the plugins.
>
> I think I should put that knowledge to good use and contribute back
> (I've already sent some patches in, but nothing major or really even
> that significant), but I'm not sure what needs to be done or where my
> time would be best spent. I just subscribed to this list, so if there's
> a thread discussing priorities that's current and whatnot, can someone
> point me to it in the archives? Barring that, can someone point me in
> the direction where I should be looking to contribute? My best guess is
> to just start attacking JIRA tickets...
>
> Thanks,
> Eddie
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com