[jira] [Commented] (NUTCH-1832) Make Nutch work without an indexer

Chris A. Mattmann (JIRA) Thu, 04 Sep 2014 09:34:50 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121530#comment-14121530
 ]


Chris A. Mattmann commented on NUTCH-1832:
------------------------------------------

Hi Julien:

OK, here are a couple of more in-depth responses:

bq. I was pointing at the Crawl class to show that indexing has always been 
part of the behaviour in response to your statement that Nutch did not require 
indexing before.

I am finally at my computer instead of on my phone this morning, so I had some 
time to research this. I'm not sure why you were pointing at GitHub, since the 
canonical bits for Apache Nutch are in SVN. The statement that the Crawl.java 
class you pointed at hasn't been touched since Doug in 2005 is totally wrong. 
For example, look at the 1.7 release of Nutch; here is the log for the tag:

http://svn.apache.org/viewvc/nutch/tags/release-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=log

Note that the file has been changed quite a few times, even as recently as 1.7. 
This makes sense, since the way it used to work was the following:

{noformat}
[chipotle:~/tmp/apache-nutch-1.7] mattmann% ./bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
[chipotle:~/tmp/apache-nutch-1.7] mattmann% 
{noformat}

Although the usage string doesn't bracket it, the -solr parameter was optional. 
I know this b/c I've been teaching Nutch in my search engines class for many 
years now, and we have a particular assignment in which we only use Nutch to 
download the FBI's Vault database. There was, I repeat, no indexing required as 
part of that assignment. We used indexing later. Right now, Nutch 1.9 breaks my 
assignment. Are you saying you don't care? Or that that's my problem? If so, 
that's the wrong message to send to the people teaching their students about 
Nutch.

You will note on line 96 in that file the following code block:
http://svn.apache.org/viewvc/nutch/tags/release-1.7/src/java/org/apache/nutch/crawl/Crawl.java?revision=1495187&view=markup

{code:java}
            if (solrUrl == null) {
              LOG.warn("solrUrl is not set, indexing will be skipped...");
            }
            else {
                // for simplicity assume that SOLR is used 
                // and pass its URL via conf 
                getConf().set("solr.server.url", solrUrl);
            }
{code}

This means that if the solrUrl was not provided, indexing was skipped. Before 
NUTCH-1832, 1.9 as shipped (and 1.10-trunk) *regressed* and broke backwards 
compatibility for users like me and my students who were expecting this. We 
also have large projects from DARPA and NASA now in the midst of using Nutch 
and building platforms on it. I have stated repeatedly to these users 
(incorrectly as of 1.9, but correctly again after NUTCH-1832) that they can use 
Nutch as powerful crawling/caching/mirroring software, without having an 
indexer up and running. Don't get me wrong. I value indexing. We do it all the 
time. There is just something to be said for having a Nutch that works out of 
the box (like it used to) without indexing. That functionality is what I was 
debating and what I would like restored.
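
The solrUrl check quoted above can be sketched in isolation. This is a minimal, simplified illustration and not the actual Crawl.java: the class name, the method name and the step labels are hypothetical, and the only point it makes is that the index step is added when a Solr URL is supplied and skipped otherwise.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real org.apache.nutch.crawl.Crawl class.
class OptionalIndexingSketch {

    // Returns the names of the steps a crawl run would execute.
    // Step names are illustrative labels, not Nutch API calls.
    static List<String> planSteps(String solrUrl) {
        List<String> steps = new ArrayList<>();
        steps.add("inject");
        steps.add("generate");
        steps.add("fetch");
        steps.add("parse");
        steps.add("updatedb");
        steps.add("invertlinks");
        if (solrUrl == null) {
            // Same decision as the 1.7 snippet: warn and carry on without indexing.
            System.err.println("solrUrl is not set, indexing will be skipped...");
        } else {
            // Indexing only happens when a Solr URL was actually provided.
            steps.add("index");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(planSteps(null));
        System.out.println(planSteps("http://localhost:8983/solr"));
    }
}
```

The crawl-only case is the one that matters here: with no Solr URL, every non-indexing step still runs.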

I did some research on this. Looks like in NUTCH-1621, you removed this class 
(jnioche):

{noformat}
[chipotle:~/tmp/apache-nutch-1.7] mattmann% svn log 
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/ | more
Redirecting to URL 
'http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl':
------------------------------------------------------------------------
r1619942 | snagel | 2014-08-22 15:23:27 -0700 (Fri, 22 Aug 2014) | 1 line

NUTCH-1693 TextMD5Signature computed on textual content
------------------------------------------------------------------------
r1619934 | snagel | 2014-08-22 14:23:32 -0700 (Fri, 22 Aug 2014) | 1 line

NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval, 
generate.max.per.host.by.ip
------------------------------------------------------------------------
r1610631 | jnioche | 2014-07-15 02:34:38 -0700 (Tue, 15 Jul 2014) | 1 line

NUTCH-1422 Bypass signature comparison when a document is redirected (snagel)
------------------------------------------------------------------------
r1608431 | jnioche | 2014-07-07 05:38:23 -0700 (Mon, 07 Jul 2014) | 1 line

NUTCH-578 URL fetched with 403 is generated over and over again
------------------------------------------------------------------------
r1605204 | snagel | 2014-06-24 14:41:28 -0700 (Tue, 24 Jun 2014) | 1 line

NUTCH-1787 update and complete API doc overview page
------------------------------------------------------------------------
r1597556 | markus | 2014-05-26 03:47:11 -0700 (Mon, 26 May 2014) | 2 lines

NUTCH-1786 CrawlDb should follow db.url.normalizers and db.url.filters

------------------------------------------------------------------------
r1595137 | jnioche | 2014-05-16 00:59:05 -0700 (Fri, 16 May 2014) | 1 line

NUTCH-1772 Injector does not need merging if no pre-existing crawldb
------------------------------------------------------------------------
r1593901 | jnioche | 2014-05-12 00:59:01 -0700 (Mon, 12 May 2014) | 1 line

NUTCH-1766
------------------------------------------------------------------------
r1590315 | snagel | 2014-04-26 15:12:46 -0700 (Sat, 26 Apr 2014) | 1 line

NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.) 
given
------------------------------------------------------------------------
r1575350 | snagel | 2014-03-07 10:13:20 -0800 (Fri, 07 Mar 2014) | 1 line

removed HostDB from Nutch 1.8 trunk: fix build, remove HostDb related entries 
from change log
------------------------------------------------------------------------
r1560316 | tejasp | 2014-01-22 03:25:25 -0800 (Wed, 22 Jan 2014) | 1 line

NUTCH-1325 HostDB for Nutch
------------------------------------------------------------------------
r1559657 | markus | 2014-01-20 01:29:42 -0800 (Mon, 20 Jan 2014) | 1 line

NUTCH-1680 CrawlDbReader to dump minRetry value
------------------------------------------------------------------------
r1554883 | tejasp | 2014-01-02 11:40:18 -0800 (Thu, 02 Jan 2014) | 1 line

NUTCH-1670 set same crawldb directory in mergedb parameter
------------------------------------------------------------------------
r1541917 | jnioche | 2013-11-14 06:36:12 -0800 (Thu, 14 Nov 2013) | 1 line

Giving Cleaning and Deduplication jobs a name to display
------------------------------------------------------------------------
r1541885 | jnioche | 2013-11-14 04:11:36 -0800 (Thu, 14 Nov 2013) | 1 line

Removed all in one Crawl class (NUTCH-1621)
------------------------------------------------------------------------
r1541883 | jnioche | 2013-11-14 03:55:33 -0800 (Thu, 14 Nov 2013) | 1 line
{noformat}

Looking at NUTCH-1621 (and the issue that it referenced, NUTCH-1087), it looks 
like this change first appeared in Nutch 1.8. I haven't used 1.8 in my class; 
the last release we used was, I think, 1.7, which is why this change matters to 
me. You'll also note that I missed out on the discussion on those issues. Looks 
like they were discussed, etc. So that's my bad. At the same time, I feel that 
those issues introduced a backwards incompatibility that I believe NUTCH-1832 
can address. 

I'm happy to re-enable the indexing plugin. I actually messed up: it wasn't my 
intention to make it so that indexing *didn't work* out of the box. I just 
wanted it to *not be required*. So my bad on that. I've gone ahead and fixed 
that in trunk:

{noformat}
[mattmann-0420740:~/src/nutch] mattmann% svn diff
Index: conf/nutch-default.xml
===================================================================
--- conf/nutch-default.xml      (revision 1622354)
+++ conf/nutch-default.xml      (working copy)
@@ -1042,7 +1042,7 @@
 
 <property>
   <name>plugin.includes</name>
-  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
[mattmann-0420740:~/src/nutch] mattmann% svn commit -m "Re-enable the 
indexer-solr plugin by default (NUTCH-1832 still works.)"
Sending        conf/nutch-default.xml
Transmitting file data .
Committed revision 1622510.
[mattmann-0420740:~/src/nutch] mattmann% 
{noformat}
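
To see why the one-token change in the diff above matters, here is a small sketch of how the two plugin.includes values treat the indexer-solr plugin. It assumes plugin directory names are matched in full against the expression, as the property description suggests. The class and method names are hypothetical and this is not Nutch's own plugin loading code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's plugin loader.
class PluginIncludesSketch {

    // plugin.includes value before the commit (indexer-solr absent).
    static final String BEFORE =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // plugin.includes value after the commit (indexer-solr re-enabled).
    static final String AFTER =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Assumption: a plugin is included only if its whole name matches the regex.
    static boolean isIncluded(String includes, String pluginName) {
        return Pattern.compile(includes).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        System.out.println("before: " + isIncluded(BEFORE, "indexer-solr"));
        System.out.println("after:  " + isIncluded(AFTER, "indexer-solr"));
        System.out.println("index-basic (after): " + isIncluded(AFTER, "index-basic"));
    }
}
```

Note that index-(basic|anchor) does not cover indexer-solr under full matching, which is why the plugin needs its own alternative in the expression.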

As to the tone of the conversation: as I stated to you on Facebook, I believe 
that you have in fact been in disagreement with the things I've been suggesting 
for a while, the last two of them at least, but potentially beyond that. I 
also perceived your comments (just as you perceived mine) as aggressive. 
Ultimately, I respect you a ton and am happy to be working with you on this 
project (along with everyone else). The goal here is to have enough consensus 
to move forward on issues and to have shared stewardship and development of a 
code base. We should make everyone feel like their contributions and voices are 
being heard. If you perceived my comments as aggressive, I'm sorry about that. 
The fact that you perceived them that way is enough for me to take a step back, 
tell you I'm sorry, and suggest we simply move forward. 




> Make Nutch work without an indexer
> ----------------------------------
>
>                 Key: NUTCH-1832
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1832
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.10
>
>         Attachments: NUTCH-1832.Mattmann.090314.patch.2.txt, 
> NUTCH-1832.Mattmann.090314.patch.txt
>
>
> Nutch used to work out of the box, without requiring an indexing backend. As 
> of 1.9, that's not the case anymore (it's possible even before that). Thanks 
> to [~markus17] for pointing out that this is due to the indexer-solr plugin 
> being enabled by default. We should disable it by default, so that the 
> regression is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
