[
https://issues.apache.org/jira/browse/NUTCH-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121530#comment-14121530
]
Chris A. Mattmann commented on NUTCH-1832:
------------------------------------------
Hi Julien:
OK, here are a couple of more in-depth responses:
bq. I was pointing at the Crawl class to show that indexing has always been
part of the behaviour in response to your statement that Nutch did not require
indexing before.
I am finally at my computer instead of on my phone this morning, so I had some
time to research this. I'm not sure what's up with you pointing at Github, but
the canonical bits for Apache Nutch are here in SVN. For whatever reason, the
claim that the Crawl.java class you pointed at hasn't been touched since
Doug in 2005 is totally wrong. For example, take the 1.7 release of Nutch and
look at the log for the tag here:
http://svn.apache.org/viewvc/nutch/tags/release-1.7/src/java/org/apache/nutch/crawl/Crawl.java?view=log
Note, that file has been changed quite a few times, even as far as 1.7. This
makes sense, since the way it used to work was the following:
{noformat}
[chipotle:~/tmp/apache-nutch-1.7] mattmann% ./bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
[chipotle:~/tmp/apache-nutch-1.7] mattmann%
{noformat}
You'll note that the -solr parameter was optional. I know this b/c I've been
teaching Nutch in my search engines class for many years now, and we have a
particular assignment in which we only use Nutch to download the FBI's Vault
database. There was, I repeat, no indexing required as part of that assignment.
We used indexing later. Right now, Nutch 1.9 breaks my assignment. Are you
saying you don't care? Or that that's my problem? If so, that's the wrong
message to send the people teaching their students about Nutch.
You will note on line 96 in that file the following code block:
http://svn.apache.org/viewvc/nutch/tags/release-1.7/src/java/org/apache/nutch/crawl/Crawl.java?revision=1495187&view=markup
{code:java}
if (solrUrl == null) {
  LOG.warn("solrUrl is not set, indexing will be skipped...");
}
else {
  // for simplicity assume that SOLR is used
  // and pass its URL via conf
  getConf().set("solr.server.url", solrUrl);
}
{code}
Which means that if the solrUrl was not provided, indexing is skipped. Before
NUTCH-1832, what 1.9 shipped with (and 1.10-trunk) *regressed* and broke backwards
compatibility for users like me and my students who were expecting this. We
also have large projects from DARPA and NASA now in the midst of using Nutch
and building platforms on it. I have stated repeatedly to these users (now
incorrectly, but not after NUTCH-1832) that they can use Nutch as a powerful
crawling/caching/mirroring software, without having an indexer up and running.
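To make the behaviour being described concrete, here is a minimal sketch of a crawl driver that treats indexing as optional, in the spirit of the 1.7 snippet quoted above. This is not Nutch code: the class name, the planSteps helper and the step names are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Nutch code): a crawl driver where indexing is optional. */
public class OptionalIndexingSketch {

    /**
     * Returns the ordered steps one crawl round would run.
     * The step names are illustrative labels, not real Nutch job names.
     */
    static List<String> planSteps(String solrUrl) {
        List<String> steps = new ArrayList<>();
        steps.add("inject");
        steps.add("generate");
        steps.add("fetch");
        steps.add("parse");
        steps.add("updatedb");
        if (solrUrl == null) {
            // Same idea as the 1.7 code quoted above: warn, then skip indexing.
            System.err.println("solrUrl is not set, indexing will be skipped...");
        } else {
            // Only schedule indexing when an indexing backend was supplied.
            steps.add("index");
        }
        return steps;
    }

    public static void main(String[] args) {
        // Crawl-only run: no indexing backend is needed.
        System.out.println(planSteps(null));
        // Crawl plus indexing: the URL here is an assumed example value.
        System.out.println(planSteps("http://localhost:8983/solr"));
    }
}
```

The point of the sketch is only that the crawl steps never depend on the indexer being configured; indexing is appended when, and only when, a backend URL is given.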
Don't get me wrong. I value indexing. We do it all the time. There is just
something to be said about having a Nutch that works out of the box (like it
used to) without indexing and that functionality is what I was debating and
what I would like restored.
I did some research on this. Looks like in NUTCH-1621, you removed this class
(jnioche):
{noformat}
[chipotle:~/tmp/apache-nutch-1.7] mattmann% svn log http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/ | more
Redirecting to URL 'http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl':
------------------------------------------------------------------------
r1619942 | snagel | 2014-08-22 15:23:27 -0700 (Fri, 22 Aug 2014) | 1 line
NUTCH-1693 TextMD5Signature computed on textual content
------------------------------------------------------------------------
r1619934 | snagel | 2014-08-22 14:23:32 -0700 (Fri, 22 Aug 2014) | 1 line
NUTCH-1409 remove deprecated properties db.{default,max}.fetch.interval,
generate.max.per.host.by.ip
------------------------------------------------------------------------
r1610631 | jnioche | 2014-07-15 02:34:38 -0700 (Tue, 15 Jul 2014) | 1 line
NUTCH-1422 Bypass signature comparison when a document is redirected (snagel)
------------------------------------------------------------------------
r1608431 | jnioche | 2014-07-07 05:38:23 -0700 (Mon, 07 Jul 2014) | 1 line
NUTCH-578 URL fetched with 403 is generated over and over again
------------------------------------------------------------------------
r1605204 | snagel | 2014-06-24 14:41:28 -0700 (Tue, 24 Jun 2014) | 1 line
NUTCH-1787 update and complete API doc overview page
------------------------------------------------------------------------
r1597556 | markus | 2014-05-26 03:47:11 -0700 (Mon, 26 May 2014) | 2 lines
NUTCH-1786 CrawlDb should follow db.url.normalizers and db.url.filters
------------------------------------------------------------------------
r1595137 | jnioche | 2014-05-16 00:59:05 -0700 (Fri, 16 May 2014) | 1 line
NUTCH-1772 Injector does not need merging if no pre-existing crawldb
------------------------------------------------------------------------
r1593901 | jnioche | 2014-05-12 00:59:01 -0700 (Mon, 12 May 2014) | 1 line
NUTCH-1766
------------------------------------------------------------------------
r1590315 | snagel | 2014-04-26 15:12:46 -0700 (Sat, 26 Apr 2014) | 1 line
NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.)
given
------------------------------------------------------------------------
r1575350 | snagel | 2014-03-07 10:13:20 -0800 (Fri, 07 Mar 2014) | 1 line
removed HostDB from Nutch 1.8 trunk: fix build, remove HostDb related entries
from change log
------------------------------------------------------------------------
r1560316 | tejasp | 2014-01-22 03:25:25 -0800 (Wed, 22 Jan 2014) | 1 line
NUTCH-1325 HostDB for Nutch
------------------------------------------------------------------------
r1559657 | markus | 2014-01-20 01:29:42 -0800 (Mon, 20 Jan 2014) | 1 line
NUTCH-1680 CrawlDbReader to dump minRetry value
------------------------------------------------------------------------
r1554883 | tejasp | 2014-01-02 11:40:18 -0800 (Thu, 02 Jan 2014) | 1 line
NUTCH-1670 set same crawldb directory in mergedb parameter
------------------------------------------------------------------------
r1541917 | jnioche | 2013-11-14 06:36:12 -0800 (Thu, 14 Nov 2013) | 1 line
Giving Cleaning and Deduplication jobs a name to display
------------------------------------------------------------------------
r1541885 | jnioche | 2013-11-14 04:11:36 -0800 (Thu, 14 Nov 2013) | 1 line
Removed all in one Crawl class (NUTCH-1621)
------------------------------------------------------------------------
r1541883 | jnioche | 2013-11-14 03:55:33 -0800 (Thu, 14 Nov 2013) | 1 line
{noformat}
Looking at NUTCH-1621 (and the issue that it referenced, NUTCH-1087), it looks
like the first appearance of this change was in Nutch 1.8. I haven't used 1.8
in my class; the last release we used was, I think, 1.7, which is why this change
matters to me. You'll also note that I missed out on the discussion on those
issues. Looks like they were discussed, etc. So that's my bad. At the same
time, I feel that those issues introduced a backwards incompatibility that I
believe NUTCH-1832 can address.
I'm happy to re-enable the indexing plugin. I actually messed up there: it
wasn't my intention to make it so that indexing *didn't work* out of the box. I
just wanted it to *not be required*. So my bad on that. I've gone ahead and
fixed that in trunk:
{noformat}
[mattmann-0420740:~/src/nutch] mattmann% svn diff
Index: conf/nutch-default.xml
===================================================================
--- conf/nutch-default.xml (revision 1622354)
+++ conf/nutch-default.xml (working copy)
@@ -1042,7 +1042,7 @@
<property>
<name>plugin.includes</name>
-  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
[mattmann-0420740:~/src/nutch] mattmann% svn commit -m "Re-enable the indexer-solr plugin by default (NUTCH-1832 still works.)"
Sending conf/nutch-default.xml
Transmitting file data .
Committed revision 1622510.
[mattmann-0420740:~/src/nutch] mattmann%
{noformat}
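As a quick illustration of what the plugin.includes change in the diff above does, the sketch below checks a few plugin directory names against the old and new values. It assumes a plugin is included only when its whole name matches the expression (full-match semantics); the class and method names are hypothetical and this is not how Nutch itself is wired up.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: which plugin names the old and new plugin.includes values admit. */
public class PluginIncludesSketch {

    // Value before the commit shown above (indexer-solr absent).
    static final String BEFORE =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)";

    // Value after the commit shown above (indexer-solr re-enabled).
    static final String AFTER =
        "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** Assumption: a plugin is included only if its entire name matches the expression. */
    static boolean included(String pluginIncludes, String pluginId) {
        return Pattern.matches(pluginIncludes, pluginId);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"parse-tika", "index-basic", "indexer-solr"}) {
            System.out.println(id
                + " before=" + included(BEFORE, id)
                + " after=" + included(AFTER, id));
        }
    }
}
```

Note that index-(basic|anchor) names the indexing filters, which is a different alternative from indexer-solr, the index writer; only the latter changes between the two values.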
As to the tone of the conversation: as I stated to you on Facebook, I believe
that you have in fact been in disagreement with the things I've been suggesting
for a while - the last two of them at least, but potentially beyond that. I
also perceived your comments (just as you perceived mine) as aggressive.
Ultimately at the end of the day I respect you a ton and am happy to be working
with you on this project (along with everyone else). The goal here is to have
enough consensus to move forward on issues and to have shared stewardship and
development of a code base. We should make everyone feel like their
contributions and voices are being heard. If you perceived my comments as
aggressive, I'm sorry about that - the fact that you perceived them that way is
enough for me to take a step back and tell you I'm sorry and let's simply move
forward.
> Make Nutch work without an indexer
> ----------------------------------
>
> Key: NUTCH-1832
> URL: https://issues.apache.org/jira/browse/NUTCH-1832
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.9
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.10
>
> Attachments: NUTCH-1832.Mattmann.090314.patch.2.txt,
> NUTCH-1832.Mattmann.090314.patch.txt
>
>
> Nutch used to work out of the box, without requiring an indexing backend. As
> of 1.9, that's not the case anymore (it's possible even before that). Thanks
> to [~markus17] for pointing out that this is due to the indexer-solr plugin
> being enabled by default. We should disable it by default, so that the
> regression is removed.