Re: Nutch ML cleanup

2009-03-10 Thread Doug Cutting
ogjunk-nu...@yahoo.com is a member of nutch-...@lists.sourceforge.net 
and nutch-gene...@lists.sourceforge.net.  These lists do not otherwise 
appear to forward to Apache lists.  They used to perhaps forward through 
nutch.org lists, but that domain no longer forwards any email.  Please 
check the message headers to see how this message is routed to you.  If 
it is indeed routed through Apache servers then please send the headers 
to me.


Doug

Andrzej Bialecki wrote:

Otis Gospodnetic wrote:

Hi,

This has been bugging me for a while now.  For some reason Nutch MLs 
get the most junk emails - both rude/rudeish emails, as well as 
clear spam (with SPAM in the subject - something must be detecting 
it). 
I just looked at the headers of the clearly labeled spam messages and 
found that they all seem to come from SF:


 To: nutch-...@lists.sourceforge.net
 To: nutch-gene...@lists.sourceforge.net

I assume there is some kind of a mail forward from the old Nutch MLs 
on SF to the new Nutch MLs at ASF.

Do you think we could remove this forwarding and get rid of this spam?

Sami & Andrzej seem to be members who might be able to make this 
change:


http://sourceforge.net/project/memberlist.php?group_id=59548


Actually, only Doug and Mike Cafarella are admins of that project.

Doug, could you please disable this forwarding?




Re: Plans on releasing another bug fix release?

2007-07-03 Thread Doug Cutting

Will the next release really be 1.0 or will it be 0.10?

Doug

Briggs wrote:

I was just curious to know if there were any plans to release a
maintenance/bug-fix release before 1.0.  I know there have been a slew
of patches and such (it's almost impossible to keep up, unless someone
has a suggestion on how to keep track of these, I may be missing
something), and was wondering when/if these would be applied to the
trunk and labeled as, say, 0.9.1.


Briggs.




Re: JIRA email question

2007-06-27 Thread Doug Cutting
The problem is that nutch-dev (like most Apache mailing lists) sets the 
Reply-to header to be itself, so that responses don't go back to the 
sender.  If you override this when responding (changing the To: line) 
and respond to the sender, then it should end up as a comment, which 
will be then copied to nutch-dev.  But there's unfortunately no way to 
automatically override this.  Thus it's best to click on the link in the 
message and respond directly in Jira.  This is also more reliable. 
Sending messages to Jira doesn't always seem to work correctly.  It 
might be good to disable that sentence suggesting that folks reply to 
the email, but I don't know if that's possible.


Doug

Doğacan Güney wrote:

Hi list,

There is this sentence at the end of every JIRA message:

You can reply to this email to add a comment to the issue online.

But, replying to a JIRA message through nutch-dev doesn't add it as a
comment. So you have to either reply to an email through JIRA (in
which case, it looks like you are responding to an imaginary person:)
or through email (in which case, part of the discussion doesn't get
documented in JIRA). Why doesn't this work?



[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473
 ] 

Doug Cutting commented on NUTCH-479:


Neither.  It would end up as the Lucene query:

+"search phrase" +category:cat1 category:cat2

where category:cat2 is a non-required clause that just impacts ranking, not the 
set of documents returned.

As for nested queries, parsing is only half the problem.  The query filter 
plugins would need to be extended to handle such things, as they presently 
expect flat queries.

The query "foo bar" currently expands to a Lucene query that looks something 
like:

+(anchor:foo title:foo content:foo)
+(anchor:bar title:bar content:bar)
anchor:"foo bar"~10
title:"foo bar"~1000
content:"foo bar"~1000

(The latter three boost scores when terms are nearer.  Anchor proximity is 
limited, to keep from matching anchors from other documents.)

So, how should (foo AND (bar OR baz)) expand?  Probably something like:

+(anchor:foo title:foo content:foo)
+((anchor:bar title:bar content:bar)
(anchor:baz title:baz content:baz))
... proximity boosting clauses?...

And (foo OR (bar AND baz)) might expand to:

(anchor:foo title:foo content:foo)
(+(anchor:bar title:bar content:bar)
 +(anchor:baz title:baz content:baz))
... proximity boosting clauses?...

This expansion is done by the query-basic plugin.
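
For illustration only, a hedged sketch of how such an expansion could be assembled 
with the Lucene 2.x query API (not the actual query-basic code; field boosts and the 
proximity clauses are omitted, and the class name is hypothetical):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;

  public class OrExpansionSketch {

    // Expand one query term across the anchor, title and content fields.
    static BooleanQuery expand(String term) {
      BooleanQuery fields = new BooleanQuery();
      for (String field : new String[] {"anchor", "title", "content"}) {
        fields.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
      }
      return fields;
    }

    // (foo AND (bar OR baz))  ->  +expand(foo) +(expand(bar) expand(baz))
    static BooleanQuery fooAndBarOrBaz() {
      BooleanQuery barOrBaz = new BooleanQuery();
      barOrBaz.add(expand("bar"), BooleanClause.Occur.SHOULD);
      barOrBaz.add(expand("baz"), BooleanClause.Occur.SHOULD);

      BooleanQuery query = new BooleanQuery();
      query.add(expand("foo"), BooleanClause.Occur.MUST);
      query.add(barOrBaz, BooleanClause.Occur.MUST);
      return query;
    }
  }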


 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[Fwd: Nutch 0.9 and Crawl-Delay]

2007-06-04 Thread Doug Cutting
Does the 0.9 crawl-delay implementation actually permit multiple threads 
to access a site simultaneously?


Doug

 Original Message 
Subject: Nutch 0.9 and Crawl-Delay
Date: Sun, 3 Jun 2007 10:50:24 +0200
From: Lutz Zetzsche [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

Dear Nutch developers,

I have had problems with a Nutch based robot during the last 12 hours,
which I have now solved by banning this particular bot from my server
(not Nutch completely for the moment). The ilial bot, which created
considerable load on my server, was using the latest Nutch version -
v0.9 - which is now also supporting the crawl-delay directive in the
robots.txt.

The bot seems to have obeyed the directive - crawl-delay: 10 - as it
visited my website every 15 seconds, which would have been ok, BUT it
then submitted FIVE requests at once (see example log extract below)! 5
requests at once every 15 seconds is not acceptable on my server, which
is principally serving dynamic content and is often visited by up to 10
search engines at the same time, altogether surely creating 99.9% of
the server traffic.

So my suggestion is that Nutch only submits one request at a time when
it detects a crawl-delay directive in the robots.txt. This is the
behaviour the MSNbot shows, for example. The MSNbot also liked to
submit several requests at once every few seconds, until I added the
crawl-delay directive to my robots.txt.
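
Just to make the suggestion concrete, a minimal sketch of the kind of per-host gate
that would serialize requests and honor the parsed crawl-delay (a hypothetical
helper, not Nutch's actual fetcher code):

  import java.util.HashMap;
  import java.util.Map;

  public class HostGate {
    // Earliest time (ms) at which the next request to each host may start.
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    // Block until the host's crawl delay has elapsed, then claim the next slot.
    public synchronized void waitForSlot(String host, long crawlDelayMs)
        throws InterruptedException {
      while (true) {
        long now = System.currentTimeMillis();
        Long next = nextAllowed.get(host);
        if (next == null || now >= next.longValue()) {
          nextAllowed.put(host, Long.valueOf(now + crawlDelayMs));
          return;
        }
        wait(next.longValue() - now);   // re-check after sleeping
      }
    }
  }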


Best wishes

Lutz Zetzsche
http://www.sea-rescue.de/



72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Photos+%26+Videos/PV/ HTTP/1.0 200
13661 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Links/WRGL/Countries/ HTTP/1.0 200
15048 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet
startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/
HTTP/1.0 200 60041 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
based Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
66.249.72.244 - - [03/Jun/2007:04:40:55
+0200] GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/
HTTP/1.1 200 17568 - Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)
66.231.189.119 - - [03/Jun/2007:04:40:55
+0200] GET 
/english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/ 


HTTP/1.0 200 17193 - Gigabot/2.0
(http://www.gigablast.com/spider.html)
74.6.86.105 - - [03/Jun/2007:04:40:56
+0200] GET /dansk/Links/Hermann+Apelt/ HTTP/1.0 200
30496 - Mozilla/5.0 (compatible; Yahoo! Slurp;
http://help.yahoo.com/help/us/ysearch/slurp)
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/
HTTP/1.0 200 16658 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles
based Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])
72.44.58.191 - - [03/Jun/2007:04:40:53
+0200] GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0
200 15624 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based
Internet startup company. For more information please visit
http://www.ilial.com/crawler; http://www.ilial.com/crawler;
[EMAIL PROTECTED])


[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500822
 ] 

Doug Cutting commented on NUTCH-392:


Anchors, explain, and the cache are used relatively infrequently, considerably 
less than once per query, and hence *much* less than once per displayed hit.  
So it might be acceptable if they're somewhat slower.  Block compression should 
still be fast-enough for interactive use, and these uses would never dominate 
CPU use in an application, would they?

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: https://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.
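
 For illustration, a hedged sketch of the kind of change meant here, assuming a 
 SequenceFile.createWriter overload that accepts a Progressable (added by HADOOP-636); 
 key/value classes and the compression type are arbitrary placeholders:

   import java.io.IOException;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.SequenceFile;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.util.Progressable;

   public class ProgressAwareWriterSketch {
     // Thread the Progressable handed to getRecordWriter() through to the
     // SequenceFile writer, so slow block writes still report progress.
     public SequenceFile.Writer open(FileSystem fs, Configuration conf, Path out,
                                     Progressable progress) throws IOException {
       return SequenceFile.createWriter(fs, conf, out, Text.class, Text.class,
           SequenceFile.CompressionType.BLOCK, progress);
     }
   }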

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: proposal for committer

2007-05-29 Thread Doug Cutting
Personnel discussions are conducted on the PMC's private mailing list. 
I have forwarded your message there.


Thanks for the suggestion!

Doug

Gal Nitzan wrote:

Hi,

Since I'm no committer I can't really propose :-) but I just thought to draw 
some attention to the great work done on the dev/users lists and also the many 
patches created by Doğacan Güney...


Just my 2 cents...

Gal.






Re: NUTCH-348 and Nutch-0.7.2

2007-05-24 Thread Doug Cutting

karthik085 wrote:

How do you find when a revision was released?


Look at the tags in subversion:

http://svn.apache.org/viewvc/lucene/nutch/tags/

Doug


Re: ApacheCon in Amsterdam

2007-04-23 Thread Doug Cutting

Tom White wrote:

I will be there too.


Unfortunately I won't be able to attend after all.  The new baby in the 
house won't let me!


Doug



Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Doug Cutting

Arun Kaundal wrote:

Actually Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided.


Have you submitted patches that have been ignored or rejected?

Each Nutch contributor indeed does what he or she decides.  Nutch is not 
a service organization that implements every feature that someone 
requests.  It is a collaborative project of volunteers.  Each 
contributor adds things they need, and others share the benefits.



I am waiting for a really stable product with
incremental indexing, which detects and adds/removes pages as soon as they are
added/removed. But they don't want to do this, I don't know why?


Perhaps because this is difficult, especially while still supporting 
large crawls.  But if others don't want to implement this, I encourage 
you to try to implement it, and, if you succeed, contribute it back to 
the project.  That's the way Nutch grows.



What is
their mission? If we join together to implement this, it would be
better. I

can work on this as a weekend project.
Ping me if you want.


You can of course fork Nutch, or start a new project from scratch.  But 
you ought to also consider submitting patches to Nutch, working with 
other contributors to solve your problems here before abandoning 
Nutch in favor of another project.


Cheers,

Doug


Re: Image Search Engine Input

2007-03-29 Thread Doug Cutting

Steve Severance wrote:

I am not looking to really make an image retrieval engine. During indexing 
referencing docs will be analyzed and text content will be associated with the 
image. Currently I want to keep this in a separate index. So despite the fact 
that images will be returned the search will be against text data.


So do you just want to be able to reference the cached images?  In that 
case, I think the images should stay in the content directory and be 
accessed like cached pages.  The parse should just contain enough 
metadata to index so that the images can be located in the cache.  I 
don't see a reason to keep this in a separate index, but perhaps a 
separate field instead?  Then when displaying hits you can look up 
associated images and display them too.  Does that work?


Steve Severance wrote:

I like Mathijs's suggestion about using a DB for holding thumbnails. I just 
want access to be in constant time since I am going to probably need to grab at 
least 10 and maybe 50 for each query. That can be kept in the plugin as an 
option or something like that. Does that have any ramifications for being run 
on Hadoop?


I'm not sure how a database solves scalability issues.  It seems to me 
that thumbnails should be handled similarly to summaries.  They should 
be retrieved in parallel from segment data in a separate pass once the 
final set of hits to be displayed has been determined.  Thumbnails could 
be placed in a directory per segment as a separate mapreduce pass.  I 
don't see this as a parser issue, although perhaps it could be 
piggybacked on that mapreduce pass, which also processes content.


Doug


Re: svn commit: r516643 - in /lucene/nutch/trunk/src/plugin/parse-html/src: java/org/apache/nutch/parse/html/DOMContentUtils.java test/org/apache/nutch/parse/html/TestDOMContentUtils.java

2007-03-20 Thread Doug Cutting

[EMAIL PROTECTED] wrote:
[ ... ]

-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with

[ ... ]

+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with


This kind of thing is very unfortunate, since it makes it very difficult 
to figure out when particular lines were changed.  I recommend always 
previewing commits with something like 'svn diff | less' before 
committing so that you can be sure to *only* commit changes that you 
intend.  If your development environment does not permit you to preview 
the commit then please run subversion from the shell.


Doug


[jira] Commented: (NUTCH-455) dedup on tokenized fields is faulty

2007-03-07 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478854
 ] 

Doug Cutting commented on NUTCH-455:


Alternately, we could define it as an error to attempt to dedup by a tokenized 
field.  That's the (undocumented) expectation of FieldCache.  Using documents 
to populate a FieldCache for tokenized fields is very slow.  It's better to add 
an untokenized version and use that, no?  If you agree, then the more 
appropriate fix is to document the restriction and try to check for it at 
runtime.
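
A hedged sketch of what adding an untokenized copy could look like with the Lucene 2.x 
field API ("url_dedup" is a hypothetical field name, not something Nutch defines):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class DedupFieldSketch {
    // Keep the tokenized field for searching, and add an untokenized copy so
    // that FieldCache-based dedup sees one whole value per document.
    static void addUrlFields(Document doc, String url) {
      doc.add(new Field("url", url, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("url_dedup", url, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
  }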

 dedup on tokenized fields is faulty
 ---

 Key: NUTCH-455
 URL: https://issues.apache.org/jira/browse/NUTCH-455
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Fix For: 0.9.0

 Attachments: IndexSearcherCacheWarm.patch


 (From LUCENE-252) 
 Nutch uses several index servers, and the search results from these servers 
 are merged using a dedup field for deleting duplicates. The values from 
 this field are cached by Lucene's FieldCacheImpl. The default is the site 
 field, which is indexed and tokenized. However, for a tokenized field (for 
 example url in nutch), FieldCacheImpl returns an array of terms rather than an 
 array of field values, so dedup'ing becomes faulty. The current FieldCache 
 implementation does not respect tokenized fields, and as described above 
 caches only terms. 
 So in the situation where we are searching using url as the dedup field, 
 when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of 
 the url (such as "www" or "com") rather than the whole url. This prevents 
 using tokenized fields as the dedup field. 
 I have written a patch for lucene and attached it to 
 http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the 
 aforementioned issue about tokenized field caching. However, building such a 
 cache for about 1.5M documents takes 20+ secs. The code in 
 IndexSearcher.translateHits() starts with
 if (dedupField != null) 
   dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
 and for the first call of search in IndexSearcher, the cache is built. 
 Long story short, I have written a patch against IndexSearcher, which in its 
 constructor warms up the caches of the wanted fields (configurable). I think we 
 should vote for LUCENE-252, and then commit the above patch with the latest 
 version of lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Issues pending before 0.9 release

2007-03-06 Thread Doug Cutting

Sami Siren wrote:

It would be more beneficial to everybody if the discussions (related to
a release or to Nutch) were
done in public (hey, this is open source!). The off-list stuff IMO
smells.


+1  Folks sometimes wish to discuss project matters off-list to spare 
others the boring details, but this is usually a bad idea.  All project 
decisions should be made in public on this list.  Discussions relevant 
to these decisions are also thus best made on this list, since they 
explain the decision.  Private discussions are permissible to develop a 
proposal, but that is usually better done on-list when possible, so that 
others can get involved earlier.


(The one notable exception is that personnel issues are discussed on the 
private PMC list.)


Doug


Re: FW: Nutch release process help

2007-03-06 Thread Doug Cutting

Chris Mattmann wrote:

It's too bad that
this has turned out to be an issue that I've handled incorrectly, and for
that, I apologize.


Sorry if I blew this out of proportion.  We all help each other run this 
project.  I don't think any grave error was made.  I just saw an 
opportunity to remind folks to try to keep project discussions public, 
and did not mean to rebuke you.


I am thrilled that you want to take on the responsibility of making a 
release.  I very much do not want to dampen your enthusiasm for that.


As you probably know, the release documentation is at:

http://wiki.apache.org/nutch/Release_HOWTO

This may need to be updated.  You might also look at the release 
documentation for other projects, to get ideas.


http://wiki.apache.org/lucene-hadoop/HowToRelease
http://wiki.apache.org/solr/HowToRelease
http://wiki.apache.org/jakarta-lucene/ReleaseTodo

Cheers,

Doug


Re: Nutch JSF front-end code submission - Please advice next steps?

2007-02-28 Thread Doug Cutting

Zaheed Haque wrote:

It's been about a month that I've been trying to find the time to make the
necessary changes so that I could submit the code. Due to an enormous
amount of work load I am unable to find the time. I am not sure how
I should proceed; I have personally tried to contact some of you off
list (those who I thought might be interested, as they discuss more web
apps related issues on the list). But it seems like everyone is busy. So
I am making my last effort here. I would love someone to do something
with the code rather than let it become obsolete.


For a start, please attach it to an issue in Jira, as-is, so that it is 
not lost.


Doug


[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243
 ] 

Doug Cutting commented on NUTCH-445:


Note that the site field is also used for search-time deduplication, and that 
assumes that each document has only one value for the field (returned from a 
Lucene FieldCache with raw hits, for performance).  So this feature should 
perhaps use a separate field.

That said, I think this should replace the current site-search feature, as it 
is an improvement and the industry-standard semantics.  So perhaps a site: 
query should search the domain: field?

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: index_query_domain_v1.0.patch, 
 index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch


 Hostnames contain information about the domain of the host, and all of its 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
  * <br> For http://lucene.apache.org/nutch/ the 
  * following will be added to the index: <br> 
  * <ul> 
  * <li>lucene.apache.org</li> 
  * <li>apache.org</li> 
  * <li>org</li> 
  * </ul> 
  * All hostnames are domain names, but not all domain names are 
  * hostnames. In the above example the hostname lucene is a 
  * subdomain of apache.org, which is itself a subdomain of 
  * org. <br> 
  * 
  
 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin 
 allows searching in the site field. However, site:apache.org will not return 
 http://lucene.apache.org
  By indexing the domain, we are able to search domains. Unlike 
  searching the site field (indexed by BasicIndexingFilter), searching the 
  domain field allows us to retrieve lucene.apache.org for the query 
  apache.org. 
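
 A hedged sketch of that super-domain expansion (a hypothetical helper, not the 
 actual DomainIndexingFilter code):

   import java.util.ArrayList;
   import java.util.List;

   public class SuperDomainsSketch {
     // "lucene.apache.org" -> ["lucene.apache.org", "apache.org", "org"]
     static List<String> superDomains(String host) {
       List<String> domains = new ArrayList<String>();
       String rest = host;
       while (rest.length() > 0) {
         domains.add(rest);
         int dot = rest.indexOf('.');
         if (dot < 0) break;
         rest = rest.substring(dot + 1);
       }
       return domains;
     }
   }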
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Performance optimization for Nutch index / query

2007-02-23 Thread Doug Cutting

Andrzej Bialecki wrote:
The degree of simplification is very substantial. Our NutchSuperQuery 
doesn't have to do much more work than a simple TermQuery, so we can 
assume that the cost to run it is the same as TermQuery times some 
constant. What we gain then is the cost of not running all those boolean 
clauses ...


The NutchSuperQuery would have to do more work to boost things, and 
since postings would be longer, and would also compress more 
poorly, there'd probably be some improvement but it wouldn't be 
quite as fast as a single-term query.


If you're still with me at this point I must congratulate you. :) 
However, that's as far as I thought it through for now - let the 
discussion start! If you are a Lucene hacker I would gladly welcome your 
review or even code contributions .. ;)


An implementation to consider is payloads.  If each posting has a weight 
attached, then the fieldBoost*fieldNorm could be stored there, and a 
simple gap-based method could be used to inhibit cross-field matches. 
Queries would look similar to your proposed approach.


http://www.gossamer-threads.com/lists/lucene/java-dev/37409

One might optimize the payload implementation with run-length 
compression: if a run of postings have the same payload it could be 
represented once at the start of the run along with the run's length. 
That would keep postings small, reducing i/o.
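
A rough, generic sketch of the run-length idea (not tied to Lucene's actual payload 
encoding):

  import java.util.ArrayList;
  import java.util.List;

  public class RunLengthSketch {
    // Encode a sequence of per-posting weights as (value, runLength) pairs, so a
    // long run of identical payloads is stored once along with its length.
    static List<int[]> encode(byte[] weights) {
      List<int[]> runs = new ArrayList<int[]>();
      int i = 0;
      while (i < weights.length) {
        int j = i;
        while (j < weights.length && weights[j] == weights[i]) j++;
        runs.add(new int[] { weights[i], j - i });
        i = j;
      }
      return runs;
    }
  }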


Doug



[jira] Assigned: (NUTCH-449) Format of junit output should be configurable

2007-02-23 Thread Doug Cutting (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Cutting reassigned NUTCH-449:
--

Assignee: Doug Cutting

 Format of junit output should be configurable
 -

 Key: NUTCH-449
 URL: https://issues.apache.org/jira/browse/NUTCH-449
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
Reporter: Nigel Daley
 Assigned To: Doug Cutting
Priority: Minor
 Attachments: hudson.patch


 Allow the junit output format to be set by a system property.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



nightly builds moved to hudson

2007-02-23 Thread Doug Cutting

Nutch's nightly builds have been moved to a Hudson server at:

  http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/

I've stopped the old nightly build process and added a redirect from the 
old nightly build distribution directory to this page.


Thanks to Nigel Daley for configuring and maintaining the Hudson server!

Doug


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-13 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472821
 ] 

Doug Cutting commented on NUTCH-443:


 this patch in some places removes the log guards

Most of the log guards are misguided.  Log guards should only be used on DEBUG 
level messages in performance-critical inner loops.  Since INFO is the expected 
log level, a guard on INFO and WARN level messages does not improve performance, 
since these will be shown.  And most DEBUG-level messages are not in 
performance critical code and hence do not need guards.  The guards only make 
the code bigger and thus harder to read and maintain.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



log guards

2007-02-13 Thread Doug Cutting

Doug Cutting (JIRA) wrote:

this patch in some places removes the log guards


Most of the log guards are misguided.  Log guards should only be used on DEBUG 
level messages in performance-critical inner loops.  Since INFO is the expected log 
level, a guard on INFO and WARN level messages does not improve performance, 
since these will be shown.  And most DEBUG-level messages are not in performance 
critical code and hence do not need guards.  The guards only make the code bigger 
and thus harder to read and maintain.


In particular, in all places where we check isWarnEnabled(), 
isFatalEnabled() and isInfoEnabled(), the 'if' should be removed.  All 
calls to isDebugEnabled() should be reviewed, and most should be removed.
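
For illustration, with commons-logging the distinction looks like this (the messages 
are just examples):

  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;

  public class LogGuardExamples {
    private static final Log LOG = LogFactory.getLog(LogGuardExamples.class);

    static void examples(String url, int retries) {
      // Pointless guard: INFO is the expected level, so the check saves nothing.
      if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }

      // Preferred:
      LOG.info("CrawlDb update: done");

      // A guard pays off only for DEBUG messages with costly arguments in hot loops.
      if (LOG.isDebugEnabled()) {
        LOG.debug("fetching " + url + " (retry " + retries + ")");
      }
    }
  }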


These guards were all introduced by a patch some time ago.  I complained 
at the time and it was promised that this would be repaired, but it has 
not yet been.


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Renaud Richardet wrote:
I see. I was thinking that I could index the feed items without having 
to fetch them individually.


Okay, so if Parser#parse returned a Map<String,Parse>, then the URL for 
each parse should be that of its link, since you don't want to fetch 
that separately.  Right?


So now the question is, how much impact would this change to the Parser 
API have on the rest of Nutch?  It would require changes to all Parser 
implementations, to ParseSegment, to ParseUtil, and to Fetcher.  But, 
as far as I can tell, most of these changes look straightforward.
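
A rough sketch of what the changed signature would amount to (hypothetical interface 
name; the real Parser interface and its package layout may differ in detail):

  import java.util.Map;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  public interface MultiParse {
    // Keys are the URLs under which each Parse is indexed; for a feed item this
    // would be the item's link, as discussed above.
    Map<String, Parse> getParse(Content content);
  }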


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Doug Cutting

Chris Mattmann wrote:

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...


A Content would no longer generate a single Parse.  Instead, a Content 
could potentially generate many Parses.  For most types of content, 
e.g., HTML, each Content would still generate a single Parse.  But for 
RSS, a Content might generate multiple Parses, each indexed separately 
and each with a distinct URL.


Another potential application could be processing archives: the parser 
could unpack the archive and each item in it could be indexed separately rather 
than indexing the archive as a whole.  This only makes sense if each 
item has a distinct URL, which it does in RSS, but it might not in an 
archive.  However some archive file formats do contain URLs, like that 
used by the Internet Archive.


http://www.archive.org/web/researcher/ArcFileFormat.php

Does that help?

Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting

Doğacan Güney wrote:

OK, then should I go forward with this and implement something?   This
should be pretty easy,
though I am not sure what to give as keys to a Parse[].

I mean, when getParse returned a single Parse, ParseSegment output them
as <url, Parse>. But, if getParse
returns an array, what will be the key for each element?


Perhaps Parser#parse could return a Map<String,Parse>, where the keys 
are URLs?



Something like <url#i, Parse[i]> may work, but this may cause problems
in dedup (for example,
assume we fetched the same rss feed twice, and indexed them in different
indexes. The two versions' url#0 may be
different items, but since they have the same key, dedup will delete the
older).


If the feed contains unique ids for items, then that can be used to 
qualify the URL.  Otherwise one could use the hash of the link of the item.
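
For example, a hedged sketch of deriving such a key from the item link using a plain 
JDK digest (a hypothetical helper; the choice of hash is incidental):

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;

  public class ItemKeySketch {
    // feedUrl + "#" + hash(itemLink) gives each feed item a stable, unique key.
    static String itemKey(String feedUrl, String itemLink)
        throws NoSuchAlgorithmException {
      byte[] digest = MessageDigest.getInstance("MD5").digest(itemLink.getBytes());
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(Integer.toHexString((b >> 4) & 0xf));
        hex.append(Integer.toHexString(b & 0xf));
      }
      return feedUrl + "#" + hex;
    }
  }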


Since the target of the link must still be indexed separately from the 
item itself, how much use is all this?  If the RSS document is 
considered a single page that changes frequently, and items' links are 
considered ordinary outlinks, isn't much the same effect achieved?


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Doug Cutting

Renaud Richardet wrote:
The use case is that you index RSS feeds, but your users can search each 
feed entry as a single document. Does it make sense?


But each feed item also contains a link whose content will be indexed 
and that's generally a superset of the item.  So should there be two 
urls indexed per item?  In many cases, the best thing to do is to index 
only the linked page, not the feed item at all.  In some (rare?) cases, 
there might be items without a link, whose only content is directly in 
the feed, or where the content in the feed is complementary to that in 
the linked page.  In these cases it might be useful to combine the two 
(the feed item and the linked content), indexing both.  The proposed 
change might permit that.  Is that the case you're concerned about?


Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-05 Thread Doug Cutting

Doğacan Güney wrote:
I think it would make much more sense to change parse plugins to take 
content and return Parse[] instead of Parse.


You're right.  That does make more sense.

Doug


Re: RSS-fecter and index individul-how can i realize this function

2007-02-02 Thread Doug Cutting

Gal Nitzan wrote:

IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, 
is already available in the item element. Each item element represents one 
web resource, and there is no reason to go to the server and re-fetch that resource.


Perhaps ProtocolOutput should change.  The method:

  Content getContent();

could be deprecated and replaced with:

  Content[] getContents();

This would require changes to the indexing pipeline.  I can't think of 
any severe complications, but I haven't looked closely.
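
A hedged sketch of how the old accessor could be kept during such a transition 
(illustrative only, not the actual ProtocolOutput code):

  import org.apache.nutch.protocol.Content;

  public class ProtocolOutputSketch {
    private Content[] contents;   // one entry per extracted item

    /** @deprecated use {@link #getContents()} instead. */
    public Content getContent() {
      return contents.length > 0 ? contents[0] : null;
    }

    public Content[] getContents() {
      return contents;
    }
  }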


Could something like that work?

Doug


Re: i18n in nutch home page is misnomor

2007-01-25 Thread Doug Cutting

Teruhiko Kurosaka wrote:

I suggest i18n be renamed to l10n, short for
localization.


Can you please file an issue in Jira for this?  Ideally you could even 
provide a patch.  The source for the website is in subversion at:


http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site

Forrest is used to generate the site from this.

http://forrest.apache.org/

Doug


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting

Scott Ganyo (JIRA) wrote:

 ... since Hadoop hijacks and reassigns all log formatters (also a bad 
practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...


FYI, Hadoop no longer does this.

Doug


Re: Next Nutch release

2007-01-25 Thread Doug Cutting

Dennis Kubes wrote:

Andrzej Bialecki wrote:
I believe that at this point it's crucial to keep the project 
well-focused (at the moment I think the main focus is on larger 
installations, and not the small ones), and also to make Nutch 
attractive to developers as a reusable search engine component.


I think there are two areas.  One is to keep the focus as you stated 
above.  The other is to provide a path to get more people involved.  If 
no one objects I will continue working on such a path.


Please let me know if I can help in this people area.  I'm currently 
unable to assist with technical Nutch issues on a day-to-day basis, but 
I am still very interested in doing what I can to ensure Nutch's 
long-term vitality as a project.


Cheers,

Doug


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting

Chris Mattmann wrote:

 So, does this render the patch that I wrote obsolete?


It's at least out-of-date and perhaps obsolete.  A quick read of 
Fetcher.java suggests there might be a case where a fatal error is 
logged but the fetcher doesn't exit, in FetcherThread#output().


Doug


Re: Finished How to Become a Nutch Developer

2007-01-23 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Draft version of How to Become a Nutch Developer is on the wiki at:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Please take a look and if you think anything needs to be added, removed,
or changed let me know.


Thanks for taking the time to write this up!  It looks great.

 I hope to meet, as we say it in Texas, y'all in person one day.

Indeed!  Ironically, I met a bunch of Lucene developers for the first 
time in Texas last fall, at ApacheCon.  I hope to attend both ApacheCon 
EU this May in Amsterdam and ApacheCon US this November in Atlanta. 
Maybe I'll meet you there?


Doug


Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting

Andrzej Bialecki wrote:
The workflow is different - I'm not sure about the details, perhaps Doug 
can correct me if I'm wrong ... and yes, it uses JIRA extensively.


1. An issue is created
2. patches are added, removed, commented on, etc...
3. finally, a candidate patch is selected, and the issue is marked 
"Patch Available".


"Patch Available" is code for "the contributor now believes this is 
ready to commit."  Once a patch is in this state, a committer reviews it 
and either commits it or rejects it, changing the state of the issue 
back to "Open".  The set of issues in "Patch Available" thus forms a 
work queue for committers.  We try not to let a patch sit in this state 
for more than a few days.


4. An automated process applies the patch to a temporary copy, and 
checks whether it compiles and passes junit tests.


This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't 
be hard to run this for Nutch on lucene.zones.apache.org, and I think 
Nigel would probably gladly share his scripts.  This step saves 
committers time: if a patch doesn't pass unit tests, or has javadoc 
warnings, etc. this can be identified automatically.


5. A list of patches in this state is available, and committers may pick 
from this list and apply them.
6. An explicit link is made between the issue and the change set 
committed to svn (Is this automated?)


Jira does this based on commit messages.  Any bug ids mentioned in a 
commit message create links from that bug to the revision in subversion. 
 Hadoop commit messages usually start with the bug id, e.g., 
"HADOOP-1234.  Remove a deadlock in the oscillation overthruster."


7. The issue is marked as Resolved, but not closed. I believe issues 
are closed only when a release is made, because issues in state 
resolved make up the Changelog. I believe this is also automated.


Jira will put resolved issues into the release notes regardless of 
whether they're closed.  The reason we close issues on release is to 
keep folks from re-opening them.  We want the release notes to be the 
list of changes in a release, so we don't want folks re-opening issues 
and having new commits made against them, since then the changes related 
to the issue will span multiple releases.  If an issue is closed but 
there's still a problem, a new issue should be created linking to the 
prior issue, so that the new issue can be scheduled and tracked without 
modifying what should be a read-only release.


Doug




Re: Reviving Nutch 0.7

2007-01-22 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Yes, certainly, anything that can be shared and decoupled from pieces that make 
each branch (not SVN/CVS branch) different, should be decoupled.  But I was 
really curious about whether people think this is a valid idea/direction, not 
necessarily immediately how things should be implemented.  In my mind, one 
branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, 
etc.  That's the branch that's in the trunk.  The other branch is a simpler 
branch without all that Hadoop stuff, for folks who need to fetch, index, and 
search a few hundred thousand or a few million or even a few tens of millions 
of pages, and don't need replication, etc. that comes with Hadoop.  That branch 
could be based off of 0.7.  I also know that a lot of people are trying to use 
Nutch to build vertical search engines, so there is also a need for a focused 
fetcher.  Kelvin Tan brought this up a few times, too, I believe.


Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch: 
everything should run fine in a single process by default.  If there are 
bugs in this they should be logged, folks who care should submit 
high-quality, back-compatible, generally useful patches, and committers 
should work to get these patches committed to the trunk.


Second, if there are to be two modes of operation, wouldn't they best be 
developed in a common source tree, so that they share as much as 
possible and diverge as little as possible?  It seems to me that a good 
architecture would be to agree on a common high-level API, then use two 
different runtimes underneath, one to support distributed operation, and 
one to support standalone operation.  Hey!  That's what Hadoop already 
does!  Maybe it's not perfect and someone can propose a better way to 
share maximal amounts of code, but the code split should probably be 
into different classes and packages in a single source tree maintained 
by a single community of developers, not by branching a single source 
tree in revision control and splitting the developers.


Third, part of the problem seems to be that there are too few 
contributors--that the challenges are big and the resources limited. 
Splitting the project will only spread those resources more thinly.


What really is the issue here?  Are good patches languishing?  Are there 
patches that should be committed (meet coding standards, are 
back-compatible, generally useful, etc.) but are not?  A great patch is 
one that a committer can commit with few worries: it includes new 
unit tests, it passes all existing unit tests, it fixes one thing only, 
etc.  Such patches should not have to wait long for commit.  And once 
someone has submitted a few such patches, they should be invited to become 
a committer.


It sounds to me like the problem is that, off-the-shelf, Nutch does not 
yet solve all the problems folks would like it to: e.g., it has never 
done a good job with incremental indexing.  Folks see progress made on 
scalability, but really wish it were making more progress on 
incrementality or something else.  But it's not going to make progress 
on incrementality without someone doing the work.  A fork or a branch 
isn't going to do the work.  I don't see any reason that the work cannot 
be done right now.  It can be done incrementally: e.g., if the web db 
API seems inappropriate for incremental updates, then someone should 
submit a patch that provides an incremental web db API, updating the 
fetcher and indexer to use this.  A design for this on the wiki would be 
a good place to start.


Finally, web crawling, indexing and searching are data-intensive. 
Before long, users will want to index tens or hundreds of millions of 
pages.  Distributed operation is soon required at this scale, and 
batch-mode is an order of magnitude faster.  So be careful before you 
throw those features out: you might want them back soon.


Doug




Re: How to Become a Nutch Developer

2007-01-22 Thread Doug Cutting

Dennis Kubes wrote:
Can you answer the question of how to add developer names to JIRA or if 
that is only for committers?


It's not just for committers, but also for regular contributors.  I have 
added you.  Anyone else?


Doug


Re: Next Nutch release

2007-01-19 Thread Doug Cutting

Stefan Groschupf wrote:
I don't want to start an emotional discussion here, however talking about 
the problem in public might help.


What, specifically, is the problem you perceive?

Doug


Re: Next Nutch release

2007-01-19 Thread Doug Cutting

Dennis Kubes wrote:
I will say that it is difficult for people to understand how to get more 
involved.  I have been working with Nutch and Hadoop for almost a year 
now on a daily basis and only now am I understanding how to contribute 
through jira, etc.  There needs to be more guidance in helping 
developers contribute.  For example: if you want to develop a new piece 
of functionality, you do x, y, and z; here is how to patch your system; if 
you want to develop a patch, then here are the steps.


The closest thing we have currently are the HowToContribute pages:

http://wiki.apache.org/nutch/HowToContribute
http://wiki.apache.org/lucene-hadoop/HowToContribute
http://wiki.apache.org/jakarta-lucene/HowToContribute

These are not great, but they're a start.  Are there parts that are 
confusing?  Do they assume too much?  Are they missing things?  If so, 
please help to update these.


I note that the Nutch version is less evolved than the Lucene and Hadoop 
versions.


Doug



Re: Next Nutch release

2007-01-18 Thread Doug Cutting

Stefan Groschupf wrote:
We run the gui in several production environments with patched hadoop 
code, since this is, from our point of view, the clean approach. 
Everything else feels like a workaround to fix some strange hadoop 
behaviors.


Are there issues in Hadoop's Jira for these?  If so, do they have 
patches attached?  Are they linked to the corresponding issue in Nutch?


Doug


Re: How can I get one plugin's root dir

2007-01-16 Thread Doug Cutting

Andrzej Bialecki wrote:
The reason is that if you pack this file into your job JAR, the job jar 
would become very large (presumably this 40MB is already compressed?). 
The job jar needs to be copied to each tasktracker for each task, so you 
will experience a performance hit just because of the size of the job jar 
... whereas if this file sits on DFS and is highly replicated, its 
content will always be available locally.


Note that the job jar is copied into HDFS with a highish replication 
(10?), and that it is only copied to each tasktracker node once per 
*job*, not per task.  So it's only faster to manage this yourself if you 
have a sequence of jobs that share this data, and if the time to 
re-replicate it per job is significant.


Doug


Re: Brochure for Nutch

2006-12-08 Thread Doug Cutting

The wiki would be a good place for this.

Doug

Peter Landolt wrote:

Hello,

We tried to introduce Nutch at a telecommunication company in Switzerland
as the search engine of their future main search solution. As they were also 
evaluating
commercial products we needed to offer them a brochure to make sure they
understand Nutch as a product with its features, references, etc. This 
document
was mainly created to establish the credibility of Nutch. Attached please 
find the
document.

Now we would like you to proofread and improve the document so that we can publish it on
the web.

Thanks for your feedback and best regards,
Peter



Re: What's the status of Nutch-GUI?

2006-12-08 Thread Doug Cutting

Sami Siren wrote:

Stefan Groschupf wrote:

See:
http://www.find23.net/nutch_guiToHadoop.pdf
Section required hadoop changes.


I guess you are referring to these:

•  LocalJobRunner:
  •  Run as a kind of singleton
  •  Have a kind of jobQueue
  •  Implement JobSubmissionProtocol status-report
 methods
  •  implement a killJob method


Is there an issue in Hadoop's Jira for this?  Is there a patch that 
implements these?  If there is, then I suggest folks vote for the issue.


-how about writing a nutchrunner that just extends the functionality of 
localjobrunner?

-scheduling (jobQueue) could be completely outside of jobrunner?


These also sound like good solutions.  If it is not Nutch-specific, 
then perhaps it could be integrated into Hadoop, so that it is 
maintained as Hadoop evolves.  If that sounds like a good approach, 
please submit a patch to Hadoop with some unit tests.


Cheers,

Doug


[jira] Assigned: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting reassigned NUTCH-392:
--

Assignee: Doug Cutting

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: http://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
 Assigned To: Doug Cutting

 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
---

Attachment: NUTCH-392.patch

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: http://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
 Assigned To: Doug Cutting
 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-392?page=comments#action_12444719 ] 

Doug Cutting commented on NUTCH-392:


This should not be applied until Nutch uses Hadoop 0.8.  It also contains a 
patch required to make Nutch work correctly with Hadoop 0.8 (where 
LocalFileSystem.rename() of a non-existing file now throws an exception).

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: http://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
 Assigned To: Doug Cutting
 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
---

Attachment: (was: NUTCH-392.patch)

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: http://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
 Assigned To: Doug Cutting
 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-392) OutputFormat implementations should pass on Progressable

2006-10-25 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-392?page=all ]

Doug Cutting updated NUTCH-392:
---

Attachment: NUTCH-392.patch

Oops.  Attached the wrong patch.  Here's the right one.

 OutputFormat implementations should pass on Progressable
 

 Key: NUTCH-392
 URL: http://issues.apache.org/jira/browse/NUTCH-392
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Doug Cutting
 Assigned To: Doug Cutting
 Attachments: NUTCH-392.patch


 OutputFormat implementations should pass the Progressable they are passed to 
 underlying SequenceFile implementations.  This will keep reduce tasks from 
 timing out when block writes are slow.  This issue depends on 
 http://issues.apache.org/jira/browse/HADOOP-636.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: email to jira comments (WAS Re: [jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements)

2006-10-16 Thread Doug Cutting

Sami Siren wrote:
Looks like somebody just enabled the email-to-jira-comments feature. I was 
just wondering whether it would be good to use this feature more widely.


I think it would be good.  That way mailing list discussion would be 
logged to the bug as well.


This could be achieved by removing the reply-to header from messages 
coming from jira so that replies get sent to [EMAIL PROTECTED] (I am 
assuming that is possible). So whenever somebody just hits reply

from their email client and writes the comment, it would get automatically
attached to the correct issue as a comment.


I sent a message to [EMAIL PROTECTED] this morning asking about this. 
If it's possible, and no one objects, I will request it for the Nutch 
mailing lists.


Doug


[jira] Resolved: (NUTCH-304) Change JIRA email address for nutch issues from apache incubator

2006-10-03 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-304?page=all ]

Doug Cutting resolved NUTCH-304.


Resolution: Fixed

I just fixed this.  Thanks for noticing!

 Change JIRA email address for nutch issues from apache incubator
 

 Key: NUTCH-304
 URL: http://issues.apache.org/jira/browse/NUTCH-304
 Project: Nutch
  Issue Type: Task
 Environment: Dell Pentium M mobile 1.4 Ghz, 512 MB RAM, although task 
 is independent of environment
Reporter: Chris A. Mattmann
Priority: Minor

 The default email address for Nutch issues in JIRA should be changed from 
 nutch-dev@incubator.apache.org to [EMAIL PROTECTED] Could one of the 
 committers with appropriate jira privileges update the email?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-10-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439682 ] 

Doug Cutting commented on NUTCH-353:


It's worth noting that Google, Yahoo! and Microsoft's searches all return lots 
of links to www-XXX.ibm.com.  Just some evidence that this may not be an easy 
problem to solve.

 pages that serverside forwards will be refetched every time
 ---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Blocker
 Fix For: 0.9.0

 Attachments: doNotRefecthForwarderPagesV1.patch


 Pages that do a serverside forward are not written with a status change back 
 into the crawlDb. Also the nextFetchTime is not changed. 
 This causes a refetch of the same page again and again. The result is that nutch 
 is not polite, refetching the forwarding and target page in each segment 
 iteration. It also affects the scoring, since the forwarding page contributes its 
 score to all outlinks.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Patch Available status?

2006-08-31 Thread Doug Cutting

Chris Mattmann wrote:

  +1. I think that workflow makes a lot of sense. Currently users in the
nutch-developers group can close and resolve issues. In the Hadoop workflow,
would this continue to be the case?


In Hadoop, most developers can resolve but not close.  Only members of a 
separate Jira group (hadoop-admin, a subset of hadoop-developers) are 
permitted to close bugs.  Note that the Jira group hadoop-developers has 
far more members than Hadoop has committers.


But the nutch-developers Jira group pretty closely corresponds to 
Nutch's committers, so perhaps all committers should be permitted to 
close, although this should be exercised with caution, only at releases, 
since closes cannot be undone in this workflow.


Another alternative would be to construct a new workflow that just adds 
the Patch Available status and still permits issues to be re-opened.


Which sounds best for Nutch?

Doug


Re: Patch Available status?

2006-08-30 Thread Doug Cutting

Sami Siren wrote:
I am not able to do it either, or then I just don't know how, can Doug 
help us here?


This requires a change to the project's workflow.  I'd be happy to move 
Nutch to use the workflow we use for Hadoop, which supports Patch 
Available.


This workflow has one other non-default feature, which is that bugs, 
once closed, cannot be re-opened.  This works as follows: Only project 
administrators are allowed to close issues.  Bugs are resolved as 
they're fixed, and only closed when a release is made.  This keeps the 
release notes Jira generates from changing after a release is made.


Would you like me to switch Nutch to use this Jira workflow?

Doug


Re: Error with Hadoop-0.4.0

2006-07-12 Thread Doug Cutting

Sami Siren wrote:

Patch works for me.


OK.  I just committed it.

Thanks!

Doug


Re: Error with Hadoop-0.4.0

2006-07-10 Thread Doug Cutting

Jérôme Charron wrote:

In my environment, the crawl command terminate with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient 
(JobClient.java:submitJob(273))

- Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory
/localpathcrawl/crawldb/current in local is invalid.
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
   at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)


Hadoop 0.4.0 by default requires all input directories to exist, where 
previous releases did not.  So we need to either create an empty 
current directory or change the InputFormat used in 
CrawlDb.createJob() to be one that overrides 
InputFormat.areValidInputDirectories().  The former is probably easier. 
 I've attached a patch.  Does this fix things for folks?


Doug
Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===
--- src/java/org/apache/nutch/crawl/CrawlDb.java	(revision 417882)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java	(working copy)
@@ -65,7 +65,8 @@
 if (LOG.isInfoEnabled()) { LOG.info("CrawlDb update: done"); }
   }
 
-  public static JobConf createJob(Configuration config, Path crawlDb) {
+  public static JobConf createJob(Configuration config, Path crawlDb)
+throws IOException {
 Path newCrawlDb =
   new Path(crawlDb,
Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
@@ -73,7 +74,11 @@
 JobConf job = new NutchJob(config);
 job.setJobName("crawldb " + crawlDb);
 
-job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
+
+Path current = new Path(crawlDb, CrawlDatum.DB_DIR_NAME);
+if (FileSystem.get(job).exists(current)) {
+  job.addInputPath(current);
+}
 job.setInputFormat(SequenceFileInputFormat.class);
 job.setInputKeyClass(UTF8.class);
 job.setInputValueClass(CrawlDatum.class);


[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards

2006-07-07 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-309?page=all ]
 
Doug Cutting reopened NUTCH-309:



I am re-opening this issue, as the guards were added in far too many places.  
Jerome, can you please fix these so that guards are only added when (a) the log 
level is DEBUG or TRACE, (b) it occurs in performance-critical code, and (c) 
the logged string is not constant.

 Uses commons logging Code Guards
 

  Key: NUTCH-309
  URL: http://issues.apache.org/jira/browse/NUTCH-309
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Jerome Charron
 Assignee: Jerome Charron
 Priority: Minor
  Fix For: 0.8-dev


 Code guards are typically used to guard code that only needs to execute in 
 support of logging, that otherwise introduces undesirable runtime overhead in 
 the general case (logging disabled). Examples are multiple parameters, or 
 expressions (e.g. "string" + " more") for parameters. Use the guard methods of 
 the form log.isPriority() to verify that logging should be performed, 
 before incurring the overhead of the logging method call. Yes, the logging 
 methods will perform the same check, but only after resolving parameters.
 (description extracted from 
 http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-312) Fix for upcoming incompatibility with Hadoop-0.4

2006-06-28 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-312?page=all ]
 
Doug Cutting resolved NUTCH-312:


Fix Version: 0.8-dev
 Resolution: Fixed

I just upgraded Nutch to Hadoop 0.4.0, incorporating this patch.  Thanks, 
Milind!

 Fix for upcoming incompatibility with Hadoop-0.4
 

  Key: NUTCH-312
  URL: http://issues.apache.org/jira/browse/NUTCH-312
  Project: Nutch
 Type: Improvement

  Environment: all
 Reporter: Milind Bhandarkar
  Fix For: 0.8-dev
  Attachments: nutch-latest.patch, nutch.patch

 I have submitted a patch to Hadoop fixing tasktracker-latency issues. That 
 patch introduces incompatibility with current nutch code, because the 
 interface for OutputFormat will change. I will soon submit a patch for nutch 
 that will fix this upcoming incompatibility with Hadoop.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: svn commit: r416346 [1/3] - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/clustering/ java/org/apache/nutch/crawl/ java/org/apache/nutch/fetcher/ java/org/apach

2006-06-22 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

NUTCH-309 : Added logging code guards

[ ... ]

+  if (LOG.isWarnEnabled()) {
+    LOG.warn("Line does not contain a field name: " + line);
+  }

[ ...]

-1

I don't think guards should be added everywhere.  They make the code 
bigger and provide little benefit.  Rather, guards should only be added 
in performance critical code, and then only for Debug-level output. 
Info and Warn levels are normally enabled, and developers should 
thus not log messages at these levels so frequently that performance 
will be compromised.  And not all Debug-level log statements need 
guards, only those that are in inner loops, where the construction of 
the log message may significantly affect performance.
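
As a rough illustration of that distinction (a sketch in commons-logging 
style, not code from the patch itself):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class GuardExample {
  private static final Log LOG = LogFactory.getLog(GuardExample.class);

  void fetch(String[] urls) {
    for (String url : urls) {
      // Debug level, inner loop, non-constant message: a guard is worthwhile,
      // since it avoids building the string on every iteration.
      if (LOG.isDebugEnabled()) {
        LOG.debug("fetching " + url);
      }
    }
    // Warn level, executed once: the guard buys nothing, so leave it out.
    LOG.warn("done fetching " + urls.length + " urls");
  }
}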


Doug


IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

2006-06-14 Thread Doug Cutting

http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html


Re: Nutch logging questions

2006-06-09 Thread Doug Cutting

Jérôme Charron wrote:

For now, I have used the same log4 properties than hadoop (see
http://svn.apache.org/viewvc/lucene/hadoop/trunk/conf/log4j.properties?view=markuppathrev=411254 


) for the back-end, and
I was thinking to use the stdout for front-end.
What do you think about this?


We should use console rather than stdout, so that it can be 
distinguished from application output.


http://issues.apache.org/jira/browse/HADOOP-292

Doug


Re: svn commit: r411943 - in /lucene/nutch/trunk/lib: commons-logging-1.0.4.jar hadoop-0.2.1.jar hadoop-0.3.1.jar log4j-1.2.13.jar

2006-06-06 Thread Doug Cutting

Stefan Groschupf wrote:
As far as I understand, Hadoop uses commons logging. Should we switch to use 
commons logging as well?


+1

Doug


[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414114 ] 

Doug Cutting commented on NUTCH-289:


It should be possible to partition by IP and limit fetchlists by IP.  Resolving 
only in the fetcher is too late to implement these features.   Ideally we 
should arrange things for good DNS cache utilization, so that urls with the 
same host are resolved in a single map or reduce task.  Currently this is the 
case during fetchlist generation, where lists are partitioned by host.  Might 
that be a good place to insert DNS resolution?  The fetchlists would need to be 
processed one more time, to re-partition and re-limit by IP, but fetchlists are 
relatively small, so this might not slow things too much.  The map task itself 
could directly cache IP addresses, and perhaps even avoid many DNS lookups by 
using the IP from another CrawlDatum from the same host.  A multi-threaded 
mapper might also be used to allow for network latencies.

This should, at least initially, be an optional feature, and thus the IP should 
probably initially be stored in the metadata.  I think it might be added as a 
re-generate step without changing any other code.
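
A tiny sketch of the kind of per-task cache meant above (a hypothetical 
helper, not existing Nutch code):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

public class CachingResolver {
  private final Map<String, String> cache = new HashMap<String, String>();

  // Resolve each host once per task and reuse the result for every
  // CrawlDatum on the same host.
  public String resolve(String host) {
    String ip = cache.get(host);
    if (ip == null) {
      try {
        ip = InetAddress.getByName(host).getHostAddress();
      } catch (UnknownHostException e) {
        ip = "";   // remember failures too, so we don't retry every URL
      }
      cache.put(host, ip);
    }
    return ip;
  }
}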


 CrawlDatum should store IP address
 --

  Key: NUTCH-289
  URL: http://issues.apache.org/jira/browse/NUTCH-289
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
 Reporter: Doug Cutting


 If the CrawlDatum stored the IP address of the host of its URL, then one 
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-30 Thread Doug Cutting

Ken Krugler wrote:

2. Are the Nutch Devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer 
such as
myself it appears as though webmasters are not getting many replies 
to their

inquiries.



I can speak for myself only .. I'm not tracking that list. What about 
others?


Folks who are running a nutch-based crawler that provides this email 
address as the contact address should subscribe to this list and respond 
to messages, especially those which may have been caused by their 
crawler.  Others are also encouraged to subscribe and help respond to 
messages here, as a bad reputation for the crawler affects the whole 
project.  This list is actually fairly low-volume.


This brings up an issue I've been thinking about. It might make sense to 
require everybody set the user-agent string, versus it having default 
values that point to Nutch.


The first time you run Nutch, it would display an error re the 
user-agent string not being set, but if the instructions for how to do 
this were explicit, this wouldn't be much of a hardship for anybody 
trying it out.


+1

That would be a better solution.

Doug


[jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-05-26 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12413528 ] 

Doug Cutting commented on NUTCH-273:


Redirects should really not be followed immediately anyway.  We should instead 
note that it was redirected and to which URL in the fetcher output.  Then, when 
the crawl db is updated with the fetcher output, the target of the redirect 
should be added, with the full OPIC score of the original URL.  This will 
enable proper politeness guarantees.

It would be nice to still associate the original URL with the content of the 
redirect URL when indexing.  Perhaps a list of URLs that redirected to each 
page could be kept in the CrawlDatum metadata?  Can anyone think of a better 
way to implement this?


 When a page is redirected, the original url is NOT updated.
 ---

  Key: NUTCH-273
  URL: http://issues.apache.org/jira/browse/NUTCH-273
  Project: Nutch
 Type: Bug

   Components: fetcher
 Versions: 0.8-dev
  Environment: n/a
 Reporter: Lukas Vlcek


 [Excerpt from maillist, sender: Andrzej Bialecki]
 When a page is redirected, the original url is NOT updated - so, CrawlDB will 
 never know that a redirect occured, it won't even know that a fetch 
 occured... This looks like a bug.
 In 0.7 this was recorded in the segment, and then it would affect the Page 
 status during updatedb. It should do so 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-289) CrawlDatum should store IP address

2006-05-26 Thread Doug Cutting (JIRA)
CrawlDatum should store IP address
--

 Key: NUTCH-289
 URL: http://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
Type: Bug

  Components: fetcher  
Versions: 0.8-dev
Reporter: Doug Cutting


If the CrawlDatum stored the IP address of the host of its URL, then one could:

- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just hostname.  This 
would be a good way to limit the impact of domain spammers.

The IP addresses could be resolved when a CrawlDatum is first created for a new 
outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413272 ] 

Doug Cutting commented on NUTCH-288:


 Is there a performant way of doing deduplication and knowing for sure how 
 many documents are available to view?

No.  But we should probably handle this situation without throwing exceptions.

For example, look at the following:

http://www.google.com/search?q=emacs+%22doug+cutting%22&start=90

Click on the page 19 link at the bottom.  It takes you to page 16, the last 
page after deduplication.


 hitsPerSite-functionality flawed: problems writing a page-navigation
 --

  Key: NUTCH-288
  URL: http://issues.apache.org/jira/browse/NUTCH-288
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
 to problems when trying to offer a page-navigation (e.g. allow the user to 
 jump to page 10). This is because dedup is done after fetching.
 RSS shows a maximum number of 7763 documents (that is without dedup!), I set 
 it to display 10 items per page. My naive approach was to estimate I have 
 7763/10 = 777 pages. But already when moving to page 3 I got no more 
 searchresults (I guess because of dedup). And when moving to page 10 I  got 
 an exception (see below).
 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
 servlet OpenSearch threw exception
 java.lang.NegativeArraySizeException
 at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
 at 
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 Only workaround I see for the moment: Fetching RSS without duplication, dedup 
 myself and cache the RSS-result to improve performance. But a cleaner 
 solution would imho be nice. Is there a performant way of doing deduplication 
 and knowing for sure how many documents are available to view? For sure this 
 would mean to dedup all search-results first ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-288) hitsPerSite-functionality flawed: problems writing a page-navigation

2006-05-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12413305 ] 

Doug Cutting commented on NUTCH-288:


  Is there a quickfix possible somehow?

Someone needs to fix the OpenSearch servlet.

It looks like just changing line 146 of OpenSearchServlet.java, replacing:

Hit[] show = hits.getHits(start, end-start);

with:

Hit[] show = hits.getHits(start, length > 0 ? length : 0);

Give this a try.

 hitsPerSite-functionality flawed: problems writing a page-navigation
 --

  Key: NUTCH-288
  URL: http://issues.apache.org/jira/browse/NUTCH-288
  Project: Nutch
 Type: Bug

   Components: web gui
 Versions: 0.8-dev
 Reporter: Stefan Neufeind


 The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
 to problems when trying to offer a page-navigation (e.g. allow the user to 
 jump to page 10). This is because dedup is done after fetching.
 RSS shows a maximum number of 7763 documents (that is without dedup!), I set 
 it to display 10 items per page. My naive approach was to estimate I have 
 7763/10 = 777 pages. But already when moving to page 3 I got no more 
 searchresults (I guess because of dedup). And when moving to page 10 I  got 
 an exception (see below).
 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
 servlet OpenSearch threw exception
 java.lang.NegativeArraySizeException
 at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
 at 
 org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 at 
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
 at 
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at 
 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
 at 
 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at 
 org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
 at 
 org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
 at 
 org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
 at 
 org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
 at 
 org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
 at 
 org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
 at 
 org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
 at 
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
 at java.lang.Thread.run(Thread.java:595)
 Only workaround I see for the moment: Fetching RSS without duplication, dedup 
 myself and cache the RSS-result to improve performance. But a cleaner 
 solution would imho be nice. Is there a performant way of doing deduplication 
 and knowing for sure how many documents are available to view? For sure this 
 would mean to dedup all search-results first ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12379116 ] 

Doug Cutting commented on NUTCH-267:


re: it's as if we didn't want it to be re-crawled if we can't find any inlinks 
to it

We prioritize crawling based on the number of pages we've crawled that link to 
it since we've last crawled it.  Assuming it had links to it that caused it to 
be crawled the first time, and that some of those will also be re-crawled, then 
its score will again increase.  But if no one links to it anymore, it will 
languish, and not be crawled again unless there're no higher-scoring pages.  
That sounds right to me, and I think it's what's suggested in the OPIC paper 
(if i skimmed it correctly).

Perhaps it should not be reset to zero, but one, since that's where pages start 
out.

re: why use sqrt(opic) * docSimilarity instead of log(opic * docSimilarity)

Wrapping log() around things changes the score value but not the ranking.  So 
the question is really, why use sqrt(opic)*docSimilarity and not just 
opic*docSimilarity?  The answer is simply that I tried a few queries and sqrt 
seemed to be required for OPIC to not overly dominate scoring.  It was a seat 
of the pants calculation, trying to balance the strength of anchor matches, 
opic scoring and title, url and body matching, etc.  One can disable this by 
changing the score power parameter.
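
To make the effect concrete (an illustration only, not the actual 
OPICScoringFilter code):

public class BoostSketch {
  // With scorePower = 0.5 the OPIC score enters the boost as its square root.
  public static float boost(float opicScore, float scorePower) {
    return (float) Math.pow(opicScore, scorePower);
  }
  public static void main(String[] args) {
    // A 100:1 OPIC ratio becomes only 10:1 after the square root,
    // so text similarity still matters in the final ranking.
    System.out.println(boost(100f, 0.5f) / boost(1f, 0.5f));   // prints 10.0
  }
}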

 Indexer doesn't consider linkdb when calculating boost value
 

  Key: NUTCH-267
  URL: http://issues.apache.org/jira/browse/NUTCH-267
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Chris Schneider
 Priority: Minor


 Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
 indexer.boost.by.link.count was true, the indexer boost value was scaled 
 based on the log of the # of inbound links:
 if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);
 This is no longer true (even before Andrzej implemented scoring filters). 
 Instead, the boost value is just the square root (or some other scorePower) 
 of the page score. Shouldn't the invertlinks command, which creates the 
 linkdb, have some affect on the boost value calculated during indexing 
 (either via the OPICScoringFilter or some other built-in filter)?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting

Jérôme Charron wrote:

This means there's no markup in the OpenSearch output?



Yes, no markup for now.


Doesn't this break any existing application that uses OpenSearch and 
displays summaries in a web browser?  This is an incompatible change 
which we should avoid.



Shouldn't there be?



The restriction on description field is: Can contain simple escaped HTML
markup, such as <b>, <i>, <a>, and <img> elements.
So, ya, why not. We can add <b> around highlights.
What do you and others think?


+1


Perhaps this should be a method on Summary, to render it as html?



I had some hesitations about this while coding 
In fact, as suggested in the issue's comments, I would like to add a 
generic

method on Summary :
String toString(Encoder, Formatter) like in the Lucene's Highlighter and
provide some basic implementations of Encoder and Formatter.


That sounds fine, but in the meantime, let's not reproduce the 
html-specific code in lots of places.  We need it in both search.jsp and 
in OpenSearchServlet.java.  So we should have it in a common place.  A 
method on Summary seems like a good place.  If we subsequently add a 
more general API then we could re-implement the toHtml() method using 
that API, but I think a generic toHtml() method will be useful for quite 
a while yet.
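
A sketch of what such a method might look like (the fragment accessors and 
the Entities.encode helper here are assumptions, not the real Summary API):

public String toHtml() {
  StringBuffer html = new StringBuffer();
  for (int i = 0; i < fragments.length; i++) {     // "fragments": assumed field
    String text = Entities.encode(fragments[i].getText());  // hypothetical encoder
    if (fragments[i].isHighlight()) {
      html.append("<b>").append(text).append("</b>");
    } else {
      html.append(text);
    }
  }
  return html.toString();
}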


Doug


Re: dfs -report

2006-05-10 Thread Doug Cutting

This is a known, fixed, Hadoop bug:

  http://issues.apache.org/jira/browse/HADOOP-201

I'm going to release Hadoop 0.2.1 with this and one other patch as soon 
as Subversion is back up, then upgrade Nutch to use 0.2.1.


Doug

Marko Bauhardt wrote:

Hi all,
i start nutch-0.8-dev (Revision 405738) on distributed filesystem.
If i execute bin/hadoop dfs -report  an exception occurs.

java.lang.RuntimeException: java.lang.IllegalAccessException: Class  
org.apache.hadoop.io.WritableFactories can not access a member of  class 
org.apache.hadoop.dfs.DatanodeInfo with modifiers public
at org.apache.hadoop.io.WritableFactories.newInstance 
(WritableFactories.java:49)
at org.apache.hadoop.io.ObjectWritable.readObject 
(ObjectWritable.java:226)
at org.apache.hadoop.io.ObjectWritable.readObject 
(ObjectWritable.java:163)
at org.apache.hadoop.io.ObjectWritable.readObject 
(ObjectWritable.java:211)
at org.apache.hadoop.io.ObjectWritable.readFields 
(ObjectWritable.java:60)

at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
Caused by: java.lang.IllegalAccessException: Class  
org.apache.hadoop.io.WritableFactories can not access a member of  class 
org.apache.hadoop.dfs.DatanodeInfo with modifiers public

at sun.reflect.Reflection.ensureMemberAccess(Reflection.java: 65)
at java.lang.Class.newInstance0(Class.java:344)
at java.lang.Class.newInstance(Class.java:303)
at org.apache.hadoop.io.WritableFactories.newInstance 
(WritableFactories.java:45)

... 5 more

What i doing wrong?


Marko



Re: svn commit: r405565 - in /lucene/nutch/trunk/src: java/org/apache/nutch/searcher/ test/org/apache/nutch/searcher/ web/jsp/

2006-05-10 Thread Doug Cutting

Jérôme Charron wrote:

Yes Doug, but in fact, the idea is to add the toString(Formatter) method in
a common place (Summary).
And add one specific Formatter implementation for OpenSearch and another 
one

for search.jsp :
The reason is that they should not use the same HTML code :
1. OpenSearch should only use <b> around highlights
2. search.jsp should use some more complicated HTML code (<span> ... )

In fact, I don't know if the Formatter solution is the good one, but the
toString() or toHtml() must be parametrized
since the two pieces of code that use this method should have distinct
outputs.


This all sounds fine, I'm just remarking that, at present, the 
OpenSearch output has changed incompatibly, which is a bad thing, and 
that I wish, until this is fully worked out, OpenSearch returned what it 
did before (markup, although perhaps exceeding what's advised).


Doug


[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378765 ] 

Doug Cutting commented on NUTCH-267:


Andrzej: your analysis is correct, but it mostly only applies when re-crawling. 
 In an initial crawl, where each url is fetched only once, I think we implement 
 the OPIC Greedy strategy.  The question of what to do when re-crawling has 
not been adequately answered, but, glancing at the paper, it seems that 
resetting a url's score to zero each time it is fetched might be the best thing 
to do, so that it can start accumulating more cash.

When ranking, summing logs is the same as multiplying, no?

 Indexer doesn't consider linkdb when calculating boost value
 

  Key: NUTCH-267
  URL: http://issues.apache.org/jira/browse/NUTCH-267
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Chris Schneider
 Priority: Minor


 Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
 indexer.boost.by.link.count was true, the indexer boost value was scaled 
 based on the log of the # of inbound links:
 if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);
 This is no longer true (even before Andrzej implemented scoring filters). 
 Instead, the boost value is just the square root (or some other scorePower) 
 of the page score. Shouldn't the invertlinks command, which creates the 
 linkdb, have some affect on the boost value calculated during indexing 
 (either via the OPICScoringFilter or some other built-in filter)?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2006-05-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ] 

Doug Cutting commented on NUTCH-134:


+1 for Summary as Writable and change HitSummarizer.getSummary() to return a 
Summary directly rather than a String.  I don't think this has bad performance 
implications.

 Summarizer doesn't select the best snippets
 ---

  Key: NUTCH-134
  URL: http://issues.apache.org/jira/browse/NUTCH-134
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: summarizer.060506.patch

 Summarizer.java tries to select the best fragments from the input text, where 
 the frequency of query terms is the highest. However, the logic in line 223 
 is flawed in that the excerptSet.add() operation will add new excerpts only 
 if they are not already present - the test is performed using the Comparator 
 that compares only the numUniqueTokens. This means that if there are two or 
 more excerpts, which score equally high, only the first of them will be 
 retained, and the rest of equally-scoring excerpts will be discarded, in 
 favor of other excerpts (possibly lower scoring).
 To fix this the Set should be replaced with a List + a sort operation. To 
 keep the relative position of excerpts in the original order the Excerpt 
 class should be extended with an int order field, and the collected 
 excerpts should be sorted in that order prior to adding them to the summary.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-267) Indexer doesn't consider linkdb when calculating boost value

2006-05-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-267?page=comments#action_12378560 ] 

Doug Cutting commented on NUTCH-267:


The OPIC score is much like a count of incoming links, but a bit more refined.  
OPIC(P) is one plus the sum of the OPIC contributions for all links to a page.  
The OPIC contribution of a link from page P is OPIC(P) / numOutLinks(P).
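For example (numbers made up for illustration): if page P has OPIC(P) = 3.0 and 
six outlinks, each of those links contributes 3.0 / 6 = 0.5; a page Q linked 
from P and from a page R with OPIC(R) = 2.0 and four outlinks would then get 
OPIC(Q) = 1 + 0.5 + 2.0/4 = 2.0.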

 Indexer doesn't consider linkdb when calculating boost value
 

  Key: NUTCH-267
  URL: http://issues.apache.org/jira/browse/NUTCH-267
  Project: Nutch
 Type: Bug

   Components: indexer
 Versions: 0.8-dev
 Reporter: Chris Schneider
 Priority: Minor


 Before OPIC was implemented (Nutch 0.7, very early Nutch 0.8-dev), if 
 indexer.boost.by.link.count was true, the indexer boost value was scaled 
 based on the log of the # of inbound links:
 if (boostByLinkCount)
   res *= (float)Math.log(Math.E + linkCount);
 This is no longer true (even before Andrzej implemented scoring filters). 
 Instead, the boost value is just the square root (or some other scorePower) 
 of the page score. Shouldn't the invertlinks command, which creates the 
 linkdb, have some affect on the boost value calculated during indexing 
 (either via the OPICScoringFilter or some other built-in filter)?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: generate.max.per.host is per reduce task

2006-05-07 Thread Doug Cutting

Chris Schneider wrote:
I just noticed that the generate.max.per.host property is only enforced 
on a per reduce task basis during the first generate job (see 
Generator.Selector.reduce for details). At a minimum, it should probably 
be documented this way in nutch-default.xml.template.


Yes, but all URLs with the same host are a single reduce task, since it 
is generating host-disjoint fetch lists.
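
The host-disjointness amounts to partitioning on the host name; a minimal 
sketch (an illustration only, not the Generator's actual partitioner):

import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitionSketch {
  // Every URL from the same host hashes to the same reduce task.
  static int partition(String url, int numReduceTasks) {
    try {
      String host = new URL(url).getHost();
      return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    } catch (MalformedURLException e) {
      return 0;
    }
  }
}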


Doug


Re: svn commit: r399515 - /lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

2006-05-05 Thread Doug Cutting
This sort of error will become much harder to make once we upgrade to 
Hadoop 0.2 and replace most uses of java.io.File with 
org.apache.hadoop.fs.Path.


Doug

[EMAIL PROTECTED] wrote:

Author: ab
Date: Wed May  3 19:42:02 2006
New Revision: 399515

URL: http://svn.apache.org/viewcvs?rev=399515view=rev
Log:
Use the FileSystem instead of java.io.File.exists().

Modified:
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

Modified: 
lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
URL: 
http://svn.apache.org/viewcvs/lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java?rev=399515r1=399514r2=399515view=diff
==
--- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java 
(original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Wed 
May  3 19:42:02 2006
@@ -502,7 +502,7 @@
   }
 }
 Configuration conf = NutchConfiguration.create();
-FileSystem fs = FileSystem.get(conf);
+final FileSystem fs = FileSystem.get(conf);
 SegmentReader segmentReader = new SegmentReader(conf, co, fe, ge, pa, pd, 
pt);
 // collect required args
 switch (mode) {
@@ -529,7 +529,9 @@
 File dir = new File(args[++i]);
 File[] files = fs.listFiles(dir, new FileFilter() {
   public boolean accept(File pathname) {
-if (pathname.isDirectory()) return true;
+try {
+  if (fs.isDirectory(pathname)) return true;
+} catch (IOException e) {};
 return false;
   }
 });




CommerceNet Events » Blog Archive » T 3 5/11: Stefan Groschupf on Extending Nutch

2006-05-05 Thread Doug Cutting

It seems Stefan is giving a talk...

http://events.commerce.net/?p=58

Doug


Re: mapred question

2006-05-02 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

As far as we understood from MapRed documentation all reduce tasks must be
launched after last map task is finished e.g map and reduce must not work
simultaneously. But often in logs we see such records: map 80%, reduce 10%
and many more records where map is less then 100% but reduce more than 0%.
How should we interpret this?


Hadoop includes the shuffle stage in reduce.  Currently, the first 25% of 
a reduce task's progress is copying map outputs to the reduce node. 
These copies can start as soon as any map task completes, so that, when 
the last map task completes, there is very little data remaining to be 
copied, and the rest of the reduce work can quickly start.


Doug


Re: Content-Type inconsistency?

2006-05-02 Thread Doug Cutting

Jérôme Charron wrote:

We had to turn off
the guessing of content types to index Apache correctly.


Instead of turning off the guessing of content types you should only to
remove the magic for xml in mime-types.xml


Perhaps that would have worked also, but, with Apache, simply trusting 
the declared Content-Type seems to work quite well.



I think we
shouldn't aim to guess things any more than a browser does.  If browsers
require standards compliance, then our lives will be simpler.


Yes, but actually Nutch cannot act as a browser.
For instance with RSS: A browser knows that a URL is an RSS feed because 
there

is a <link rel="alternate" type="..."/>
with the correct content-type (application/rss+xml) in the referring HTML
page.
Nutch doesn't keep such information for guessing a content-type (it could
be a good thing to add), so it must find the content-type from the URL
(without any context).


Shouldn't RSS feeds declare the correct content-type?

http://feedvalidator.org/docs/warning/NonSpecificMediaType.html

I don't see that context should be required for feeds.

Doug


[jira] Commented: (NUTCH-257) Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

2006-04-28 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-257?page=comments#action_12376989 ] 

Doug Cutting commented on NUTCH-257:


I'd vote to never have Summary#toString() perform entity encoding, to fix 
search.jsp to encode things itself, and *not* to add a new 
Summary#toEntityEncodedString() method.

 Summary#toString always Entity encodes -- problem for 
 OpenSearchServlet#description field
 -

  Key: NUTCH-257
  URL: http://issues.apache.org/jira/browse/NUTCH-257
  Project: Nutch
 Type: Bug

   Components: searcher
 Versions: 0.8-dev
 Reporter: [EMAIL PROTECTED]
 Priority: Minor


 All search result data we display in search results has to be explicitly 
 Entity.encoded outputing in search.jsp ( title, url, etc.) except Summaries.  
 Its already Entity.encoded.  This is fine when outputing HTML but it gets in 
 the way when outputing otherwise -- as xml for example.  I'd suggest we not 
 make any presumption about how search results are used.
 The problem becomes especially acute when the text language is other than 
 english.
 Here is an example of what a Czech description field in an OpenSearchServlet 
 hit record looks like:
 descriptionlt;span class=ellipsisgt; ... 
 lt;/spangt;Vamp;#283;deckamp;aacute; knihovna v Olomouci 
 Bezruamp;#269;ova 2, Olomouc 9, 779 11, amp;#268;eskamp;aacute; republika 
 amp;nbsp; tel. +420-585223441 amp;nbsp; fax +420-585225774 
 http://www.lt;span class=highlightgt;vkollt;/spangt;.cz/ 
 amp;nbsp;amp;nbsp; mailto:info@lt;span 
 class=highlightgt;vkollt;/spangt;.cz Otevamp;#345;eno : amp;nbsp; 
 po-pamp;aacute; amp;nbsp; 8 30 -19 00 amp;nbsp;amp;nbsp;amp;nbsp; so 
 amp;nbsp; 9 00 -13 00 amp;nbsp;amp;nbsp;amp;nbsp; ne amp;nbsp; 
 zavamp;#345;eno V katalogu s amp;uacute;plnamp;yacute;m 
 amp;#269;asovamp;yacute;mlt;span class=ellipsisgt; ... lt;/spangt;03 
 Organizace 20/12 Odkazy 19/04 Hledej 23/03 amp;nbsp; 23/03 amp;nbsp; 
 Poamp;#269;et pamp;#345;amp;iacute;stupamp;#367; od 1.9.1998. Statistiky 
 . [ ] amp;nbsp; [ Nahoru ] lt;span 
 class=highlightgt;VKOLlt;/spangt;/description
 Here is same description field with Entity.encoding disabled:
 descriptionlt;span class=ellipsisgt; ... lt;/spangt;tisky statistiky 
 knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond? 
 hlavní Internet N?mecké knihovny vázaných novin SVKOL viz lt;span 
 class=highlightgt;VKOLlt;/spangt; ?atna T telefonní ?ísla knihovny 
 zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video 
 lt;span class=highlightgt;VKOLlt;/spangt; volný výb?r výp?j?ka výro?ní 
 zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní 
 periodika zpracování fondult;span class=highlightgt;VKOLlt;/spangt; - 
 hledej Hledej [ lt;span class=highlightgt;VKOLlt;/spangt; ] [ Novinky ] 
 [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální 
 fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové 
 vyhledávání, 19/04/2003 rejst?ík vybranýchlt;span class=ellipsisgt; ... 
 lt;/spangt;/description
 Notice how the Czech characters in the first are all numerically encoded: 
 i.e. #NNN;.
 I'd suggest that Summary#toString() become Summary#toEntityEncodedString() 
 and that toString return raw aggregation of Fragments.  Would likely require 
 adding methods to the HitSummarizer interface so can ask for either raw text 
 or entity encoded with addition to NutchBean so can ask for either.  Or, 
 better I'd suggest is that Summarizer never return Entity.encoded text.  Let 
 that happen in search.jsp (I can make patch to do the latter if its amenable).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: exception

2006-04-27 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

We updated hadoop from trunk branch. But now we get new errors:


Oops.  Looks like I introduced a bug yesterday.  Let me fix it...

Sorry,

Doug


Re: TRUNK IllegalArgumentException: Argument is not an array (WAS: Re: exception)

2006-04-27 Thread Doug Cutting

I just fixed this.  Sorry for the inconvenience!

Doug

Michael Stack wrote:
I'm getting same as Anton below trying to launch a new job with latest 
from TRUNK.


Logic in ObjectWriteable#readObject seems a little off.  On the way in 
we test for a null instance.  If null, we set to NullWriteable.


Next we test declaredClass to see if its an array.  We then try to do an 
Array.getLength on instance -- which we've above set as NullWriteable.


Looks like we should test instance to see if its NullWriteable before we 
do the Array.getLength (or do the instance null check later).


Hope above helps,
St.Ack



[EMAIL PROTECTED] wrote:


We updated hadoop from trunk branch. But now we get new errors:

On tasktarcker side:
skiped
java.io.IOException: timed out waiting for response
at org.apache.hadoop.ipc.Client.call(Client.java:305)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:149)
at 
org.apache.hadoop.mapred.$Proxy0.pollForTaskWithClosedJob(Unknown

Source)
at
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:310)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:374)
at 
org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:813)

060427 062708 Client connection to 10.0.0.10:9001 caught:
java.lang.RuntimeException:
 java.lang.ClassNotFoundException:
java.lang.RuntimeException: java.lang.ClassNotFoundException:
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:152)
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:139)
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:186)
at
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:60)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:170)
060427 062708 Client connection to 10.0.0.10:9001: closing


On jobtracker side:
skiped
060427 061713 Server handler 3 on 9001 caught:
java.lang.IllegalArgumentException: Ar
gument is not an array
java.lang.IllegalArgumentException: Argument is not an array
at java.lang.reflect.Array.getLength(Native Method)
at
org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:92)
at 
org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)

at org.apache.hadoop.ipc.Server$Handler.run(Server.java:250)
skiped

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, April 
27, 2006 12:48 AM

To: nutch-dev@lucene.apache.org
Subject: Re: exception
Importance: High

This is a Hadoop DFS error.  It could mean that you don't have any 
datanodes running, or that all your datanodes are full.  Or, it could 
be a bug in dfs.  You might try a recent nightly build of Hadoop to 
see if it works any better.


Doug

Anton Potehin wrote:
 


What means error of following type :

 


java.rmi.RemoteException: java.io.IOException: Cannot obtain additional
block for file /user/root/crawl/indexes/index/_0.prx

 

 








  





Re: Content-Type inconsistency?

2006-04-27 Thread Doug Cutting

Jérôme Charron wrote:

Finaly it is a good news that Nutch seems to be more intelligent on
content-type guessing than Firefox or IE, no?


I'm not so sure.  When crawling Apache we had trouble with this feature. 
 Some HTML files that had an XML header, and that the server identified as 
text/html, Nutch decided to treat as XML, not HTML.  We had to turn off 
the guessing of content types to index Apache correctly.  I think we 
shouldn't aim to guess things any more than a browser does.  If browsers 
require standards compliance, then our lives will be simpler.


Doug


Re: exception

2006-04-26 Thread Doug Cutting
This is a Hadoop DFS error.  It could mean that you don't have any 
datanodes running, or that all your datanodes are full.  Or, it could be 
a bug in dfs.  You might try a recent nightly build of Hadoop to see if 
it works any better.


Doug

Anton Potehin wrote:

What means error of following type :

 


java.rmi.RemoteException: java.io.IOException: Cannot obtain additional
block for file /user/root/crawl/indexes/index/_0.prx

 

 





[jira] Resolved: (NUTCH-250) Generate to log truncation caused by generate.max.per.host

2006-04-20 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-250?page=all ]
 
Doug Cutting resolved NUTCH-250:


Fix Version: 0.8-dev
 Resolution: Fixed
  Assign To: Doug Cutting

I just committed this.  Thanks, Rod.

 Generate to log truncation caused by generate.max.per.host
 --

  Key: NUTCH-250
  URL: http://issues.apache.org/jira/browse/NUTCH-250
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Rod Taylor
 Assignee: Doug Cutting
  Fix For: 0.8-dev
  Attachments: nutch-generate-truncatelog.patch

 LOG.info() hosts which have had their generate lists truncated.
 This can inform admins about potential abusers or excessively large sites 
 that they may wish to block with rules.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting

Anton Potehin wrote:

We have a question on this property. Is it really preferred to set this
parameter several times greater than number of available hosts? We do
not understand why it should be so? 


It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that 
all of the task slots are used.  More tasks make recovery faster when a 
task fails, since less needs to be redone.



Our spider is distributed among 3 machines. What value is most preferred
for this parameter in our case? Which other factors may have effect on
most preferred value of this parameter?  


When fetching, the total number of hosts you're fetching can also be a 
factor, since fetch tasks are hostwise-disjoint.  If you're only 
fetching a few hosts, then a large value for mapred.map.tasks will cause 
there to be a few big fetch tasks and a bunch of empty ones.  This could 
be a problem if the big ones are not allocated evenly among your nodes.


I generally use 5*numHosts*mapred.tasktracker.tasks.maximum.
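
For example, with 3 machines and mapred.tasktracker.tasks.maximum set to 2, 
that rule of thumb gives 5 * 3 * 2 = 30 map tasks, while the minimum needed 
to keep every slot busy is 3 * 2 = 6.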

Doug


Re: jobtaraker and tasktracker

2006-04-19 Thread Doug Cutting

Anton Potehin wrote:

Are there any ways to rotate these logs ?


One way would be to configure the JVM to use a rolling FileHandler:

file:///home/cutting/local/jdk1.5-docs/api/java/util/logging/FileHandler.html

This should be possible by setting HADOOP_OPTS (in conf/hadoop-env.sh) 
and NUTCH_OPTS to include something like:


 -Djava.util.logging.config.file=myfile

The default logging config file is in your JVM, at 
jre/lib/logging.properties.


I have not in fact tried this.  If you do, please tell how it works.

Doug


Re: question about crawldb

2006-04-18 Thread Doug Cutting

Anton Potehin wrote:
1.	We have found these flags in CrawlDatum class: 


  public static final byte STATUS_SIGNATURE = 0;
  public static final byte STATUS_DB_UNFETCHED = 1;
  public static final byte STATUS_DB_FETCHED = 2;
  public static final byte STATUS_DB_GONE = 3;
  public static final byte STATUS_LINKED = 4;
  public static final byte STATUS_FETCH_SUCCESS = 5;
  public static final byte STATUS_FETCH_RETRY = 6;
  public static final byte STATUS_FETCH_GONE = 7;

Though the names of these flags describe their aims, it is not clear
completely what they mean and what is the difference between
STATUS_DB_FETCHED and STATUS_FETCH_SUCCESS for example.


The STATUS_DB_* codes are used in entries in the crawldb. 
STATUS_FETCH_* codes are used in fetcher output.  STATUS_LINKED is used 
in parser output for urls that are linked to.  A crawldb update combines 
all of these (the old version of the db, plus fetcher and parser output) 
to generate a new version of the db, containing only STATUS_DB_* 
entries.  This logic is in CrawlDbReducer.


Does that help?

Doug


Re: Duplicate Detection: Offlince vs. Search Time

2006-04-17 Thread Doug Cutting

Shailesh Kochhar wrote:
If I understand this correctly, you can only dedup by one field. This 
would mean that if you were to implement and use content-based 
deduplication, you'd have to give up limiting the number of hits per host.


Is this correct, or did I miss something?


That's correct.  That's what's currently implemented.

Doug


Re: svn commit: r394228 - in /lucene/nutch/trunk: ./ src/java/org/apache/nutch/plugin/ src/plugin/ src/plugin/analysis-de/ src/plugin/analysis-fr/ src/plugin/clustering-carrot2/ src/plugin/creativecom

2006-04-17 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

+<!-- Copy the plugin.dtd file to the plugin doc-files dir -->
+<copy file="${plugins.dir}/plugin.dtd"
+  todir="${src.dir}/org/apache/nutch/plugin/doc-files/"/>


The build should not make changes to the source tree.  The source tree 
should be read-only to the build.  All changes during build should be 
confined to the build directory.


Is this just needed for references from javadoc?  If so, then this can 
be copied to build/docs, no?


Doug


[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-12 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374272 ] 

Doug Cutting commented on NUTCH-246:


 It seems like the Injector should be loading the current time from a job 
 configuration property in the same way that that the Generator is doing [...]

That sounds like a good plan.  Will you construct a patch for this?


 segment size is never as big as topN or crawlDB size in a distributed 
 deployement
 -

  Key: NUTCH-246
  URL: http://issues.apache.org/jira/browse/NUTCH-246
  Project: Nutch
 Type: Bug

 Versions: 0.8-dev
 Reporter: Stefan Groschupf
 Priority: Minor
  Fix For: 0.8-dev


 I didn't reopen NUTCH-136 since it may be related to the hadoop split.
 I tested this on two different deployments (with 10 ttrackers + 1 jobtracker 
 and 9 ttrackers and 1 jobtracker).
 Defining map and reduce task numbers in a mapred-default.xml does not solve 
 the problem. (it is in nutch/conf on all boxes)
 We verified that it is not a problem of maximum urls per host and also not 
 a problem of the url filter.
 Looks like the first job of the Generator (Selector) already gets too few 
 entries to process. 
 Maybe this is somehow related to split generation or configuration inside 
 the distributed jobtracker since it runs in a different jvm than the jobclient.
 However we were not able to find the source for this problem.
 I think that should be fixed before publishing a nutch 0.8. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Doug Cutting

Chris Mattmann wrote:

+1 for a release sooner rather than later.


I think this is a good plan.  There's no reason we can't do another 
release in a month.  If it is back-compatible we can call it 0.8.x and 
if it's incompatible we can call it 0.9.0.


I'm going to make a Hadoop 0.1.1 release today that can be included in 
Nutch 0.8.0.  (With Hadoop we're going to aim for monthly releases, with 
potential bugfix releases between when serious bugs are found.  The big 
bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.)


So we could aim for a Nutch 0.8.0 release sometime next week.  Does that 
work for folks?


Piotr, would you like to make this release, or should I?

Doug


Re: CrawlDbReducer - selecting data for DB update

2006-04-07 Thread Doug Cutting

Andrzej Bialecki wrote:
This selection is primarily made in the while() loop in 
CrawlDbReducer:45. My main objection is that selecting the highest 
value (meaning most recent) relies on the fact that values of status 
codes in CrawlDatum are ordered according to their meaning, and they are 
treated as a sort of state machine.


Yes, that was the design, that status codes are also priorities.

However, adding new states is very 
difficult, if they should have values lower than STATUS_FETCH_GONE, as 
it leads to breaking backwards-compatibility with older segment data. 


We can use CrawlDatum.VERSION to insert new status codes 
back-compatibly.  Perhaps we should change the codes from [0, 
1, 2, ...] to [0, 10, 20, 30, ...] so that we can more easily 
introduce new values?  To update status codes from older versions we 
simply multiply by 10.
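
A sketch of what that upgrade path might look like when reading old data 
(hypothetical code, not a patch; VERSION_WITH_SPACED_CODES is an assumed 
new constant):

public void readFields(java.io.DataInput in) throws java.io.IOException {
  byte version = in.readByte();
  byte status = in.readByte();
  if (version < VERSION_WITH_SPACED_CODES) {
    status = (byte) (status * 10);   // old 0,1,2,... becomes 0,10,20,...
  }
  this.status = status;
  // ... read the remaining fields as before ...
}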


Would something like that work?

Or we could have a separate table mapping status codes to priority.

Doug


Re: PMD integration

2006-04-07 Thread Doug Cutting

Piotr Kosiorowski wrote:

I will make it totally separate target (so test do not
depend on it).


That was actually Doug's idea (and I agree with it) to stop the build
file if PMD complains about something. It's similar to testing -- if
your tests fail, the entire build file fails.


I totally agree with it - but I want to switch it on for others to
play first, and when we agree on
rules we want to use make it obligatory.


So we start out committing it as an independent target, and then add it 
to the test target?  Is that the plan?  If so, +1.


Doug


Re: web ui improvement

2006-04-07 Thread Doug Cutting

Sami Siren wrote:
I know there are people who think that a plain xml interface is good 
enough for all but I would like to give this new architecture a try.


I think this would be a great addition.  The XML has a lot of uses, but 
we should include a good native, extensible, skinnable search UI.  +1


As part of the required functionality of the 0.8 release discussion on 
some other thread, my opinion is to postpone any new ui functionality

(for example NUTCH-48) until the new architecture is in place


I would not veto someone testing and committing NUTCH-48.  We should avoid 
investing too much effort into this if it will soon be obsolete.  But if 
a small effort will give folks "did you mean" support in the interim, 
that's not a bad thing.  Of course, folks can always apply this patch 
themselves...


Doug


0.8 release schedule (was Re: latest build throws error - critical)

2006-04-06 Thread Doug Cutting

TDLN wrote:

I mean, how do others keep up to date with the main codeline? Do you
advise updating every day?


Should we make a 0.8.0 release soon?  What features are still missing 
that we'd like to get into this release?


Doug


Re: Search quality evaluation

2006-04-05 Thread Doug Cutting
FYI, Mike wrote some evaluation stuff for Nutch a long time ago.  I 
found it in the Sourceforge Attic:


http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/

This worked by querying a set of search engines, those in:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/engines/

The results of each engine are scored by how much they differ from all of 
the other engines combined.  The Kendall Tau distance is used to compare 
rankings.  Thus this is a good tool to find out how close Nutch is to 
the quality of other engines, but it may not be a good tool to make 
Nutch better than other search engines.
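
As a rough illustration of the metric (not Mike's original code), a minimal 
sketch that counts how many pairs of results two rankings order differently:

  import java.util.Arrays;
  import java.util.List;

  public class KendallTauSketch {
    /** Counts discordant pairs between two rankings of the same items. */
    static int kendallTauDistance(List<String> a, List<String> b) {
      int distance = 0;
      for (int i = 0; i < a.size(); i++) {
        for (int j = i + 1; j < a.size(); j++) {
          int x = b.indexOf(a.get(i));
          int y = b.indexOf(a.get(j));
          if (x > y) distance++;   // the two rankings disagree on this pair
        }
      }
      return distance;
    }

    public static void main(String[] args) {
      List<String> engine1 = Arrays.asList("url1", "url2", "url3");
      List<String> engine2 = Arrays.asList("url2", "url1", "url3");
      System.out.println(kendallTauDistance(engine1, engine2));  // prints 1
    }
  }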


In any case, it includes a system to scrape search results from other 
engines, based on Apple's Sherlock search-engine descriptors.  These 
descriptors are also used by Mozilla:


http://mycroft.mozdev.org/deepdocs/quickstart.html

So there's a ready supply of up-to-date descriptions for most major 
search engines.  Many engines provide a skin specifically to simplify 
parsing by these plugins.


The code that implemented Sherlock plugins in Nutch is at:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/dynamic/

Doug

Andrzej Bialecki wrote:

Hi,

I found this paper, more or less by accident:

Scaling IR-System Evaluation using Term Relevance Sets; Einat Amitay, 
David Carmel, Ronny Lempel, Aya Soffer


   http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf

It gives an interesting and rather simple framework for evaluating the 
quality of search results.


Anybody interested in hacking together a component for Nutch and e.g. 
for Google, to run this evaluation? ;)




Re: Add .settings to svn:ignore on root Nutch folder?

2006-04-05 Thread Doug Cutting

Other options (raised on the Hadoop list) are Checkstyle:

http://checkstyle.sourceforge.net/

and FindBugs:

http://findbugs.sourceforge.net/

Although these are both under LGPL and thus harder to include in Apache 
projects.


Anything that generates a lot of false positives is bad: it either 
causes us to skip analysis of lots of files, or ignore the warnings. 
Skipping the JavaCC-generated classes is reasonable, but I'm wary of 
skipping much else.


Sigh.

Doug

Dawid Weiss wrote:


Ok, PMD seems like a good idea. I've added it to the build file. Unused 
code detection shows a few catches (javacc-generated classes need to be 
ignored because they contain a lot of junk), but unfortunately it also 
displays false positives such as in:


MapWritable.java   429   {Avoid unused private fields such as 
'fKeyClassId'}


This field is private but is used in an outside class (through a 
synthetic accessor, I presume), so the simple syntax tree analysis PMD 
does is insufficient to catch it.


These things would need to be marked in the code as ignorable... Do you 
want me to create a JIRA issue for this, Doug? Or should we drop the 
subject? Oh, I forgot to say this: PMD's jars add a minimum of 1MB to 
the codebase (Xerces can be reused).
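
One possible way to mark such a case as ignorable is PMD's NOPMD marker 
comment on the offending line; a sketch only (the field name comes from the 
report above, but the surrounding class is made up to keep it self-contained):

  public class SuppressionSketch {
    private int fKeyClassId = 42; // NOPMD - read via a synthetic accessor below

    class Inner {
      int read() { return fKeyClassId; } // access that a per-class AST scan can miss
    }

    public static void main(String[] args) {
      System.out.println(new SuppressionSketch().new Inner().read());
    }
  }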


D.



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] 

Doug Cutting commented on NUTCH-240:


+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it.  I'm not 
against it, but, at a glance, I'm still hopeful we can do better.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt

 This patch refactors all places where Nutch manipulates page scores, into a 
 plugin-based API. Using this API it's possible to implement different scoring 
 algorithms. It is also much easier to understand how scoring works.
 Multiple scoring plugins can be run in sequence, in a manner similar to 
 URLFilters.
 Included is also an OPICScoringFilter plugin, which contains the current 
 implementation of the scoring algorithm. Together with the scoring API it 
 provides a fully backward-compatible scoring.
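
As an aside, a minimal sketch of the "filters run in sequence" idea described 
above (the interface and method names are assumptions, not the actual Nutch 
ScoringFilter API):

  import java.util.Arrays;
  import java.util.List;

  public class ScoringChainSketch {
    interface ScoringFilter {
      float passScore(String url, float score);
    }

    static float runFilters(List<ScoringFilter> filters, String url, float score) {
      for (ScoringFilter f : filters) {
        score = f.passScore(url, score);   // each filter sees the previous result
      }
      return score;
    }

    public static void main(String[] args) {
      List<ScoringFilter> chain = Arrays.asList(
          (url, s) -> s * 0.5f,             // e.g. a dampening plugin
          (url, s) -> s + 0.1f);            // e.g. a boost plugin
      System.out.println(runFilters(chain, "http://example.com/", 1.0f));  // 0.6
    }
  }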

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] 

Doug Cutting commented on NUTCH-240:


Also, note that we can now extend Hadoop's new MapReduceBase to implement 
configure() and close() for many Mappers and Reducers, including the ones in 
this patch.
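
To illustrate the point, a sketch against the org.apache.hadoop.mapred API 
(the class name and type parameters are made up, not taken from the patch):

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class PassThroughMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    private JobConf job;  // kept around for use by the mapper if needed

    // Override only what is needed; close() is inherited from MapReduceBase as a no-op.
    public void configure(JobConf job) {
      this.job = job;
    }

    public void map(Text key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      output.collect(key, value);  // identity map, just to keep the sketch runnable
    }
  }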

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt

 This patch refactors all places where Nutch manipulates page scores, into a 
 plugin-based API. Using this API it's possible to implement different scoring 
 algorithms. It is also much easier to understand how scoring works.
 Multiple scoring plugins can be run in sequence, in a manner similar to 
 URLFilters.
 Included is also an OPICScoringFilter plugin, which contains the current 
 implementation of the scoring algorithm. Together with the scoring API it 
 provides a fully backward-compatible scoring.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: Refactoring some plugins

2006-03-31 Thread Doug Cutting

Jérôme Charron wrote:

One more question about javadoc (I hope the last one):
Do you think it makes sense to split the plugins gathered into the Misc
group
into many plugins (such as index-more / query-more), so that each sub-plugin
can be dispatched into its proper group?


No, I don't think so.  These are strongly related bundles of plugins. 
When you change one chances are good you'll change the others, so it 
makes sense to keep their code together rather than split it up.  Folks 
can still find all implementations of an interface in the javadoc, just 
not always grouped together in the table of contents.


We could, instead of calling these misc, call them compound plugins or 
something.  We can change the package.html for each to list the 
coordinated set of plugins they provide.  For example, 
language-identifier's could say something like, "Includes parse, index 
and query plugins to identify, index and make searchable the identified 
language."


Doug

