Build failed in Hudson: Nutch-trunk #1317

2010-11-25 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1317/

--
[...truncated 1003 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AUsrc/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AUsrc/plugin/urlnormalizer-pass/plugin.xml
AUsrc/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A src/plugin/parse-html/src/test/org/apache/nutch
A src/plugin/parse-html/src/test/org/apache/nutch/parse
A 

[jira] Commented: (NUTCH-938) Imposible to fetch sites with robots.txt

2010-11-25 Thread Enrique Berlanga (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935728#action_12935728
 ] 

Enrique Berlanga commented on NUTCH-938:


Thanks for your answer. I agree with you that Nutch as a project cannot 
encourage such practice, but maybe some code in Protocol or Fetcher class need 
to be removed from official source. If not, It's hard to understand why this 
lines appear in the main method of the class ...

// set non-blocking  no-robots mode for HTTP protocol plugins.
getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
getConf().setBoolean(Protocol.CHECK_ROBOTS, false);

... and later in fetcher thread that values are ignored.
Maybe some notes in crawl-urlfilter.txt showing these properties as deprecated 
would be great.

My question is: Is there any reason to force it to false? A well-behaved 
crawler that obeys robot rules and netiquette must force it to true, what makes 
me being a little confused about that part of the code. I would prefer to feel 
free to change the behaviour by changing protocol.plugin.check.robots value 
in crawl-urlfilter.txt file.
Thanks in advance

 Imposible to fetch sites with robots.txt 
 -

 Key: NUTCH-938
 URL: https://issues.apache.org/jira/browse/NUTCH-938
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.2
 Environment: red hat, nutch 1.2, jaca 1.6
Reporter: Enrique Berlanga
 Attachments: NUTCH-938.patch


 Crawling a site with a robots.txt file like this:  (e.g: 
 http://www.melilla.es)
 ---
 User-agent: *
 Disallow: /
 ---
 No links are followed. 
 It doesn't matters the value set at protocol.plugin.check.blocking or 
 protocol.plugin.check.robots properties, because they are overloaded in 
 class org.apache.nutch.fetcher.Fetcher:
 // set non-blocking  no-robots mode for HTTP protocol plugins.
 getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
 getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
 False is the desired value, but in FetcherThread inner class, robot rules are 
 checket ignoring the configuration:
 
 RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
 if (!rules.isAllowed(fit.u)) {
  ...
 LOG.debug(Denied by robots.txt:  + fit.url);
 ...
 continue;
 }
 ---
 I suposse there is no problem in disabling that part of the code directly for 
 HTTP protocol. If so, I could submit a patch as soon as posible to get over 
 this.
 Thanks in advance

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-4.patch

Final version of the patch.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-932.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1039014.

 Bulk REST API to retrieve crawl results as JSON
 ---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0

 Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
 NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch


 It would be useful to be able to retrieve results of a crawl as JSON. There 
 are a few things that need to be discussed:
 * how to return bulk results using Restlet (WritableRepresentation subclass?)
 * what should be the format of results?
 I think it would make sense to provide a single record retrieval (by primary 
 key), all records, and records within a range. This incidentally matches well 
 the capabilities of the Gora Query class :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Build failed in Hudson: Nutch-trunk #1318

2010-11-25 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1318/changes

Changes:

[ab] NUTCH-932 Bulk REST API to retrieve crawl results as JSON.

--
[...truncated 1006 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A 
src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A 
src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A 
src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AUsrc/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A 
src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A 
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU
src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AUsrc/plugin/urlnormalizer-pass/plugin.xml
AUsrc/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A src/plugin/parse-html/src/test
A src/plugin/parse-html/src/test/org
A src/plugin/parse-html/src/test/org/apache
A