[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860114#comment-13860114
 ] 

Markus Jelsma commented on NUTCH-1325:
--

Hi Tejas - I think most of it seems fine now and I like the changes you've made so 
far, and I cannot come up with a better solution right now for the https:// 
scheme filtering issue.

Are there any other issues we didn't think about? Anyone else?

 HostDB for Nutch
 

 Key: NUTCH-1325
 URL: https://issues.apache.org/jira/browse/NUTCH-1325
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.9

 Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
 NUTCH-1325.trunk.v2.path


 A HostDB for Nutch and associated tools to create and read a database 
 containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860121#comment-13860121
 ] 

Markus Jelsma commented on NUTCH-1080:
--

+1!

 Type safe members, arguments for better readability 
 -

 Key: NUTCH-1080
 URL: https://issues.apache.org/jira/browse/NUTCH-1080
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Karthik K
 Fix For: 2.3

 Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
 NUTCH-rel_14-1080.patch


 Enable generics for some of the API, for better type safety and readability 
 in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860124#comment-13860124
 ] 

Markus Jelsma commented on NUTCH-1670:
--

+1

 set same crawldb directory in mergedb parameter
 ---

 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
  Labels: PatchAvailable
 Fix For: 1.8

 Attachments: NUTCH-1670.patch


 when merging two crawldbs using the same crawldb directory in the bin/nutch 
 mergedb parameters, it will throw a data-not-found exception. 
 bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
 bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140
 ] 

Markus Jelsma commented on NUTCH-1360:
--

Almost all unit tests fail due to improper use of entities in configuration.

{code}
org.xml.sax.SAXParseException; systemId: 
file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; 
lineNumber: 32; columnNumber: 60; The entity name must immediately follow the 
'&' in the entity reference.
java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: 
file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; 
lineNumber: 32; columnNumber: 60; The entity name must immediately follow the 
'&' in the entity reference.
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1249)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1117)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1053)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:460)
at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
at 
org.apache.nutch.crawl.TestCrawlDbFilter.setUp(TestCrawlDbFilter.java:50)
Caused by: org.xml.sax.SAXParseException; systemId: 
file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; 
lineNumber: 32; columnNumber: 60; The entity name must immediately follow the 
'&' in the entity reference.
at 
com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
at 
com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1156)
{code}
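
The parse error indicates a bare ampersand in nutch-default.xml: in XML, a 
literal & must be escaped as an entity. A minimal illustration (not the actual 
file contents):

{code}
<!-- invalid: a bare ampersand breaks the XML parser -->
<description>protocol-ftp & http</description>
<!-- valid alternatives -->
<description>protocol-ftp &amp; http</description>
<description>protocol-ftp and http</description>
{code}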

 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Reopened] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reopened NUTCH-1360:
--


 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142
 ] 

Markus Jelsma commented on NUTCH-1360:
--

{code}
--- conf/nutch-default.xml  (revision 1554785)
+++ conf/nutch-default.xml  (working copy)
@@ -29,7 +29,7 @@
  <value>false</value>
  <description>Enables us to capture the specific IP address 
  (InetSocketAddress) of the host which we connect to via 
-  the given protocol. Currently supported is protocol-ftp &
+  the given protocol. Currently supported is protocol-ftp and
  http.
  </description>
 </property>
{code}

will commit shortly

 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-356:


Attachment: NUTCH-356-trunk.patch

Updated patch for trunk. All tests pass.

According to 
http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-td4106960.html
 this patch should resolve the issue.

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: https://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Fix For: 2.3, 1.8

 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
 ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch


 While I was trying to solve a problem I reported a while ago (see Nutch-314), 
 I found out that the problem was actually related to the plugin cache used in 
 the class PluginRepository.java.
 As I said in Nutch-314, I think I somehow 'force' the way Nutch is meant to 
 work, since I need to frequently submit new urls and append their contents to 
 the index; I don't (and can't) have an urls.txt file with all the urls I'm 
 going to fetch, but I recreate it each time a new url is submitted.
 Thus, I think in the majority of cases you won't have problems using Nutch 
 as-is, since the problem I found occurs only if Nutch is used in a way 
 similar to mine.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample urls; to avoid webmasters' 
 complaints I left the sample urls list empty, so you should modify the source 
 code and add some urls.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it, and 
 this class is never released since it is cached somewhere in the 
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in the 
 attachment). This way the memory consumption is always stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.
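
 For context, the cache pattern at issue looks roughly like this (a simplified 
 sketch, not the exact Nutch source):
 {code}
 // Simplified sketch of a conf-keyed singleton cache (illustrative only).
 private static final Map<Configuration, PluginRepository> CACHE =
     new HashMap<Configuration, PluginRepository>();

 public static synchronized PluginRepository get(Configuration conf) {
   PluginRepository result = CACHE.get(conf);
   if (result == null) {
     result = new PluginRepository(conf);
     CACHE.put(conf, result); // entries are never evicted, so repositories
                              // pile up as new Configuration objects arrive
   }
   return result;
 }
 {code}
 A weak-keyed map (e.g. WeakHashMap) is one common remedy, since entries can 
 then be collected once their Configuration is no longer referenced.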



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153
 ] 

Markus Jelsma commented on NUTCH-1360:
--

Committed revision 1554791.


 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.9

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1360.
--

Resolution: Fixed

This issue is not in 2.x, just trunk. All tests pass again.

 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Build failed in Jenkins: Nutch-trunk #2472

2014-01-02 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2472/changes

Changes:

[markus] NUTCH-1360 fix entity in configuration

--
[...truncated 6752 lines...]

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlmeta

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-regex

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] Loading source files for package org.apache.nutch.protocol...
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc] Loading source files for package org.apache.nutch.scoring...
  [javadoc] Loading source files for package 
org.apache.nutch.scoring.webgraph...
  [javadoc] Loading source files for package org.apache.nutch.segment...
  [javadoc] Loading source files for package org.apache.nutch.tools...
  [javadoc] Loading source files for package org.apache.nutch.tools.arc...
  [javadoc] Loading source files for package org.apache.nutch.tools.proxy...
  [javadoc] Loading source files for package org.apache.nutch.util...
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] Loading source files for package org.apache.nutch.util.domain...
  [javadoc] Loading source files for package org.creativecommons.nutch...
  [javadoc] Loading source files for package org.apache.nutch.indexer.feed...
  [javadoc] Loading source files for package org.apache.nutch.parse.feed...
  [javadoc] Loading source files for package org.apache.nutch.parse.headings...
  [javadoc] ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc]  ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc]   ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133:
 error: unmappable character for encoding ASCII
  [javadoc] return value.replaceAll(???, );
  [javadoc]  ^
  [javadoc] 

[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling

2014-01-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222
 ] 

Hudson commented on NUTCH-1360:
---

FAILURE: Integrated in Nutch-trunk #2472 (See 
[https://builds.apache.org/job/Nutch-trunk/2472/])
NUTCH-1360 fix entity in configuration (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554791)
* /nutch/trunk/conf/nutch-default.xml


 Support the storing of IP address connected to when web crawling
 ---

 Key: NUTCH-1360
 URL: https://issues.apache.org/jira/browse/NUTCH-1360
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, 
 NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, 
 NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, 
 NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, 
 NUTCH-1360v5.patch


 Simple issue enabling us to capture the specific IP address of the host which 
 we connect to in order to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1691:


 Summary: DomainBlacklist url filter does not allow -D filter file 
override
 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4


This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860281#comment-13860281
 ] 

Markus Jelsma commented on NUTCH-1691:
--

This means existing behaviour is unchanged; the defaults are still the same.

 DomainBlacklist url filter does not allow -D filter file override
 -

 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4

 Attachments: NUTCH-1691-trunk.patch


 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
 plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1691:
-

Attachment: NUTCH-1691-trunk.patch

Patch for trunk. This fixes the issue by defaulting it in nutch-default and 
commenting out the file attribute in plugin.xml.

 DomainBlacklist url filter does not allow -D filter file override
 -

 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4

 Attachments: NUTCH-1691-trunk.patch


 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
 plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860288#comment-13860288
 ] 

Markus Jelsma commented on NUTCH-1691:
--

Well, there is a small issue now:
{code}
WARN  domainblacklist.DomainBlacklistURLFilter - Attribute "file" is not 
defined in plugin.xml for plugin urlfilter-domainblacklist
{code}

In my opinion we can remove the INFO and WARN code.

{code}
if (attributeFile != null) {
  if (LOG.isInfoEnabled()) {
    LOG.info("Attribute \"file\" is defined for plugin " + pluginName
      + " as " + attributeFile);
  }
}
else {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Attribute \"file\" is not defined in plugin.xml for plugin "
      + pluginName);
  }
}

{code}

And only show an ERROR if there are no rules to work with.
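
For illustration, a trimmed-down loading path might then look like this 
(hypothetical sketch; property and variable names assumed, not the actual 
plugin source):

{code}
// Hypothetical sketch: no INFO/WARN chatter; take the -D property override
// first, fall back to the plugin.xml attribute, and fail hard only when no
// rules can be loaded.
String file = conf.get("urlfilter.domainblacklist.file", attributeFile);
Reader reader = (file == null) ? null : conf.getConfResourceAsReader(file);
if (reader == null) {
  LOG.error("No rules to work with for urlfilter-domainblacklist (file: " + file + ")");
  throw new RuntimeException("urlfilter-domainblacklist: no rules to work with");
}
{code}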

What do you think?

 DomainBlacklist url filter does not allow -D filter file override
 -

 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4

 Attachments: NUTCH-1691-trunk.patch


 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
 plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-02 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-1692:


 Summary: SegmentReader broken in distributed mode
 Key: NUTCH-1692
 URL: https://issues.apache.org/jira/browse/NUTCH-1692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8
 Attachments: NUTCH-1692-trunk.patch

SegmentReader -list option ignores the -no* options, causing the following 
exception in distributed mode:

{code}
Exception in thread "main" java.lang.NullPointerException
at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
at java.util.Arrays.sort(Arrays.java:472)
at 
org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
at 
org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
{code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-02 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1692:
-

Attachment: NUTCH-1692-trunk.patch

Patch for trunk. The fix works; the issue is gone.

 SegmentReader broken in distributed mode
 

 Key: NUTCH-1692
 URL: https://issues.apache.org/jira/browse/NUTCH-1692
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8

 Attachments: NUTCH-1692-trunk.patch


 SegmentReader -list option ignores the -no* options, causing the following 
 exception in distributed mode:
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
 at java.util.Arrays.sort(Arrays.java:472)
 at 
 org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
 at 
 org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
 at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
 at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
 {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643
 ] 

Tejas Patil commented on NUTCH-1080:


Committed to trunk (rev 1554881). Will port the same to 2.x.

 Type safe members, arguments for better readability 
 -

 Key: NUTCH-1080
 URL: https://issues.apache.org/jira/browse/NUTCH-1080
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Karthik K
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
 NUTCH-rel_14-1080.patch


 Enable generics for some of the API, for better type safety and readability 
 in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678
 ] 

Tejas Patil commented on NUTCH-1691:


Hi [~markus17],
It's a good solution. +1 from me. 
I would like to know how you are invoking the plugin. I tried to use 
bin/nutch plugin urlfilter-domainblacklist but that didn't work, as it doesn't 
have a main().

 DomainBlacklist url filter does not allow -D filter file override
 -

 Key: NUTCH-1691
 URL: https://issues.apache.org/jira/browse/NUTCH-1691
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.8, 2.4

 Attachments: NUTCH-1691-trunk.patch


 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
 plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability

2014-01-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860739#comment-13860739
 ] 

Hudson commented on NUTCH-1080:
---

FAILURE: Integrated in Nutch-trunk #2473 (See 
[https://builds.apache.org/job/Nutch-trunk/2473/])
NUTCH-1080 Type safe members, arguments for better readability (tejasp: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554881)
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
* 
/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthentication.java
* 
/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java
* 
/nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java
* 
/nutch/trunk/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
* 
/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
* 
/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
* 
/nutch/trunk/src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java
* 
/nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java


 Type safe members, arguments for better readability 
 -

 Key: NUTCH-1080
 URL: https://issues.apache.org/jira/browse/NUTCH-1080
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Karthik K
Assignee: Tejas Patil
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
 NUTCH-rel_14-1080.patch


 Enable generics for some of the API, for better type safety and readability 
 in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860740#comment-13860740
 ] 

Hudson commented on NUTCH-1670:
---

FAILURE: Integrated in Nutch-trunk #2473 (See 
[https://builds.apache.org/job/Nutch-trunk/2473/])
NUTCH-1670 set same crawldb directory in mergedb parameter (tejasp: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554883)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java


 set same crawldb directory in mergedb parameter
 ---

 Key: NUTCH-1670
 URL: https://issues.apache.org/jira/browse/NUTCH-1670
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
  Labels: PatchAvailable
 Fix For: 1.8

 Attachments: NUTCH-1670.patch


 when merging two crawldbs using the same crawldb directory in the bin/nutch 
 mergedb parameters, it will throw a data-not-found exception. 
 bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
 bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1454) parsing chm failed

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803
 ] 

Tejas Patil commented on NUTCH-1454:


TIKA-1122 is fixed and I have verified that 'parsechecker' works fine with the 
same. Upgrading to Tika 1.5 (yet to be released) should fix this for Nutch.

 parsing chm failed
 --

 Key: NUTCH-1454
 URL: https://issues.apache.org/jira/browse/NUTCH-1454
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5.1
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.9


 (reported by Jan Riewe, see 
 http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
 Nutch fails to parse chm files with
 {quote}
  ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
 application/vnd.ms-htmlhelp
 {quote}
 Tested with chm test files from Tika:
 {code}
  % bin/nutch parsechecker 
 file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
 {code}
 Tika parses this document (but does not extract any content).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Nutch Crawl a Specific List Of URLs (150K)

2014-01-02 Thread Bin Wang
Thanks for all the responses; they are very inspiring, and diving into the
logs is very beneficial for learning Nutch.

The fact is that I use Python BeautifulSoup to parse the sitemap of my
targeted website, which comes up with those 150K URLs; however, it turned
out that there are many, many duplicates, which in the end came down to
900 distinct URLs.

And Nutch is smart enough to filter out those duplicates and come up with
the 900 before hitting the website.



On Mon, Dec 30, 2013 at 4:13 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Hi,

 You ran one crawl cycle. Depending on the generator and fetcher settings
 you are not guaranteed to fetch 200,000 URLs with only topN specified.
 Check the logs; the generator will tell you if there are too many URLs for
 a host or domain. Also check all fetcher logs; they will tell you how much it
 crawled and why it likely stopped when it did.
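
 For reference, the per-host/domain cap is configurable; an illustrative
 snippet (property names as in nutch-default.xml):
 {code}
 <!-- limit URLs per host (or per domain, via generate.count.mode) in each
      generated fetchlist; -1 means unlimited -->
 <property>
   <name>generate.max.count</name>
   <value>-1</value>
 </property>
 {code}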

 Cheers

 -Original message-
 From: Bin Wang binwang...@gmail.com
 Sent: Friday 27th December 2013 19:50
 To: dev@nutch.apache.org
 Subject: Nutch Crawl a Specific List Of URLs (150K)

 Hi,

 I have a very specific list of URLs, which is about 140K URLs.

 I switch off the `db.update.additions.allowed` so it will not update the
 crawldb... and I was assuming I can feed all the URLs to Nutch, and after
 one round of fetching, it will finish and leave all the raw HTML files in
 the segment folder.

 However, after I run this command:

 nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 

 It ended up with a small number of URLs..

 TOTAL urls: 872

 retry 0:872

 min score:  1.0

 avg score:  1.0

 max score:  1.0

 And I double-checked the log to make sure that every url passes the filters
 and normalization. Here is the log:

 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
 urls rejected by filters: 0

 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
 urls injected after normalization and filtering: 139058

 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
 urls into crawl db.

 I don't know how 140K URLs ended up being 872 in the end...

 /usr/bin

 --

 AWS ubuntu instance

 Nutch 1.7

 java version 1.6.0_27

 OpenJDK Runtime Environment (IcedTea6 1.12.6)
 (6b27-1.12.6-1ubuntu0.12.04.4)

 OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)





use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Bin Wang
Hi,

I have a robot that scrapes a website daily and stores the HTML locally
(in Nutch binary format in the segment/content folder).

The size of the scrape is fairly big: a million pages per day.
One thing about the HTML pages themselves is that they follow exactly the
same format, so I can write a parser in Java to parse out the info I want
(say unit price, part number, etc.) for one page, and that parser will work
for most of the pages.

I am wondering whether there is some map reduce template already written, so I
can just replace the parser with my customized one and easily start a hadoop
mapreduce job. (Actually, there doesn't have to be any reduce step; in
this case, we map every page to the parsed result and that is it.)

I was looking at the map reduce example here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
But I have some problems translating that into my real-world Nutch problem.

I know running map reduce against a Nutch binary file will be a bit different
from word count. I looked at the source code of Nutch and, to me, it looks
like the files are sequence files of records, where each record is a
key/value pair whose key is of text type and whose value is of
org.apache.nutch.protocol.Content type. So how should I configure the map
job so it can read in the raw big content binary file, do the InputSplit
correctly, and run the map job?

Thanks a lot!

/usr/bin


( Some explanation of why I decided not to write a Java plugin:
I was thinking about writing a Nutch plugin so it would be handy to parse
the scraped data using a Nutch command. However, the problem here is that it
is hard to write a perfect parser in one go. That probably makes a lot of
sense to people who deal with parsers a lot. You locate your HTML tag by
some specific features that you think will be general - css class, type,
id, etc. - even combined with regular expressions. However, when you apply
your logic to all the pages, it won't hold true for all of them. Then
you need to write many different parsers, run them against the whole dataset
(a million pages) in one go, and see which one performs best. Then you run
your parser against all your snapshots, days * a million pages, to get the
new dataset. )
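
For what it's worth, a minimal map-only sketch along these lines (old mapred
API to match Nutch 1.x; ContentParseJob and parseWithJsoup are made-up names
and the parse logic is left as a stub):

{code}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.protocol.Content;

public class ContentParseJob {

  // Reads segment/content as a SequenceFile of <Text url, Content page>
  // records and emits one parsed record per page.
  public static class ParseMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {
    public void map(Text url, Content content, OutputCollector<Text, Text> out,
        Reporter reporter) throws IOException {
      String html = new String(content.getContent()); // raw fetched bytes
      out.collect(url, new Text(parseWithJsoup(html)));
    }
  }

  // Placeholder for the custom Jsoup extraction (unit price, part number, ...).
  static String parseWithJsoup(String html) {
    return html;
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(ContentParseJob.class);
    job.setJobName("parse content");
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0); // map-only: each page maps straight to its parsed record
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    JobClient.runJob(job);
  }
}
{code}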


[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142
 ] 

Tien Nguyen Manh commented on NUTCH-1686:
-

In this patch I also fixed a bug with fetchTime. Currently, each time we run the 
whole updatedb, fetchTime is increased again for all urls.

 Optimize UpdateDb to load less field from Store
 ---

 Key: NUTCH-1686
 URL: https://issues.apache.org/jira/browse/NUTCH-1686
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1686.patch


 While running a large crawl I found that updatedb runs very slowly, especially the 
 map task which loads data from the store.
 We can't use filtering by batchId to load fewer urls due to the bug in NUTCH-1679, so 
 we must always update the whole table.
 After checking the fields loaded in UpdateDbJob I found that it loads many 
 fields from the store (at least 15 of 25 fields), which makes updatedb slow.
 I think that UpdateDbJob only needs to load a few fields - SCORE, OUTLINKS, 
 METADATA - which are used to compute the link score and distance, which I think is 
 the main purpose of this job.
 The other fields are used to compute the url schedule for the parser and fetcher; we 
 can move that code to the Parser or Fetcher without loading many new fields, because 
 many fields are generated by the parser. We can also use a Gora filter for the Fetcher 
 or Parser, so loading new fields there is not a problem.
 I also add a new field SCOREMETA to WebPage to store CASH and DISTANCE, which are 
 currently stored in METADATA. The field CASH is used in OPICScoring, which is used 
 only in UpdateDb, and distance is used only in the Generator and Updater, so moving 
 both fields to the new metadata field avoids reading METADATA in the Generator 
 and Updater; METADATA contains much data that is used only by the Parser and 
 Indexer.
 So with the new change:
 UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, MARKERS; we 
 don't need to load the big Fetch field family or INLINKS.
 Generator only loads SCOREMETA (which is smaller than the current METADATA).
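
 For illustration, the reduced field set might be declared like this 
 (hypothetical sketch; enum names as in org.apache.nutch.storage.WebPage.Field, 
 with SCOREMETA being the proposed new field):
 {code}
 // Hypothetical sketch: the fields UpdateDb would ask the store to load.
 Collection<WebPage.Field> fields = new HashSet<WebPage.Field>();
 fields.add(WebPage.Field.SCORE);
 fields.add(WebPage.Field.OUTLINKS);
 fields.add(WebPage.Field.MARKERS);
 // plus the proposed SCOREMETA field carrying CASH and DISTANCE
 {code}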



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Build failed in Jenkins: Nutch-trunk #2474

2014-01-02 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2474/

--
[...truncated 6749 lines...]
deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlmeta

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-pass

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-querystring

jar:

deps-test:

deploy:

copy-generated-lib:

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-regex

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.nutch.crawl...
  [javadoc] Loading source files for package org.apache.nutch.fetcher...
  [javadoc] Loading source files for package org.apache.nutch.indexer...
  [javadoc] Loading source files for package org.apache.nutch.metadata...
  [javadoc] Loading source files for package org.apache.nutch.net...
  [javadoc] Loading source files for package org.apache.nutch.net.protocols...
  [javadoc] Loading source files for package org.apache.nutch.parse...
  [javadoc] Loading source files for package org.apache.nutch.plugin...
  [javadoc] Loading source files for package org.apache.nutch.protocol...
  [javadoc] Loading source files for package org.apache.nutch.scoring...
  [javadoc] Loading source files for package 
org.apache.nutch.scoring.webgraph...
  [javadoc] Loading source files for package org.apache.nutch.segment...
  [javadoc] Loading source files for package org.apache.nutch.tools...
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc] ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc]  ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130:
 error: unmappable character for encoding ASCII
  [javadoc]* Simple character substitution which cleans all ??? chars from 
a given String.
  [javadoc]   ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133:
 error: unmappable character for encoding ASCII
  [javadoc] return value.replaceAll(???, );
  [javadoc]  ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133:
 error: unmappable character for encoding ASCII
  [javadoc] return value.replaceAll(???, );
  [javadoc]   ^
  [javadoc] 
https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133:
 error: unmappable character for encoding ASCII
  [javadoc] return value.replaceAll(???, );
  [javadoc]^
  [javadoc] Loading source files for package org.apache.nutch.tools.arc...
  [javadoc] Loading source files for package org.apache.nutch.tools.proxy...
  [javadoc] Loading 

[jira] [Updated] (NUTCH-1693) TextMD5Signature computed on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Issue Type: New Feature  (was: Bug)

 TextMD5Signature computed on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on the textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1693) TextMD5Signature computed on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1693:


Fix Version/s: 2.3

 TextMD5Signature computed on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on the textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1693) TextMD5Signature computed on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195
 ] 

Tien Nguyen Manh commented on NUTCH-1693:
-

This patch only works with a minor change that I made in NUTCH-1686: the 
signature is computed after setting the text on the page.
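
For illustration, such a signature might look roughly like this (a 
hypothetical sketch against the 2.x Signature API, not the attached patch):

{code}
// Hypothetical sketch: MD5 over the extracted text (e.g. boilerpipe output),
// falling back to the raw content when no text is available.
public class TextMD5Signature extends Signature {
  @Override
  public byte[] calculate(WebPage page) {
    CharSequence text = page.getText();
    if (text == null || text.length() == 0) {
      return MD5Hash.digest(page.getContent().array()).getDigest();
    }
    return MD5Hash.digest(text.toString().getBytes()).getDigest();
  }
}
{code}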

 TextMD5Signature computed on textual content
 --

 Key: NUTCH-1693
 URL: https://issues.apache.org/jira/browse/NUTCH-1693
 Project: Nutch
  Issue Type: New Feature
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1693.patch


 I created a new MD5Signature that is based on the textual content. In our case we use 
 boilerpipe to extract the main text from the content, so this signature is more 
 effective for deduplication.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: use Map Reduce + Jsoup to parse big Nutch/Content file

2014-01-02 Thread Tejas Patil
Here is what I would do:
If you are running a crawl, let it run with the default parser. Write a Nutch
plugin with your customized parse implementation to evaluate your parse
logic. Now get some real segments (with a subset of those million pages)
and run only the 'bin/nutch parse' command to see how good it is. That
command will run your parser over the segment. Do this until you get a
satisfactory parser implementation.

~tejas


On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang binwang...@gmail.com wrote:

 Hi,

 I have a robot that scrapes a website daily and stores the HTML locally
 (in Nutch binary format in the segment/content folder).

 The size of the scrape is fairly big: a million pages per day.
 One thing about the HTML pages themselves is that they follow exactly the
 same format, so I can write a parser in Java to parse out the info I want
 (say unit price, part number, etc.) for one page, and that parser will work
 for most of the pages.

 I am wondering whether there is some map reduce template already written, so I
 can just replace the parser with my customized one and easily start a hadoop
 mapreduce job. (Actually, there doesn't have to be any reduce step; in
 this case, we map every page to the parsed result and that is it.)

 I was looking at the map reduce example here:
 https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
 But I have some problems translating that into my real-world Nutch problem.

 I know running map reduce against a Nutch binary file will be a bit different
 from word count. I looked at the source code of Nutch and, to me, it looks
 like the files are sequence files of records, where each record is a
 key/value pair whose key is of text type and whose value is of
 org.apache.nutch.protocol.Content type. So how should I configure the map
 job so it can read in the raw big content binary file, do the InputSplit
 correctly, and run the map job?

 Thanks a lot!

 /usr/bin


 ( Some explanation of why I decided not to write a Java plugin:
 I was thinking about writing a Nutch plugin so it would be handy to parse
 the scraped data using a Nutch command. However, the problem here is that it
 is hard to write a perfect parser in one go. That probably makes a lot of
 sense to people who deal with parsers a lot. You locate your HTML tag by
 some specific features that you think will be general - css class, type,
 id, etc. - even combined with regular expressions. However, when you apply
 your logic to all the pages, it won't hold true for all of them. Then
 you need to write many different parsers, run them against the whole dataset
 (a million pages) in one go, and see which one performs best. Then you run
 your parser against all your snapshots, days * a million pages, to get the
 new dataset. )



Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-02 Thread Tejas Patil
The config 'fs.default.name' in core-site.xml is what makes this happen.
Its default value is file:///, which corresponds to the local mode of Hadoop.
In local mode Hadoop looks for paths on the local file system. In
distributed mode, 'fs.default.name' would be
hdfs://IP_OF_NAMENODE/ and Hadoop will look for those paths in HDFS.
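
For illustration, the distributed-mode setting would look something like this
in core-site.xml (host and port are placeholders):

{code}
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000/</value>
</property>
{code}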

Thanks,
Tejas


On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang binwang...@gmail.com wrote:

 Hi there,

 When I went through the source code of Nutch, I looked at the ParseSegment
 class, which is the class that parses the content in a segment. Here is its
 map reduce job configuration part:

 http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
 (Lines 199 - 213)

 199  JobConf job = new NutchJob(getConf());
 200  job.setJobName("parse " + segment);
 201
 202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
 203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
 204  job.setInputFormat(SequenceFileInputFormat.class);
 205  job.setMapperClass(ParseSegment.class);
 206  job.setReducerClass(ParseSegment.class);
 207
 208  FileOutputFormat.setOutputPath(job, segment);
 209  job.setOutputFormat(ParseOutputFormat.class);
 210  job.setOutputKeyClass(Text.class);
 211  job.setOutputValueClass(ParseImpl.class);
 212
 213  JobClient.runJob(job);
 Here, in lines 202 and 208, the map reduce input/output paths are
 configured by calling addInputPath from FileInputFormat and setOutputPath
 from FileOutputFormat.
 And it is an absolute path in the Linux OS instead of an HDFS virtual path.

 And on the other hand, when I look at the WordCount example in the hadoop
 homepage.
 https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Lines 39 - 55)

 39.  JobConf conf = new JobConf(WordCount.class);
 40.  conf.setJobName("wordcount");
 41.
 42.  conf.setOutputKeyClass(Text.class);
 43.  conf.setOutputValueClass(IntWritable.class);
 44.
 45.  conf.setMapperClass(Map.class);
 46.  conf.setCombinerClass(Reduce.class);
 47.  conf.setReducerClass(Reduce.class);
 48.
 49.  conf.setInputFormat(TextInputFormat.class);
 50.  conf.setOutputFormat(TextOutputFormat.class);
 51.
 52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
 53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 54.
 55.  JobClient.runJob(conf);
 Here, the input/output paths are configured in the same way as in Nutch, but
 the paths are actually passed in via the command-line arguments:
 bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
 /usr/joe/wordcount/input /usr/joe/wordcount/output
 And we can see the paths passed to the program are actually HDFS paths,
 not Linux OS paths.
 I am confused: is there some other configuration that I missed which
 leads to the run environment difference? In which case should I pass an
 absolute path, and in which an HDFS path?

 Thanks a lot!

 /usr/bin




[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217
 ] 

Tejas Patil commented on NUTCH-356:
---

+1 for commit.

 Plugin repository cache can lead to memory leak
 ---

 Key: NUTCH-356
 URL: https://issues.apache.org/jira/browse/NUTCH-356
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Enrico Triolo
 Fix For: 2.3, 1.8

 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
 ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch


 While I was trying to solve a problem I reported a while ago (see Nutch-314), 
 I found out that the problem was actually related to the plugin cache used in 
 the class PluginRepository.java.
 As I said in Nutch-314, I think I somehow 'force' the way Nutch is meant to 
 work, since I need to frequently submit new urls and append their contents to 
 the index; I don't (and can't) have an urls.txt file with all the urls I'm 
 going to fetch, but I recreate it each time a new url is submitted.
 Thus, I think in the majority of cases you won't have problems using Nutch 
 as-is, since the problem I found occurs only if Nutch is used in a way 
 similar to mine.
 To simplify your test I'm attaching a class that performs something similar 
 to what I need. It fetches and indexes some sample urls; to avoid webmasters' 
 complaints I left the sample urls list empty, so you should modify the source 
 code and add some urls.
 Then you only have to run it and watch your memory consumption with top. In 
 my experience I get an OutOfMemoryException after a couple of minutes, but it 
 clearly depends on your heap settings and on the plugins you are using (I'm 
 using 
 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
 The problem is bound to the PluginRepository 'singleton' instance, since it 
 never gets released. It seems that some class maintains a reference to it, and 
 this class is never released since it is cached somewhere in the 
 configuration.
 So I modified the PluginRepository's 'get' method so that it never uses the 
 cache and always returns a new instance (you can find the patch in the 
 attachment). This way the memory consumption is always stable and I get no 
 OOM anymore.
 Clearly this is not the solution, since I guess there are many performance 
 issues involved, but for the moment it works.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)