[jira] [Commented] (NUTCH-1325) HostDB for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860114#comment-13860114 ] Markus Jelsma commented on NUTCH-1325: -- Hi Tejas - I think most seems fine now, I like the changes you've made so far, and I cannot come up with a better solution right now for the https:// schema filtering issue. Are there any other issues we didn't think about? Anyone else? HostDB for Nutch Key: NUTCH-1325 URL: https://issues.apache.org/jira/browse/NUTCH-1325 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.9 Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, NUTCH-1325.trunk.v2.path A HostDB for Nutch and associated tools to create and read a database containing information on hosts. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860121#comment-13860121 ] Markus Jelsma commented on NUTCH-1080: -- +1! Type safe members, arguments for better readability - Key: NUTCH-1080 URL: https://issues.apache.org/jira/browse/NUTCH-1080 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Karthik K Fix For: 2.3 Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, NUTCH-rel_14-1080.patch Enable generics for some of the API, for better type safety and readability, in the process.
[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860124#comment-13860124 ] Markus Jelsma commented on NUTCH-1670: -- +1 set same crawldb directory in mergedb parameter --- Key: NUTCH-1670 URL: https://issues.apache.org/jira/browse/NUTCH-1670 Project: Nutch Issue Type: Bug Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Labels: PatchAvailable Fix For: 1.8 Attachments: NUTCH-1670.patch When merging two crawldbs using the same crawldb directory in the bin/nutch mergedb parameters, it will throw a data-not-found exception. bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 bin/nutch generate crawldb_t1 segment
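A hedged guess at the shape of such a fix (this is an illustrative sketch, not the attached NUTCH-1670.patch): de-duplicate the input crawldb paths before the merge job is set up, so the same directory is never consumed twice as input. The class and method names below are invented for the demo.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class MergeArgs {
    /**
     * Drop repeated crawldb paths while preserving their original order,
     * so "mergedb crawldb_t1 crawldb_t1 crawldb_2" behaves like
     * "mergedb crawldb_t1 crawldb_2" instead of failing mid-job.
     */
    static String[] dedupe(String[] dbs) {
        Set<String> unique = new LinkedHashSet<String>();
        for (String db : dbs) {
            unique.add(db);   // LinkedHashSet ignores duplicates, keeps order
        }
        return unique.toArray(new String[0]);
    }
}
```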
[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860140#comment-13860140 ] Markus Jelsma commented on NUTCH-1360: -- Almost all unit tests fail due to improper use of entities in configuration.
{code}
org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
java.lang.RuntimeException: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1249)
	at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1117)
	at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1053)
	at org.apache.hadoop.conf.Configuration.get(Configuration.java:460)
	at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:131)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:123)
	at org.apache.nutch.crawl.TestCrawlDbFilter.setUp(TestCrawlDbFilter.java:50)
Caused by: org.xml.sax.SAXParseException; systemId: file:/home/markus/projects/apache/nutch/trunk/conf/nutch-default.xml; lineNumber: 32; columnNumber: 60; The entity name must immediately follow the '&' in the entity reference.
	at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:251)
	at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
	at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
	at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1156)
{code}
Support the storing of IP address connected to when web crawling --- Key: NUTCH-1360 URL: https://issues.apache.org/jira/browse/NUTCH-1360 Project: Nutch Issue Type: New Feature Components: protocol Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 2.3, 1.9 Attachments: NUTCH-1360-NUTCH-289-nutch-1.5.1.patch, NUTCH-1360-nutchgora-v2.patch, NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch, NUTCH-1360-trunkv2.patch, NUTCH-1360-trunkv3.patch, NUTCH-1360-trunkv4.patch, NUTCH-1360v3.patch, NUTCH-1360v4.patch, NUTCH-1360v5.patch Simple issue enabling us to capture the specific IP address of the host which we connect to in order to fetch a page.
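The failure above is plain XML entity escaping: inside nutch-default.xml a bare `&` must be written as `&amp;`, otherwise the parser treats it as the start of an entity reference. A minimal standalone illustration (not Nutch code; the class and method names are made up for the demo):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

public class EntityDemo {
    /** Returns true if the given XML snippet parses cleanly. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            return true;
        } catch (Exception e) {  // SAXParseException for a malformed entity
            return false;
        }
    }

    public static void main(String[] args) {
        // Raw ampersand: fails with a SAXParseException like the one quoted above.
        System.out.println(parses("<description>ftp & http</description>"));
        // Escaped as &amp;: parses fine.
        System.out.println(parses("<description>ftp &amp; http</description>"));
    }
}
```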
[jira] [Reopened] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-1360: --
[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860142#comment-13860142 ] Markus Jelsma commented on NUTCH-1360: --
{code}
--- conf/nutch-default.xml (revision 1554785)
+++ conf/nutch-default.xml (working copy)
@@ -29,7 +29,7 @@
   <value>false</value>
   <description>Enables us to capture the specific IP address (InetSocketAddress) of the host which we connect to via
-  the given protocol. Currently supported is protocol-ftp
+  the given protocol. Currently supported is protocol-ftp and http.
   </description>
 </property>
{code}
Will commit shortly.
[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-356: Attachment: NUTCH-356-trunk.patch Updated patch for trunk. All tests pass. According to http://lucene.472066.n3.nabble.com/Memory-leak-when-crawling-repeatedly-td4106960.html this patch should resolve the issue. Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: https://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Fix For: 2.3, 1.8 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch While I was trying to solve a problem I reported a while ago (see Nutch-314), I found out that the problem was actually related to the plugin cache used in the class PluginRepository.java. As I said in Nutch-314, I think I somehow 'force' the way Nutch is meant to work, since I need to frequently submit new URLs and append their contents to the index; I don't (and can't) have a urls.txt file with all the URLs I'm going to fetch, but I recreate it each time a new URL is submitted. Thus, I think in the majority of cases you won't have problems using Nutch as-is, since the problem I found occurs only if Nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample URLs; to avoid webmasters' complaints I left the sample URL list empty, so you should modify the source code and add some URLs. Then you only have to run it and watch your memory consumption with top.
In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released. It seems that some class maintains a reference to it, and this class is never released since it is cached somewhere in the configuration. So I modified the PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works.
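For reference, the general shape of a fix that keeps caching but stops this kind of leak (an illustrative sketch, not the attached NUTCH-356-trunk.patch; `Object` stands in for `Configuration` and `PluginRepository` so the snippet is self-contained):

```java
import java.util.Map;
import java.util.WeakHashMap;

public class RepositoryCache {
    // A WeakHashMap drops an entry once its key becomes weakly reachable,
    // so a stale Configuration no longer pins its cached repository in memory.
    private static final Map<Object, Object> CACHE =
        new WeakHashMap<Object, Object>();

    /** One cached repository per live configuration object. */
    public static synchronized Object get(Object conf) {
        Object repo = CACHE.get(conf);
        if (repo == null) {
            repo = new Object();   // stands in for `new PluginRepository(conf)`
            CACHE.put(conf, repo);
        }
        return repo;
    }
}
```

Repeated lookups with the same configuration still hit the cache, but dropping the configuration lets the garbage collector reclaim the whole entry.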
[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860153#comment-13860153 ] Markus Jelsma commented on NUTCH-1360: -- Committed revision 1554791.
[jira] [Resolved] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1360. -- Resolution: Fixed This issue is not in 2.x, just trunk. All tests pass again.
Build failed in Jenkins: Nutch-trunk #2472
See https://builds.apache.org/job/Nutch-trunk/2472/changes
Changes: [markus] NUTCH-1360 fix entity in configuration
--
[...truncated 6752 lines...]
[echo] Compiling plugin: urlmeta
[echo] Compiling plugin: urlnormalizer-basic
[echo] Compiling plugin: urlnormalizer-host
[echo] Compiling plugin: urlnormalizer-pass
[echo] Compiling plugin: urlnormalizer-querystring
[echo] Compiling plugin: urlnormalizer-regex
[...repeated ant/ivy targets (init, init-plugin, deps-jar, clean-lib, resolve-default, jar, deps-test, deploy, copy-generated-lib) omitted for each plugin...]
javadoc: [javadoc] Generating Javadoc
[javadoc] Loading source files for packages org.apache.nutch.crawl through org.apache.nutch.parse.headings
[javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130: error: unmappable character for encoding ASCII
[javadoc] * Simple character substitution which cleans all ??? chars from a given String.
[javadoc] ^
[...the same error for StringUtil.java:130 reported twice more...]
[javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133: error: unmappable character for encoding ASCII
[javadoc] return value.replaceAll(???, );
[javadoc] ^
[jira] [Commented] (NUTCH-1360) Support the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860222#comment-13860222 ] Hudson commented on NUTCH-1360: --- FAILURE: Integrated in Nutch-trunk #2472 (See [https://builds.apache.org/job/Nutch-trunk/2472/]) NUTCH-1360 fix entity in configuration (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554791) * /nutch/trunk/conf/nutch-default.xml
[jira] [Created] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
Markus Jelsma created NUTCH-1691: Summary: DomainBlacklist url filter does not allow -D filter file override Key: NUTCH-1691 URL: https://issues.apache.org/jira/browse/NUTCH-1691 Project: Nutch Issue Type: Bug Affects Versions: 1.7 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.8, 2.4 This filter does not accept -Durlfilter.domainblacklist.file= overrides. The plugin's file attribute is always used.
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860281#comment-13860281 ] Markus Jelsma commented on NUTCH-1691: -- This means existing behaviour is unchanged, the defaults are still the same.
[jira] [Updated] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1691: - Attachment: NUTCH-1691-trunk.patch Patch for trunk. This fixes the issue by defaulting it in nutch-default and commenting out the file attribute in plugin.xml.
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860288#comment-13860288 ] Markus Jelsma commented on NUTCH-1691: -- Well, there is a small issue now:
{code}
WARN domainblacklist.DomainBlacklistURLFilter - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domainblacklist
{code}
In my opinion we can remove the INFO and WARN code:
{code}
if (attributeFile != null) {
  if (LOG.isInfoEnabled()) {
    LOG.info("Attribute \"file\" is defined for plugin " + pluginName + " as " + attributeFile);
  }
} else {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Attribute \"file\" is not defined in plugin.xml for plugin " + pluginName);
  }
}
{code}
And only show an ERROR if there are no rules to work with. What do you think?
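The proposal above — drop the INFO/WARN chatter and only report an error when no rules were loaded at all — could look roughly like this (an illustrative sketch with invented names, not the actual Nutch filter code):

```java
public class RulesFileResolution {
    /**
     * Resolve the rules file: a -D configuration override wins, and the
     * plugin.xml "file" attribute (which may now be absent) is the fallback.
     */
    static String resolve(String confOverride, String attributeFile) {
        if (confOverride != null && !confOverride.isEmpty()) {
            return confOverride;   // e.g. -Durlfilter.domainblacklist.file=...
        }
        return attributeFile;      // may be null once the attribute is removed
    }

    /** Complain only when there is genuinely nothing to filter with. */
    static void check(String rulesFile, int ruleCount) {
        if (ruleCount == 0) {
            System.err.println("ERROR: no rules found"
                + (rulesFile == null ? "" : " in " + rulesFile));
        }
    }
}
```

With this shape, a missing plugin.xml attribute is silent as long as some source actually yields rules.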
[jira] [Created] (NUTCH-1692) SegmentReader broken in distributed mode
Markus Jelsma created NUTCH-1692: Summary: SegmentReader broken in distributed mode Key: NUTCH-1692 URL: https://issues.apache.org/jira/browse/NUTCH-1692 Project: Nutch Issue Type: Bug Affects Versions: 1.7 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.8 Attachments: NUTCH-1692-trunk.patch SegmentReader's -list option ignores the -no* options, causing the following exception in distributed mode:
{code}
Exception in thread "main" java.lang.NullPointerException
	at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
	at java.util.Arrays.sort(Arrays.java:472)
	at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
	at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
	at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
	at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
{code}
[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1692: - Attachment: NUTCH-1692-trunk.patch Patch for trunk. Fix works, issue is gone.
[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643 ] Tejas Patil commented on NUTCH-1080: Committed to trunk (rev 1554881). Will port the same to 2.x.
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678 ] Tejas Patil commented on NUTCH-1691: Hi [~markus17], It's a good solution. +1 from me. I would like to know how you are invoking the plugin. I tried to use bin/nutch plugin urlfilter-domainblacklist but that didn't work, as it doesn't have a main().
[jira] [Commented] (NUTCH-1080) Type safe members, arguments for better readability
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860739#comment-13860739 ] Hudson commented on NUTCH-1080: --- FAILURE: Integrated in Nutch-trunk #2473 (See [https://builds.apache.org/job/Nutch-trunk/2473/]) NUTCH-1080 Type safe members, arguments for better readability (tejasp: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554881) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java * /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthentication.java * /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java * /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpBasicAuthentication.java * /nutch/trunk/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java * /nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java * /nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java * /nutch/trunk/src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java * /nutch/trunk/src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860740#comment-13860740 ] Hudson commented on NUTCH-1670: --- FAILURE: Integrated in Nutch-trunk #2473 (See [https://builds.apache.org/job/Nutch-trunk/2473/]) NUTCH-1670 set same crawldb directory in mergedb parameter (tejasp: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1554883) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
[jira] [Commented] (NUTCH-1454) parsing chm failed
[ https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803 ] Tejas Patil commented on NUTCH-1454: TIKA-1122 is fixed and I have verified that 'parsechecker' works fine with the same. Upgrading to Tika 1.5 (yet to be released) should fix this for Nutch. parsing chm failed -- Key: NUTCH-1454 URL: https://issues.apache.org/jira/browse/NUTCH-1454 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.5.1 Reporter: Sebastian Nagel Priority: Minor Fix For: 1.9 (reported by Jan Riewe, see http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html) Nutch fails to parse chm files with {quote} ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp {quote} Tested with chm test files from Tika: {code} % bin/nutch parsechecker file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm {code} Tika parses this document (but does not extract any content).
Re: Nutch Crawl a Specific List Of URLs (150K)
Thanks for all the response, they are very inspiring and diving into the log level is very beneficial to learn Nutch. The fact is that I use Python BeautifulSoup to parse the sitemap of my targeted website, which comes up with those 150K URLs, however, it turned out that there are many many duplicates which actually in the end turned out to be 900 distinct URLs. And Nutch is smart enough to filter out those duplicates and come up with 900 before hitting their websites. On Mon, Dec 30, 2013 at 4:13 AM, Markus Jelsma markus.jel...@openindex.iowrote: Hi, You ran one crawl cycle. Depending on the generator and fetcher settings you are not guaranteerd to fetch 200.000 URL's with only topN specified. Check the logs, the generator will tell you if there are too many URL's for a host or domain. Also check all fetcher logs, it will tell you how much it crawled and why it likely stopped when it did. Cheers -Original message- From: Bin Wangbinwang...@gmail.com Sent: Friday 27th December 2013 19:50 To: dev@nutch.apache.org Subject: Nutch Crawl a Specific List Of URLs (150K) Hi, I have a very specific list of URLs, which is about 140K URLs. I switch off the `db.update.additions.allowed` so it will not update the crawldb... and I was assuming I can feed all the URLs to Nutch, and after one round of fetching, it will finish and leave all the raw HTML files in the segment folder. However, after I run this command: nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 It ended up with a small number of URLs.. TOTAL urls: 872 retry 0:872 min score: 1.0 avg score: 1.0 max score: 1.0 And I double check the log to make sure that every url can pass the filter and normalization. 
And here is the log:
2013-12-27 17:55:25,068 INFO crawl.Injector - Injector: total number of urls rejected by filters: 0
2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: total number of urls injected after normalization and filtering: 139058
2013-12-27 17:55:25,069 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
I don't know how 140K URLs ended up being 872 in the end... /usr/bin -- AWS ubuntu instance, Nutch 1.7, java version 1.6.0_27, OpenJDK Runtime Environment (IcedTea6 1.12.6) (6b27-1.12.6-1ubuntu0.12.04.4), OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
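As a concrete starting point for Markus's advice about generator settings: in Nutch 1.x the per-host/per-domain cap during fetchlist generation is controlled by properties like these in nutch-site.xml (the values shown are the stock defaults, so treat this as an illustrative sketch rather than a recommended configuration):

```xml
<!-- nutch-site.xml overrides (sketch; values are the 1.x defaults) -->
<property>
  <name>generate.max.count</name>
  <!-- maximum URLs per host/domain in a single fetchlist; -1 = unlimited -->
  <value>-1</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- whether generate.max.count is applied per "host" or per "domain" -->
  <value>host</value>
</property>
```

If these are left at their defaults, the generator will not silently drop URLs per host, which points back at the deduplication the poster later confirmed as the cause.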
use Map Reduce + Jsoup to parse big Nutch/Content file
Hi, I have a robot that scrapes a website daily and stores the HTML locally so far (in Nutch binary format in the segment/content folder). The size of the scraping is fairly big: a million pages per day. One thing about the HTML pages themselves is that they follow exactly the same format, so I can write a parser in Java to parse out the info I want (say unit price, part number, etc.) for one page, and that parser will work for most of the pages. I am wondering whether there is some MapReduce template already written, so I can just replace the parser with my customized one and easily start a Hadoop MapReduce job. (Actually, there doesn't have to be any reduce job... in this case, we map every page to the parsed result and that is it.) I was looking at the MapReduce example here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html But I have some trouble translating that into my real-world Nutch problem. I know running MapReduce against a Nutch binary file will be a bit different from word count. I looked at the source code of Nutch and, to me, it looks like the files are sequence files of records, where each record is a key/value pair: the key is of type Text and the value is of type org.apache.nutch.protocol.Content. Then how should I configure the map job so it can read in the raw big content binary file, do the InputSplit correctly, and run the map job? Thanks a lot! /usr/bin
(Some explanation of why I decided not to write a Java plugin: I was thinking about writing a Nutch plugin so it would be handy to parse the scraped data using a Nutch command. However, the problem is that it is hard to write a perfect parser in one go. This probably makes a lot of sense to people who deal with parsers a lot. You locate your HTML tag by some specific features that you think will be general... CSS class, type, id, etc., even combined with regular expressions. However, when you apply your logic to all the pages, it won't hold true for all of them. Then you need to write many different parsers, run them against the whole dataset (a million pages) in one go, and see which one performs best. Then you run your parser against all your snapshots (days * a million pages) to get the new dataset.)
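For what the poster describes, a map-only job over a segment's content directory is a reasonable shape. Below is a rough sketch using the old `mapred` API (the same API style as Nutch 1.x itself). It assumes the Hadoop and Nutch jars are on the classpath; the class name, the input path, and the placeholder extraction inside `map` are all illustrative, and the real Jsoup parsing would replace the marked TODO:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.protocol.Content;

// Map-only job: read <Text url, Content page> records from a segment's
// content directory and emit <url, extracted-field> pairs as text.
public class SegmentParseJob extends MapReduceBase
    implements Mapper<Text, Content, Text, Text> {

  public void map(Text url, Content content,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Raw bytes of the fetched page; charset handling omitted for brevity.
    String html = new String(content.getContent());
    // TODO: plug in your Jsoup-based extraction logic here.
    String extracted = html.substring(0, Math.min(80, html.length()));
    out.collect(url, new Text(extracted));
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(SegmentParseJob.class);
    job.setJobName("segment-parse");
    // e.g. result/segments/20131227.../content (illustrative path)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // The segment content is a SequenceFile, so this input format
    // handles the InputSplits the poster asks about.
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setMapperClass(SegmentParseJob.class);
    job.setNumReduceTasks(0); // map-only, as the poster suggests
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    JobClient.runJob(job);
  }
}
```

Setting the number of reduce tasks to zero makes Hadoop write the map output directly, which matches the "map every page to the parsed result and that is it" requirement.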
[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861142#comment-13861142 ] Tien Nguyen Manh commented on NUTCH-1686: - In this patch I also fixed a bug with fetchTime: currently, each time we run the whole updatedb, fetchTime is increased again for all URLs. Optimize UpdateDb to load less field from Store --- Key: NUTCH-1686 URL: https://issues.apache.org/jira/browse/NUTCH-1686 Project: Nutch Issue Type: Improvement Affects Versions: 2.3 Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1686.patch While running a large crawl I found that updatedb runs very slowly, especially the map task which loads data from the store. We can't use filtering by batchId to load fewer URLs, due to the bug in NUTCH-1679, so we must always update the whole table. After checking the fields loaded in UpdateDbJob, I found that it loads many fields from the store (at least 15 of 25 fields), which makes updatedb slow. I think UpdateDbJob only needs to load a few fields: SCORE, OUTLINKS, and METADATA, which are used to compute link score and distance, which I think is the main purpose of this job. The other fields are used to compute the URL schedule for the parser and fetcher; we can move that code to the Parser or Fetcher without loading many new fields, because many fields are generated by the parser. We can also use a Gora filter for the Fetcher or Parser, so loading new fields is not a problem. I also added a new field SCOREMETA to WebPage to store CASH and DISTANCE, which are currently stored in METADATA. The CASH field is used in OPIC scoring, which is used only in UpdateDb, and distance is used only in the Generator and Updater, so moving both fields to the new metadata field prevents reading METADATA in the Generator and Updater; METADATA contains a lot of data that is used only in the Parser and Indexer. So with the new change, UpdateDb only loads SCORE, SCOREMETA (CASH, DISTANCE), OUTLINKS, and MARKERS; we don't need to load the big Fetch family or INLINKS.
The Generator only loads SCOREMETA (which is smaller than the current METADATA). -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Build failed in Jenkins: Nutch-trunk #2474
See https://builds.apache.org/job/Nutch-trunk/2474/ -- [...truncated 6749 lines...] deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlmeta jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-querystring jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-regex jar: deps-test: init: init-plugin: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: jar: deps-test: deploy: copy-generated-lib: deploy: copy-generated-lib: compile: javadoc: [javadoc] Generating Javadoc [javadoc] Javadoc execution [javadoc] Loading source files 
for package org.apache.nutch.crawl... [javadoc] Loading source files for package org.apache.nutch.fetcher... [javadoc] Loading source files for package org.apache.nutch.indexer... [javadoc] Loading source files for package org.apache.nutch.metadata... [javadoc] Loading source files for package org.apache.nutch.net... [javadoc] Loading source files for package org.apache.nutch.net.protocols... [javadoc] Loading source files for package org.apache.nutch.parse... [javadoc] Loading source files for package org.apache.nutch.plugin... [javadoc] Loading source files for package org.apache.nutch.protocol... [javadoc] Loading source files for package org.apache.nutch.scoring... [javadoc] Loading source files for package org.apache.nutch.scoring.webgraph... [javadoc] Loading source files for package org.apache.nutch.segment... [javadoc] Loading source files for package org.apache.nutch.tools... [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130: error: unmappable character for encoding ASCII [javadoc]* Simple character substitution which cleans all ??? chars from a given String. [javadoc] ^ [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130: error: unmappable character for encoding ASCII [javadoc]* Simple character substitution which cleans all ??? chars from a given String. [javadoc] ^ [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:130: error: unmappable character for encoding ASCII [javadoc]* Simple character substitution which cleans all ??? chars from a given String. 
[javadoc] ^ [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133: error: unmappable character for encoding ASCII [javadoc] return value.replaceAll(???, ); [javadoc] ^ [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133: error: unmappable character for encoding ASCII [javadoc] return value.replaceAll(???, ); [javadoc] ^ [javadoc] https://builds.apache.org/job/Nutch-trunk/ws/trunk/src/java/org/apache/nutch/util/StringUtil.java:133: error: unmappable character for encoding ASCII [javadoc] return value.replaceAll(???, ); [javadoc]^ [javadoc] Loading source files for package org.apache.nutch.tools.arc... [javadoc] Loading source files for package org.apache.nutch.tools.proxy... [javadoc] Loading
[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Issue Type: New Feature (was: Bug) TextMD5Signatue compute on textual content -- Key: NUTCH-1693 URL: https://issues.apache.org/jira/browse/NUTCH-1693 Project: Nutch Issue Type: New Feature Reporter: Tien Nguyen Manh Priority: Minor Fix For: 2.3 Attachments: NUTCH-1693.patch I created a new MD5Signature that is based on textual content. In our case we use boilerpipe to extract the main text from content, so this signature is more effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Fix Version/s: 2.3 TextMD5Signatue compute on textual content -- Key: NUTCH-1693 URL: https://issues.apache.org/jira/browse/NUTCH-1693 Project: Nutch Issue Type: New Feature Reporter: Tien Nguyen Manh Priority: Minor Fix For: 2.3 Attachments: NUTCH-1693.patch I created a new MD5Signature that is based on textual content. In our case we use boilerpipe to extract the main text from content, so this signature is more effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861195#comment-13861195 ] Tien Nguyen Manh commented on NUTCH-1693: - This patch only works with a minor change I made in NUTCH-1686: computing the signature after setting the text on the page. TextMD5Signatue compute on textual content -- Key: NUTCH-1693 URL: https://issues.apache.org/jira/browse/NUTCH-1693 Project: Nutch Issue Type: New Feature Reporter: Tien Nguyen Manh Priority: Minor Fix For: 2.3 Attachments: NUTCH-1693.patch I created a new MD5Signature that is based on textual content. In our case we use boilerpipe to extract the main text from content, so this signature is more effective for deduplication. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: use Map Reduce + Jsoup to parse big Nutch/Content file
Here is what I would do: if you are running a crawl, let it run with the default parser. Write a Nutch plugin with your customized parse implementation to evaluate your parse logic. Then get some real segments (with a subset of those million pages) and run only the 'bin/nutch parse' command to see how good it is. That command will run your parser over the segment. Do this until you get a satisfactory parser implementation. ~tejas On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang binwang...@gmail.com wrote: Hi, I have a robot that scrapes a website daily and store the HTML locally so far(in nutch binary format in segment/content folder). The size of the scraping is fairly big. Million pages per day. One thing about the HTML pages themselves is that they follow exactly the same format.. so I can write a parser in Java to parse out the info I want (say unit price, part number...etc) for one page, and that parser will work for most of the pages.. I am wondering is there some map reduce template already written so I can just replace the parser with my customized one and easily start a hadoop mapreduce job. (actually, there doesn't have to be any reduce job... in this case, we map every page to the parsed result and that is it...) I was looking at the map reduce example here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html But I have some problem translating that into my real-world nutch problem. I know run map reduce against Nutch binary file will be a bit different than word count. I looked at the source code of Nutch and to me, it looks like the file are a sequence files of records where each records is a key/value pair where key is text type and value is org.apache.nutch.protocol.Content type. Then how should I configure the map job so it can read in the raw big content binary file and do the Inputsplit correctly and run the map job.. Thanks a lot!
/usr/bin ( Some explanations of why I decided not to write Java plugin ): I was thinking about writing a Nutch Plugin so it will be handy to parse the scraped data using Nutch command. However, the problem here is it is hard to write a perfect parser in one go. It probably makes a lot of sense for the people who deal with parsers a lot. You locate your HTML tag by some specific features that you think will be general... css class type, id...etc...even combining with regular expression. However, when you apply your logic to all the pages, it won't stand true for all the pages. Then you need to write many different parsers to run against the whole dataset (Million pages) in one go and see which one has the best performance. Then you run your parser against all your snapshots days * million pages.. to get the new dataset.. )
Re: How Map Reduce code in Nutch run in local mode vs distributed mode?
The config 'fs.default.name' in core-site.xml is what makes this happen. Its default value is file:///, which corresponds to the local mode of Hadoop; in local mode Hadoop looks for paths on the local file system. In the distributed mode of Hadoop, 'fs.default.name' would be hdfs://IP_OF_NAMENODE/ and Hadoop will look for those paths in HDFS. Thanks, Tejas On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang binwang...@gmail.com wrote: Hi there, When I went through the source code of Nutch - the ParseSegment class, which is the class that parses the content in a segment - here is its MapReduce job configuration part. http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup (Line 199 - 213)
199 JobConf job = new NutchJob(getConf());
200 job.setJobName("parse " + segment);
201
202 FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
203 job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
204 job.setInputFormat(SequenceFileInputFormat.class);
205 job.setMapperClass(ParseSegment.class);
206 job.setReducerClass(ParseSegment.class);
207
208 FileOutputFormat.setOutputPath(job, segment);
209 job.setOutputFormat(ParseOutputFormat.class);
210 job.setOutputKeyClass(Text.class);
211 job.setOutputValueClass(ParseImpl.class);
212
213 JobClient.runJob(job);
Here, in line 202 and line 208, the MapReduce input/output paths are configured by calling the addInputPath/setOutputPath methods, and it is an absolute path on the Linux OS instead of an HDFS virtual path. On the other hand, when I look at the WordCount example on the Hadoop homepage: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 - 55)
39. JobConf conf = new JobConf(WordCount.class);
40. conf.setJobName("wordcount");
41.
42. conf.setOutputKeyClass(Text.class);
43. conf.setOutputValueClass(IntWritable.class);
44.
45. conf.setMapperClass(Map.class);
46. conf.setCombinerClass(Reduce.class);
47. conf.setReducerClass(Reduce.class);
48.
49. conf.setInputFormat(TextInputFormat.class);
50. conf.setOutputFormat(TextOutputFormat.class);
51.
52. FileInputFormat.setInputPaths(conf, new Path(args[0]));
53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55. JobClient.runJob(conf);
Here, the input/output paths are configured in the same way as in Nutch, but the paths are actually passed in as command-line arguments: bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output And we can see the paths passed to the program are actually HDFS paths, not Linux OS paths. What I am confused about: is there some other configuration that I missed which leads to the runtime environment difference? In which case should I pass an absolute path or an HDFS path? Thanks a lot! /usr/bin
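To make Tejas's answer concrete, this is roughly what the relevant core-site.xml property looks like in each mode (a sketch; the namenode address below is a placeholder, not a real deployment value):

```xml
<!-- core-site.xml: local (standalone) mode - the Hadoop default -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>

<!-- core-site.xml: distributed mode - namenode address is illustrative -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.0.1:9000/</value>
</property>
```

With file:///, a path like /usr/joe/wordcount/input resolves against the local file system; with the hdfs:// value, the very same path string resolves inside HDFS. That is why the same FileInputFormat/FileOutputFormat code works in both cases without modification.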
[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861217#comment-13861217 ] Tejas Patil commented on NUTCH-356: --- +1 for commit. Plugin repository cache can lead to memory leak --- Key: NUTCH-356 URL: https://issues.apache.org/jira/browse/NUTCH-356 Project: Nutch Issue Type: Bug Affects Versions: 0.8 Reporter: Enrico Triolo Fix For: 2.3, 1.8 Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch While I was trying to solve a problem I reported a while ago (see NUTCH-314), I found out that the problem was actually related to the plugin cache used in the PluginRepository.java class. As I said in NUTCH-314, I think I somehow 'force' the way Nutch is meant to work, since I need to frequently submit new URLs and append their contents to the index; I don't (and can't) have a urls.txt file with all the URLs I'm going to fetch, but recreate it each time a new URL is submitted. Thus, I think in the majority of cases you won't have problems using Nutch as-is, since the problem I found occurs only if Nutch is used in a way similar to mine. To simplify your test I'm attaching a class that performs something similar to what I need. It fetches and indexes some sample URLs; to avoid webmasters' complaints I left the sample URL list empty, so you should modify the source code and add some URLs. Then you only have to run it and watch your memory consumption with top. In my experience I get an OutOfMemoryException after a couple of minutes, but it clearly depends on your heap settings and on the plugins you are using (I'm using 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic'). The problem is bound to the PluginRepository 'singleton' instance, since it never gets released.
It seems that some class maintains a reference to it, and that class is never released since it is cached somewhere in the configuration. So I modified PluginRepository's 'get' method so that it never uses the cache and always returns a new instance (you can find the patch in the attachment). This way the memory consumption is always stable and I get no OOM anymore. Clearly this is not the solution, since I guess there are many performance issues involved, but for the moment it works. -- This message was sent by Atlassian JIRA (v6.1.5#6160)