[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ]

lufeng commented on NUTCH-1854:

If we set fetcher.store.content=false and fetcher.parse=false, then the bin/nutch parse command will throw an exception when it checks that the input content directory exists. I think the reason we have this parameter is that sometimes we set fetcher.parse to true and don't want to store the content because of a slow disk or limited disk space. So I think we can remove the fetcher.store.content parameter, and if fetcher.parse=true we don't store the page content.

./bin/crawl fails with a parsing fetcher

Key: NUTCH-1854
URL: https://issues.apache.org/jira/browse/NUTCH-1854
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.9
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.11
Attachments: NUTCH-1854ver1.patch

If you run ./bin/crawl with a parsing fetcher e.g.
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. Default is false, which means that a separate parsing step is required after fetching is finished.</description>
</property>
we get a horrible message as follows
Exception in thread "main" java.io.IOException: Segment already parsed!
We could improve this by making logging more complete and by adding a trigger to the crawl script which would check for crawl_parse for a given segment and then skip parsing if this is present.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
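The check proposed in the issue (skip parsing when a segment already contains crawl_parse) can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and it looks at a local directory, whereas a real segment may live on HDFS and the actual fix would go into the crawl script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch.
final class SegmentParseCheck {
    /** Returns true if the segment directory already has a crawl_parse subdirectory. */
    static boolean alreadyParsed(Path segment) {
        return Files.isDirectory(segment.resolve("crawl_parse"));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the segment is a plain local directory.
        Path segment = Files.createTempDirectory("segment");
        System.out.println("parsed before: " + alreadyParsed(segment));
        Files.createDirectory(segment.resolve("crawl_parse"));
        System.out.println("parsed after:  " + alreadyParsed(segment));
    }
}
```

A caller would run the parse step only when alreadyParsed returns false, which avoids the "Segment already parsed!" exception.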
[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374 ]

lufeng commented on NUTCH-1939:

Hi Sebastian,

One question: how do you use the FetchItem returned by the queueRedirect method? I can't find any code that uses the returned object. I think the queueRedirect method has already added the redirect URL back to the fetch queue.

Fetcher fails to follow redirects

Key: NUTCH-1939
URL: https://issues.apache.org/jira/browse/NUTCH-1939
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
Fix For: 1.10
Attachments: NUTCH-1939.patch

As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with http.redirect.max > 0 Fetcher does not follow redirects.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374 ]

lufeng edited comment on NUTCH-1939 at 2/11/15 2:16 AM:

I think that's correct. +1

was (Author: amuseme.lu):
Hi Sebastian,

One question: how do you use the FetchItem returned by the queueRedirect method? I can't find any code that uses the returned object. I think the queueRedirect method has already added the redirect URL back to the fetch queue.

Fetcher fails to follow redirects

Key: NUTCH-1939
URL: https://issues.apache.org/jira/browse/NUTCH-1939
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.9
Reporter: Sebastian Nagel
Fix For: 1.10
Attachments: NUTCH-1939.patch

As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with http.redirect.max > 0 Fetcher does not follow redirects.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1829) Generator : unable to distinguish real errors
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110193#comment-14110193 ]

lufeng commented on NUTCH-1829:

Yes, I think we should distinguish the different results by using different return codes, so that we can determine the next action according to the return code.

Generator : unable to distinguish real errors

Key: NUTCH-1829
URL: https://issues.apache.org/jira/browse/NUTCH-1829
Project: Nutch
Issue Type: Bug
Components: nutchNewbie
Affects Versions: 1.9, 2.2.1
Environment: Ubuntu Server 14.04, OpenJDK 7
Reporter: Mathieu Bouchard

The bin/nutch generate command is returning the same error code (-1) if there is an error or no new segment to process, so there is no way to tell if the error is real or not from a shell script. This problem is related to NUTCH-1828.

The problem can be fixed by modifying the following Java source file:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?revision=1619934&view=markup
At line 711, if there is no new segment, the generator returns -1, which is the same return code returned at line 714 if there was an error.

--
This message was sent by Atlassian JIRA (v6.2#6252)
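The idea of separate return codes can be sketched like this. The constant names and the value chosen for "no new segment" are assumptions made for illustration; they are not taken from Generator.java or from any patch.

```java
// Hypothetical return codes; only -1 for errors comes from the issue text.
final class GeneratorExitCodes {
    static final int SUCCESS = 0;
    static final int NO_NEW_SEGMENT = 1; // assumed value, distinct from ERROR
    static final int ERROR = -1;

    /** Maps the outcome of a generate run to a return code a shell script can test. */
    static int exitCode(boolean failed, int segmentsGenerated) {
        if (failed) {
            return ERROR;
        }
        return segmentsGenerated == 0 ? NO_NEW_SEGMENT : SUCCESS;
    }
}
```

A calling script could then stop the crawl loop quietly on the "no new segment" code and report a failure only on the error code.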
[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525 ]

lufeng commented on NUTCH-385:

Hi Julien,

In the description of fetcher.threads.per.queue we could add that setting fetcher.threads.per.queue to a value > 1 will also cause fetcher.server.delay to be ignored. Another issue is that I think the property name fetcher.max.crawl.delay is not uniform with fetcher.server.delay and fetcher.server.min.delay. Would renaming it to fetcher.server.max.delay be more suitable?

Improve description of thread related configuration for Fetcher

Key: NUTCH-385
URL: https://issues.apache.org/jira/browse/NUTCH-385
Project: Nutch
Issue Type: Bug
Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
Fix For: 1.9
Attachments: NUTCH-385.patch

For some time I've been puzzled by the interaction between two parameters that control how often the fetcher can access a particular host:
1) The server delay, which comes back from the remote server during our processing of the robots.txt file, and which can be limited by fetcher.max.crawl.delay.
2) The fetcher.threads.per.host value, particularly when this is greater than the default of 1.

According to my (limited) understanding of the code in HttpBase.java:

Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher ends up keeping either 1 or 2 fetcher threads pointing at a particular host continuously. In other words, it never tries to point 3 at the host, and it always points a second thread at the host before the first thread finishes accessing it. Since HttpBase.unblockAddr never gets called with (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. Thus, the server delay will never be used at all. The fetcher will be continuously retrieving pages from the host, often with 2 fetchers accessing the host simultaneously.

Suppose instead that the fetcher finally does allow the last thread to complete before it gets around to pointing another thread at the target host. When the last fetcher thread calls HttpBase.unblockAddr, it will now put System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the host. This, in turn, will prevent any threads from accessing this host until the delay is complete, even though zero threads are currently accessing the host.

I see this behavior as inconsistent. More importantly, the current implementation certainly doesn't seem to answer my original question about appropriate definitions for what appear to be conflicting parameters. In a nutshell, how could we possibly honor the server delay if we allow more than one fetcher thread to simultaneously access the host? It would be one thing if whenever (fetcher.threads.per.host > 1), this trumped the server delay, causing the latter to be ignored completely. That is certainly not the case in the current implementation, as it will wait for server delay whenever the number of threads accessing a given host drops to zero.

--
This message was sent by Atlassian JIRA (v6.2#6252)
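The behaviour described in the issue can be modelled with a small sketch. This is a simplified model of the description above, not the actual HttpBase code: the class, its fields and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model: the crawl delay is only recorded
// when the last active thread for a host finishes.
final class HostDelaySketch {
    private final Map<String, Integer> threadsPerHost = new HashMap<>();
    private final Map<String, Long> blockedUntil = new HashMap<>();
    private final long crawlDelayMs;

    HostDelaySketch(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** A thread starts fetching from the host. */
    void startFetch(String host) {
        threadsPerHost.merge(host, 1, Integer::sum);
    }

    /** A thread finishes; the host is blocked only if it was the last one. */
    void finishFetch(String host, long nowMs) {
        int active = threadsPerHost.getOrDefault(host, 0);
        if (active <= 1) {
            threadsPerHost.remove(host);
            blockedUntil.put(host, nowMs + crawlDelayMs);
        } else {
            threadsPerHost.put(host, active - 1);
        }
    }

    boolean isBlocked(String host, long nowMs) {
        Long until = blockedUntil.get(host);
        return until != null && nowMs < until;
    }
}
```

With two threads on one host, the first finishFetch leaves the host unblocked, so the delay is skipped as long as the thread count never drops to zero. Only the second finishFetch starts the delay, which is the inconsistency the reporter points out.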
[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010889#comment-14010889 ] lufeng commented on NUTCH-1785: --- +1 elasticsearch 1.2.0 test ok. one question is why convert content byte[] to String type? If one segment contain both html and PDF or mp3 content. How to set this base64 parameter? Ability to index raw content Key: NUTCH-1785 URL: https://issues.apache.org/jira/browse/NUTCH-1785 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.9 Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch Some use-cases require Nutch to actually write the raw content a configured indexing back-end. Since Content is never read, a plugin is out of the question and therefore we need to force IndexJob to process Content as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (NUTCH-1521) CrawlDbFilter pass null url to urlNormailzers
[ https://issues.apache.org/jira/browse/NUTCH-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1521. - Resolution: Fixed Fix Version/s: (was: 2.4) 1.9 CrawlDbFilter pass null url to urlNormailzers - Key: NUTCH-1521 URL: https://issues.apache.org/jira/browse/NUTCH-1521 Project: Nutch Issue Type: Bug Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Trivial Fix For: 1.9 Attachments: CrawlDbFilter_v1.patch, NUTCH-1521-trunk.patch, TestCrawlDbFilter.java urlNormalizers will get null url if we set CRAWLDB_PURGE_404, and it will throw NullPointerException. and the WARN Log will output something like this Skipping null NullPointerException. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969601#comment-13969601 ]

lufeng commented on NUTCH-1726:

Hi all, is someone free to check this patch? Thanks.

HeadingsFilter does not find nested nodes

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.9
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.

--
This message was sent by Atlassian JIRA (v6.2#6252)
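Traversing nodes instead of reading getNodeValue() from direct children could look roughly like this. It is a standalone sketch using the JDK's DOM API, not the code from the attached patches; the class and method names are hypothetical, and the input is parsed as XML purely to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not the HeadingsParseFilter implementation.
final class HeadingText {
    /** Collects the text of a node and of all its descendants. */
    static String textOf(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        // Recurse so text nested in <span> and similar elements is found too.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    /** Parses a small well-formed snippet and returns the text of its root element. */
    static String firstHeading(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return textOf(doc.getDocumentElement());
    }
}
```

For the heading from the issue, the h1 element has no direct text child, so only the recursive walk reaches the text inside the span.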
[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964219#comment-13964219 ] lufeng commented on NUTCH-1752: --- Do you mean different port with same protocol and host has different robots.txt file? +1 cache robots.txt rules per protocol:host:port - Key: NUTCH-1752 URL: https://issues.apache.org/jira/browse/NUTCH-1752 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 1.8, 2.2.1 Reporter: Sebastian Nagel Fix For: 2.3, 1.9 Attachments: NUTCH-1752-v1.patch HttpRobotRulesParser caches rules from {{robots.txt}} per protocol:host (before NUTCH-1031 caching was per host only). The caching should be per protocol:host:port. In doubt, a request to a different port may deliver a different {{robots.txt}}. Applying robots.txt rules to a combination of host, protocol, and port is common practice: [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not mention this explicitly (could be derived from examples) but others do: * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: each protocol and port needs its own robots.txt file * [Google webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]: The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted. -- This message was sent by Atlassian JIRA (v6.2#6252)
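A cache key per protocol:host:port could be built as in the sketch below. The class name and key format are assumptions for illustration and are not taken from HttpRobotRulesParser or the attached patch; when a URL has no explicit port, the protocol's default port is used.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class RobotsCacheKey {
    /** Builds a "protocol:host:port" key, falling back to the default port. */
    static String of(String urlString) throws MalformedURLException {
        URL url = URI.create(urlString).toURL();
        int port = url.getPort() == -1 ? url.getDefaultPort() : url.getPort();
        return (url.getProtocol() + ":" + url.getHost() + ":" + port)
            .toLowerCase(Locale.ROOT);
    }
}
```

With such a key, http://example.com/ and http://example.com:8080/ get separate cache entries, while http://example.com/ and http://example.com:80/ share one.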
[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938867#comment-13938867 ]

lufeng commented on NUTCH-1733:

+1 pass all tests

parse-html to support HTML5 charset definitions

Key: NUTCH-1733
URL: https://issues.apache.org/jira/browse/NUTCH-1733
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.8, 2.2.1
Reporter: Sebastian Nagel
Fix For: 2.3, 1.9
Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, charset_html5.html

HTML 5 allows to specify the character encoding of a page per
* {{<meta charset=...>}}
* Unicode Byte Order Mark (BOM)
These are allowed in addition to previous HTTP/http-equiv Content-Type, see [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]].
Parse-html ignores both meta charset and BOM, falls back to the default encoding (cp1252). Parse-tika sets the encoding appropriately.

--
This message was sent by Atlassian JIRA (v6.2#6252)
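The two HTML5 mechanisms named in the issue can be detected roughly as below. This is a minimal sketch, not the parse-html patch: the class name is hypothetical, only the UTF-8 BOM is recognised, and the regular expression handles only the plain meta charset form within the first 1024 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class CharsetSniffer {
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if neither mechanism is present. */
    static String sniff(byte[] content) {
        // A UTF-8 byte order mark (EF BB BF) takes precedence.
        if (content.length >= 3 && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // Decode the head as ISO-8859-1 so every byte maps to one char.
        int len = Math.min(content.length, 1024);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A parser would still need the HTTP Content-Type and http-equiv checks as further fallbacks before using a default encoding.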
[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked
[ https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937426#comment-13937426 ]

lufeng commented on NUTCH-1736:

Hi ysc, you can check the content size to fix this issue, like this:
{code:java}
if (http.getMaxContent() >= 0 && (contentBytesRead + chunkLen) > http.getMaxContent())
  chunkLen = http.getMaxContent() - contentBytesRead;
{code}

Can't fetch page if http response header contains Transfer-Encoding:chunked

Key: NUTCH-1736
URL: https://issues.apache.org/jira/browse/NUTCH-1736
Project: Nutch
Issue Type: Bug
Components: protocol
Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
Reporter: ysc
Priority: Critical
Fix For: 2.3, 1.9
Attachments: nutch-2.2.1.patch, nutch1.7.patch
Original Estimate: 24h
Remaining Estimate: 24h

fetching: http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
Fetch failed with protocol status: EXCEPTION: java.io.IOException: unzipBestEffort returned null

--
This message was sent by Atlassian JIRA (v6.2#6252)
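The size check in the comment can be isolated into a small function to make its effect easier to see. The class and method names are hypothetical, and the guard against a negative result is an addition that is not in the quoted snippet.

```java
// Hypothetical helper mirroring the check suggested in the comment.
final class ChunkLimit {
    /**
     * Limits a chunk so the total read never exceeds maxContent.
     * A negative maxContent means "no limit", as in the quoted condition.
     */
    static int clampChunkLen(int maxContent, int contentBytesRead, int chunkLen) {
        if (maxContent >= 0 && contentBytesRead + chunkLen > maxContent) {
            // Added guard: never return a negative length.
            return Math.max(0, maxContent - contentBytesRead);
        }
        return chunkLen;
    }
}
```

For example, with a limit of 100 bytes and 90 already read, a 20-byte chunk is cut to 10 bytes.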
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ]

lufeng commented on NUTCH-1726:

Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my testing code, but I cannot reproduce your test result when I use the following process to test our patch:
{code:bash}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}
Everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need to allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng

HeadingsFilter does not find nested nodes

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ]

lufeng edited comment on NUTCH-1726 at 2/24/14 2:41 PM:

Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my testing code, but I cannot reproduce your test result when I use the following process to test our patch:
{code:java}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}
Everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need to allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng

was (Author: amuseme.lu):
Hi Markus,

It seems that HeadingsFilter does not find nested nodes in my testing code, but I cannot reproduce your test result when I use the following process to test our patch:
{code:bash}
svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2
cd nutch-svn2
patch -p0 < NUTCH-1726-trunk.patch
ant
cd src/plugin/headings/
ant test
{code}
Everything seems ok.

Yes, you are right, maybe someone wants to ignore long headers. But do we need to allow setting the headings.maxlength option to -1 to disable this check? Maybe someone wants to disable this feature.

Feng

HeadingsFilter does not find nested nodes

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432 ]

lufeng commented on NUTCH-1726:

Hi Markus. But I didn't find any error using your newest patch.
{code:xml}
test:
     [echo] Testing plugin: headings
    [junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec

BUILD SUCCESSFUL
Total time: 3 seconds
{code}
* Maybe you can truncate long headers if their size is larger than the value of the maxlength option, so the headings.truncate option can be removed.

HeadingsFilter does not find nested nodes

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1726:

Attachment: NUTCH-1726-trunk-v2.patch

add a test case to check HeadingsFilter patch. :)

HeadingsFilter does not find nested nodes

Key: NUTCH-1726
URL: https://issues.apache.org/jira/browse/NUTCH-1726
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch

Filter won't find:
{code}
<h1><span>apache nutch</span></h1>
{code}
The getNodeValue() tries to read data from children but should traverse nodes instead.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13861502#comment-13861502 ] lufeng commented on NUTCH-1691: --- like urlfilter-prefix plugin, we can move WARN code to maintain the code unity. :) DomainBlacklist url filter does not allow -D filter file override - Key: NUTCH-1691 URL: https://issues.apache.org/jira/browse/NUTCH-1691 Project: Nutch Issue Type: Bug Affects Versions: 1.7 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.8, 2.4 Attachments: NUTCH-1691-trunk.patch This filter does not accept -Durlfilter.domainblacklist.file= overrides. The plugin's file attribute is always used. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1667) Updatedb always ignore batchId
[ https://issues.apache.org/jira/browse/NUTCH-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13830525#comment-13830525 ] lufeng commented on NUTCH-1667: --- yes, u are right. +1 Updatedb always ignore batchId -- Key: NUTCH-1667 URL: https://issues.apache.org/jira/browse/NUTCH-1667 Project: Nutch Issue Type: Bug Affects Versions: 2.3 Reporter: Nguyen Manh Tien Priority: Minor Attachments: NUTCH-1556-batchId.patch batchId is not set in currentJob because we set batchId after creating currentJob, so in UpdateDbMapper batchId is null and will be assign to -all. I change to set batchId befor creating currentJob -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1671) indexchecker to add digest field
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830530#comment-13830530 ]

lufeng commented on NUTCH-1671:

Yes, this field can be used by indexing filters. +1

Another question: should we add a check after parsing the content, like this?
{code:java}
ParseResult parseResult = new ParseUtil(conf).parse(content);
if (parseResult == null) {
  LOG.error("Problem with parse - check log");
  return (-1);
}
{code}

indexchecker to add digest field

Key: NUTCH-1671
URL: https://issues.apache.org/jira/browse/NUTCH-1671
Project: Nutch
Issue Type: Bug
Affects Versions: 1.7, 2.2.1
Reporter: Sebastian Nagel
Priority: Trivial
Fix For: 2.3, 1.8
Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch

IndexingFiltersChecker does not add field digest as done by IndexerMapReduce. Digest/signature could be also used by indexing filters which then may fail.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (NUTCH-1670) set same crawldb directory in mergedb parameter
lufeng created NUTCH-1670:

Summary: set same crawldb directory in mergedb parameter
Key: NUTCH-1670
URL: https://issues.apache.org/jira/browse/NUTCH-1670
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
Fix For: 1.8

When merging two crawldbs with the same crawldb directory given twice in the bin/nutch mergedb parameters, it will throw a data not found exception.
bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
bin/nutch generate crawldb_t1 segment

--
This message was sent by Atlassian JIRA (v6.1#6144)
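One way to guard against this situation is to reject a merge whose output directory is also one of its inputs. The sketch below assumes that the first mergedb argument is the output crawldb and the remaining ones are inputs; the class and method names are hypothetical and this is not the attached patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical argument check, not part of Nutch.
final class MergeDbArgs {
    /** Returns true if the output directory is also listed as an input. */
    static boolean outputCollidesWithInput(String output, List<String> inputs) {
        // Normalize so "crawldb_t1" and "./crawldb_t1" compare equal.
        Path out = Paths.get(output).toAbsolutePath().normalize();
        for (String in : inputs) {
            if (Paths.get(in).toAbsolutePath().normalize().equals(out)) {
                return true;
            }
        }
        return false;
    }
}
```

For the command in the report, the output crawldb_t1 also appears among the inputs, so the check returns true and the tool could stop with a clear message instead of leaving a broken crawldb.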
[jira] [Updated] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1670:

Attachment: NUTCH-1670.patch

set same crawldb directory in mergedb parameter

Key: NUTCH-1670
URL: https://issues.apache.org/jira/browse/NUTCH-1670
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1670.patch

When merging two crawldbs with the same crawldb directory given twice in the bin/nutch mergedb parameters, it will throw a data not found exception.
bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
bin/nutch generate crawldb_t1 segment

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Work started] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-1670 started by lufeng.

set same crawldb directory in mergedb parameter

Key: NUTCH-1670
URL: https://issues.apache.org/jira/browse/NUTCH-1670
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 1.7
Reporter: lufeng
Assignee: lufeng
Priority: Minor
Fix For: 1.8
Attachments: NUTCH-1670.patch

When merging two crawldbs with the same crawldb directory given twice in the bin/nutch mergedb parameters, it will throw a data not found exception.
bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
bin/nutch generate crawldb_t1 segment

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13812840#comment-13812840 ] lufeng commented on NUTCH-1651: --- Hi Lewis yes, the patch is ok, and this a way to set ModifiedTime. +1 modifiedTime and prevmodifiedTime never set Key: NUTCH-1651 URL: https://issues.apache.org/jira/browse/NUTCH-1651 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1651.patch modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime is set only once in the beginning by zero-control of AdaptiveFetchScheduler. But this is not sufficient since modifiedTime needs to be updated whenever last modified time is available. We corrected this with a patch. Also we noticed that prevModifiedTime is not written to database and we corrected that too. With this patch, whenever lastModifiedTime is available, we do two things. First we set modifiedTime in the Page object to prevModifiedTime. After that we set lastModifiedTime to modifiedTime. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13809081#comment-13809081 ] lufeng commented on NUTCH-1651: --- Hi Talat yes, u are right, lastModified is a fetch parameter, but this can also be set by parser plugins, because this attribute can also defined by parsers. it's a attribute of WebPage. I don't find any code in Nutch 2.x to set the ModifiedTime in WebPage, also not find in Nutch1.x. very strange. modifiedTime and prevmodifiedTime never set Key: NUTCH-1651 URL: https://issues.apache.org/jira/browse/NUTCH-1651 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1651.patch modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime is set only once in the beginning by zero-control of AdaptiveFetchScheduler. But this is not sufficient since modifiedTime needs to be updated whenever last modified time is available. We corrected this with a patch. Also we noticed that prevModifiedTime is not written to database and we corrected that too. With this patch, whenever lastModifiedTime is available, we do two things. First we set modifiedTime in the Page object to prevModifiedTime. After that we set lastModifiedTime to modifiedTime. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808045#comment-13808045 ]

lufeng commented on NUTCH-1651:

Hi Talat,

But I think getting the last-modified time from the header is not appropriate here. A user may want to detect modification of an HTML page in a parser plugin from the content at that URL rather than from the metadata in the HTTP headers, even when the value of Last-Modified in the headers has changed.
{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if ( lastModified != null ){
+  try {
+    modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+    prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}
Maybe the appropriate way is to let a user-defined parser plugin set the modified time, rather than doing it in the DbUpdateReducer class.

modifiedTime and prevmodifiedTime never set

Key: NUTCH-1651
URL: https://issues.apache.org/jira/browse/NUTCH-1651
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Fix For: 2.3
Attachments: NUTCH-1651.patch

modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime is set only once in the beginning by zero-control of AdaptiveFetchScheduler. But this is not sufficient since modifiedTime needs to be updated whenever last modified time is available. We corrected this with a patch. Also we noticed that prevModifiedTime is not written to database and we corrected that too. With this patch, whenever lastModifiedTime is available, we do two things. First we set modifiedTime in the Page object to prevModifiedTime. After that we set lastModifiedTime to modifiedTime.

--
This message was sent by Atlassian JIRA (v6.1#6144)
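The two steps from the issue description (move the current modifiedTime to prevModifiedTime, then store the new Last-Modified value) can be sketched in plain Java. This stand-in uses java.time instead of Nutch's HttpDateFormat and a hypothetical holder class instead of WebPage, so it only illustrates the ordering of the two assignments.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical stand-in for the two WebPage fields discussed in the issue.
final class ModifiedTimes {
    long modifiedTime;      // epoch millis, 0 when unknown
    long prevModifiedTime;  // epoch millis, 0 when unknown

    /** Applies a Last-Modified header value, if it is present and parseable. */
    void update(String lastModifiedHeader) {
        if (lastModifiedHeader == null) {
            return;
        }
        try {
            long parsed = ZonedDateTime
                .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant().toEpochMilli();
            // First keep the old value, then store the new one.
            prevModifiedTime = modifiedTime;
            modifiedTime = parsed;
        } catch (DateTimeParseException e) {
            // Unparseable header: leave both fields unchanged.
        }
    }
}
```

Parsing before assigning means a malformed header cannot leave the two fields half-updated.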
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1645: -- Attachment: NUTCH-1645-v3.patch 1. add an implementation of reaches a lower number of misses would cause the test to fail 2. improve the code style yes, you are right, this unit test only check for the equality of some key statistics as you said. But the problem is how to write test case to verify the correctness of some algorithms in Nutch like AdaptiveFetchSchedule class and find the bug that you pointed in (NUTCH-1564)? Could you give me some suggestions. and I will check the NUTCH-1564 and hope to find a solution to this issue. Thanks Sebastian Junit Test Case for Adaptive Fetch Schedule class - Key: NUTCH-1645 URL: https://issues.apache.org/jira/browse/NUTCH-1645 Project: Nutch Issue Type: Test Affects Versions: 2.2.1 Reporter: Talat UYARER Priority: Minor Fix For: 2.3 Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch, NUTCH-1645-v3.patch Currently there is not Test Case for Adaptive Fetch Schedule. Junit test Writes for its. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1645: -- Attachment: NUTCH-1645-v2.patch add two test case, one is use default parameters and another without open sync delta. thanks Yasin, you can add another test case with some parameter change. Junit Test Case for Adaptive Fetch Schedule class - Key: NUTCH-1645 URL: https://issues.apache.org/jira/browse/NUTCH-1645 Project: Nutch Issue Type: Test Affects Versions: 2.2.1 Reporter: Talat UYARER Priority: Minor Fix For: 2.3 Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch Currently there is not Test Case for Adaptive Fetch Schedule. Junit test Writes for its. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1650) Adaptive Fetch Scheduler interval Wrong Set
[ https://issues.apache.org/jira/browse/NUTCH-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13787664#comment-13787664 ] lufeng commented on NUTCH-1650: --- yes , this code in Nutch 1.x is correct. +1 Adaptive Fetch Scheduler interval Wrong Set --- Key: NUTCH-1650 URL: https://issues.apache.org/jira/browse/NUTCH-1650 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Priority: Minor Labels: scheduler Fix For: 2.3 Attachments: NUTCH-1650.patch After calculation interval time when setting it didn't check between max and min values. Moreover if sync_delta is true. Interval set before changes. This patch fix this. -- This message was sent by Atlassian JIRA (v6.1#6144)
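The missing bounds check described in the issue amounts to clamping the computed interval between the configured minimum and maximum. The sketch below uses hypothetical names and is not the code from the patch or from AdaptiveFetchSchedule.

```java
// Hypothetical helper illustrating the min/max check.
final class IntervalClamp {
    /** Keeps a computed fetch interval within [minInterval, maxInterval]. */
    static int clamp(int interval, int minInterval, int maxInterval) {
        return Math.max(minInterval, Math.min(maxInterval, interval));
    }
}
```

Applying the clamp right after the interval is recalculated, and before the value is written back, keeps later steps from seeing an out-of-range interval.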
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410 ] lufeng commented on NUTCH-1556: --- Oh, I'm so sorry. I have already fixed this problem: committed revision 1522566 in 2.x HEAD. Thanks Julien. enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about whether and how it might affect the sorting in the reduce part. Anyway, check it out. It also changes the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
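The usage line quoted in NUTCH-1556, DbUpdaterJob (batchId | -all) [-crawlId id], could be parsed roughly as below. This is a sketch under assumptions rather than the code from the patch: the class UpdateDbArgs and its fields are hypothetical, and representing -all as a null batchId is a choice made only for this example.

```java
/** Hypothetical parser for: DbUpdaterJob (batchId | -all) [-crawlId id] */
final class UpdateDbArgs {
    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    final String batchId; // null stands for "-all" in this sketch
    final String crawlId; // null when -crawlId is not given

    private UpdateDbArgs(String batchId, String crawlId) {
        this.batchId = batchId;
        this.crawlId = crawlId;
    }

    static UpdateDbArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(USAGE);
        }
        // The first argument is either a concrete batchId or the -all switch.
        String batchId = "-all".equals(args[0]) ? null : args[0];
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i]; // consume the value that follows the flag
            } else {
                throw new IllegalArgumentException("Unexpected argument: " + args[i] + ". " + USAGE);
            }
        }
        return new UpdateDbArgs(batchId, crawlId);
    }
}
```

A job built on this would skip every row whose batch mark differs from batchId, and process all rows when batchId is null.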
[jira] [Commented] (NUTCH-1636) Indexer to normalize and filter repr URL
[ https://issues.apache.org/jira/browse/NUTCH-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13761888#comment-13761888 ] lufeng commented on NUTCH-1636: --- yes, this patch can solve the issue reported by lain. +1 Indexer to normalize and filter repr URL Key: NUTCH-1636 URL: https://issues.apache.org/jira/browse/NUTCH-1636 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 1.7 Reporter: Sebastian Nagel Priority: Minor Fix For: 1.8 Attachments: NUTCH-1636-1.patch Indexer if used with option -normalize and/or -filter (cf. NUTCH-1300) should also normalize and filter representation URLs. Otherwise URLs which are target of a redirect, and have repr URL set (see URLUtil.chooseRepr) may show up in index with an undesirable URL. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13759123#comment-13759123 ] lufeng commented on NUTCH-1556: --- Committed revision 1520332 in 2.x HEAD Thanks kaveh. enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch So the idea here is to be able to run updatedb and fetch for different batchId simultaneously. I put together a patch. it seems to be working ( it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. anyway check it out. it also change the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1556. --- Resolution: Fixed enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch So the idea here is to be able to run updatedb and fetch for different batchId simultaneously. I put together a patch. it seems to be working ( it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. anyway check it out. it also change the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13756080#comment-13756080 ] lufeng commented on NUTCH-1556: --- I will commit this unless there are objections enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch So the idea here is to be able to run updatedb and fetch for different batchId simultaneously. I put together a patch. it seems to be working ( it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. anyway check it out. it also change the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13752432#comment-13752432 ] lufeng commented on NUTCH-1556: --- thanks kaveh. +1 enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch So the idea here is to be able to run updatedb and fetch for different batchId simultaneously. I put together a patch. it seems to be working ( it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. anyway check it out. it also change the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1556: -- Attachment: NUTCH-1556-v2.patch new patch merged with issue 1632 enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch So the idea here is to be able to run updatedb and fetch for different batchId simultaneously. I put together a patch. it seems to be working ( it does skip the rows that do not match the batchId), but I am worried if and how it might affect the sorting in the reduce part. anyway check it out. it also change the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1632) add batchId argument for DbUpdaterJob
lufeng created NUTCH-1632: - Summary: add batchId argument for DbUpdaterJob Key: NUTCH-1632 URL: https://issues.apache.org/jira/browse/NUTCH-1632 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 2.2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.
[jira] [Updated] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1632: -- Attachment: NUTCH-1632.patch add batchId argument for DbUpdaterJob - Key: NUTCH-1632 URL: https://issues.apache.org/jira/browse/NUTCH-1632 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 2.2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1632.patch add batchId argument for DbUpdaterJob, you can put the batchId to DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1632. - Resolution: Duplicate add batchId argument for DbUpdaterJob - Key: NUTCH-1632 URL: https://issues.apache.org/jira/browse/NUTCH-1632 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 2.2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1632.patch add batchId argument for DbUpdaterJob, you can put the batchId to DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803 ] lufeng commented on NUTCH-1556: --- Hi Lewis, I'm sorry, I created a duplicate issue. I will merge the two patches into one; could you check it out? Thanks. enabling updatedb to accept batchId Key: NUTCH-1556 URL: https://issues.apache.org/jira/browse/NUTCH-1556 Project: Nutch Issue Type: Improvement Affects Versions: 2.2 Reporter: kaveh minooie Fix For: 2.3 Attachments: NUTCH-1556.patch So the idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about whether and how it might affect the sorting in the reduce part. Anyway, check it out. It also changes the command line usage to this: Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750804#comment-13750804 ] lufeng commented on NUTCH-1632: --- Hi kaveh, I'm sorry. I will close this issue and merge the two patches into one. Thanks. add batchId argument for DbUpdaterJob - Key: NUTCH-1632 URL: https://issues.apache.org/jira/browse/NUTCH-1632 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 2.2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1632.patch Add a batchId argument to DbUpdaterJob so that a batchId can be passed to it.
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749663#comment-13749663 ] lufeng commented on NUTCH-1619: --- Hi Julien, I have already fixed the compilation bug and will pay closer attention next time. Thanks for the reminder. Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch We need the Dmoz information of fetched URLs to be written to the database, so that this information can be used as a snippet by the indexer of the search engine we are working on.
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749409#comment-13749409 ] lufeng commented on NUTCH-1619: --- Committed @revision 1517147 in 2.x HEAD Thank you very much Talat for the patch. Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch We need Dmoz information of fetched URLs can be written to database. So these information can be used like snipppet by indexer of the search engine we are working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1619. --- Resolution: Fixed Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch We need Dmoz information of fetched URLs can be written to database. So these information can be used like snipppet by indexer of the search engine we are working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749419#comment-13749419 ] lufeng commented on NUTCH-1619: --- I'm so sorry, DataStore may not throw IOException. It has already been fixed. Committed @revision 1517155 in 2.x HEAD Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch We need Dmoz information of fetched URLs can be written to database. So these information can be used like snipppet by indexer of the search engine we are working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1631) Display Document Count Added To Solr Server
[ https://issues.apache.org/jira/browse/NUTCH-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748595#comment-13748595 ] lufeng commented on NUTCH-1631: --- Good statistical methods. +1 Display Document Count Added To Solr Server --- Key: NUTCH-1631 URL: https://issues.apache.org/jira/browse/NUTCH-1631 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1, 2.2, 2.2.1 Reporter: Furkan KAMACI Priority: Minor Fix For: 2.3 Attachments: NUTCH-1631.patch Currently you can not see how many documents are added to Solr Server from Nutch. One should be able to see how many documents are added to Solr Server simultaneously (as a hadoop counter) and also total document count should be logged too after all documents are added to Solr Server. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747558#comment-13747558 ] lufeng commented on NUTCH-1619: --- Thanks Talat. +1 for commit. Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch We need Dmoz information of fetched URLs can be written to database. So these information can be used like snipppet by indexer of the search engine we are working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743621#comment-13743621 ] lufeng commented on NUTCH-1619: --- Hi Yasin, did you forget to close the data store? Otherwise good. Writes Dmoz Description and Title information to db with snippet argument - Key: NUTCH-1619 URL: https://issues.apache.org/jira/browse/NUTCH-1619 Project: Nutch Issue Type: Improvement Affects Versions: 2.1 Reporter: Yasin Kılınç Priority: Minor Fix For: 2.3 Attachments: NUTCH-DMOZ-Snippet.patch We need the Dmoz information of fetched URLs to be written to the database, so that this information can be used as a snippet by the indexer of the search engine we are working on.
[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.
[ https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739731#comment-13739731 ] lufeng commented on NUTCH-1294: --- Hi Lewis. My pleasure. But what should I do for README.txt? Do you mean I should also change something in https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt ? :) IndexClean job with solr implementation. Key: NUTCH-1294 URL: https://issues.apache.org/jira/browse/NUTCH-1294 Project: Nutch Issue Type: Improvement Affects Versions: nutchgora Reporter: Dan Rosher Priority: Minor Fix For: 2.3 Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, NUTCH-1294-v3.patch I started by copying/altering the trunk version of SolrClean, though it was inadequate for our needs. We needed to mark particular pages as gone even though they still might be visible on the web. This implementation abstracts the index cleaning process, has a Solr implementation, and adds a clean index plugin extension that allows others to tailor how pages might be removed from their store.
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714701#comment-13714701 ] lufeng commented on NUTCH-1613: --- OK. Will this cookie affect other URLs that do not need any cookie, so that they receive a Bad Request error when using httpclient? It does not seem very general, so could we add a filter that specifies a different cookie for each host?
Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols Key: NUTCH-1613 URL: https://issues.apache.org/jira/browse/NUTCH-1613 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 2.2.1 Reporter: Brian Priority: Minor Labels: patch Fix For: 2.3 Attachments: NUTCH-1613.patch
1.) When using protocol-httpclient to crawl a single website (the same host) I would always get a bunch of timeout errors during fetching and the pages with errors would not be fetched. E.g.:
2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www (queue crawl delay=0ms)
2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
This is because by default the connection pool manager only allows 2 connections per host, so if more than 2 threads are used the others will tend to time out waiting to get a connection. The code previously set max connections correctly but not connections per host.
2.) I also added at the same time simple modifications to both protocol-http and protocol-httpclient to allow specifying a cookie string in the conf file to include in request headers. I use this to crawl site content requiring authentication - it is better for me to specify the cookie string for the authentication than to go through the whole authentication process and specify login info. The nutch-site.xml property is the following:
<property>
  <name>http.cookie_string</name>
  <value>XX_AL=authorization_value_goes_here</value>
  <description>String to use as the cookie value for HTTP requests</description>
</property>
Although I use it for authentication it can be used to specify any single cookie string for the crawl (httpclient does support different cookies for different hosts but I did not get into that).
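The timeout mechanism described in point 1 of NUTCH-1613 can be modelled with a small toy example. This is not commons-httpclient code and not the patch: it is a sketch that assumes a per-host limit behaves like a counting semaphore, and the class PerHostLimit and its methods are hypothetical names used only here.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model of a per-host connection limit; hypothetical, for illustration only. */
final class PerHostLimit {
    private final Semaphore slots;

    PerHostLimit(int maxConnectionsPerHost) {
        this.slots = new Semaphore(maxConnectionsPerHost);
    }

    /** Tries to take a connection slot; returns false if the wait timed out. */
    boolean tryConnect(long timeoutMillis) {
        try {
            return slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    /** Gives a slot back once the request has finished. */
    void release() {
        slots.release();
    }
}
```

With a limit of 2, two callers obtain a slot and a third call to tryConnect returns false once its timeout expires, which corresponds to the "Timeout waiting for connection" errors in the log. Raising the per-host limit to the number of fetcher threads removes that wait in this model.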
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711150#comment-13711150 ] lufeng commented on NUTCH-1613: --- Will this specified cookie string affect all crawled URLs?
Timeouts in protocol-httpclient when crawling same host with 2 threads and added cookie strings for both http protocols Key: NUTCH-1613 URL: https://issues.apache.org/jira/browse/NUTCH-1613 Project: Nutch Issue Type: Bug Components: protocol Affects Versions: 2.2.1 Reporter: Brian Priority: Minor Labels: patch Attachments: NUTCH-1613.patch
1.) When using protocol-httpclient to crawl a single website (the same host) I would always get a bunch of timeout errors during fetching and the pages with errors would not be fetched. E.g.:
2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www (queue crawl delay=0ms)
2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
    at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
This is because by default the connection pool manager only allows 2 connections per host, so if more than 2 threads are used the others will tend to time out waiting to get a connection. The code previously set max connections correctly but not connections per host.
2.) I also added at the same time simple modifications to both protocol-http and protocol-httpclient to allow specifying a cookie string in the conf file to include in request headers. I use this to crawl site content requiring authentication - it is better for me to specify the cookie string for the authentication than to go through the whole authentication process and specify login info. The nutch-site.xml property is the following:
<property>
  <name>http.cookie_string</name>
  <value>XX_AL=authorization_value_goes_here</value>
  <description>String to use as the cookie value for HTTP requests</description>
</property>
Although I use it for authentication it can be used to specify any single cookie string for the crawl (httpclient does support different cookies for different hosts but I did not get into that).
[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700082#comment-13700082 ] lufeng commented on NUTCH-1602: --- Hi Markus, this change only applies to the *normal* output format, not to the CSV output format; there are two different crawl datum output formats. The normal output now looks like this, which is better than the previous one.
{code:xml}
http://www.baidu.com/ Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata:
    m1=v22
    m3=v3
    m2=v2
    m5=v5
    m4=m4
    _pst_=robots_denied(18), lastModified=0
    m6=v6
{code}
Thanks Julien and Tejas.
improve the readability of metadata in readdb dump normal -- Key: NUTCH-1602 URL: https://issues.apache.org/jira/browse/NUTCH-1602 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 Attachments: NUTCH-1602.patch
The dumped metadata format is not readable.
{code:xml}
$bin/nutch readdb crawldb/ -dump dir
http://www.baidu.com/ Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), lastModified=0m6: v6
{code}
So I improved the Metadata format to this:
{code:xml}
Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), lastModified=0;m6=v6;
{code}
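The one-entry-per-line metadata layout discussed in NUTCH-1602 can be produced along these lines. This is a sketch under assumptions, not the committed change: MetadataFormat is a hypothetical class, the metadata is modelled as a plain Map of strings, and the exact indentation is a guess because the whitespace in the quoted dump was lost.

```java
import java.util.Map;

/** Hypothetical formatter for illustration; not the actual CrawlDatum code. */
final class MetadataFormat {
    private MetadataFormat() {}

    /** Renders each entry as an indented key=value line under a "Metadata:" header. */
    static String format(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("Metadata:\n");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            sb.append("  ")
              .append(entry.getKey())
              .append('=')
              .append(entry.getValue())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Because every entry ends with a newline, a value that itself contains a comma and a space, such as the _pst_ value "robots_denied(18), lastModified=0", stays unambiguous, which the old concatenated form did not guarantee. Passing a LinkedHashMap keeps the entries in insertion order.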
[jira] [Resolved] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1602. --- Resolution: Fixed
improve the readability of metadata in readdb dump normal -- Key: NUTCH-1602 URL: https://issues.apache.org/jira/browse/NUTCH-1602 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch
the dumped metadata format is not readable.
{code:xml}
$bin/nutch readdb crawldb/ -dump dir
http://www.baidu.com/ Version: 7
Status: 3 (db_gone)
Fetch time: Sat Aug 17 22:35:37 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 1.0
Signature: null
Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), lastModified=0m6: v6
{code}
so I improve the Metadata format to this
{code:xml}
Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), lastModified=0;m6=v6;
{code}
[jira] [Commented] (NUTCH-1600) Injector overwrite does not always work properly
[ https://issues.apache.org/jira/browse/NUTCH-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699034#comment-13699034 ] lufeng commented on NUTCH-1600: --- The test works fine. +1
Injector overwrite does not always work properly Key: NUTCH-1600 URL: https://issues.apache.org/jira/browse/NUTCH-1600 Project: Nutch Issue Type: Bug Components: injector Affects Versions: 1.7 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.8 Attachments: NUTCH-1600-1.8.patch
db.injector.update works as it should but db.injector.overwrite doesn't always seem to properly overwrite the record. This issue has existed for some time and we've already fixed it in our dist of Nutch. This record has just been updated (interval).
{code}
Injector: starting at 2013-07-03 10:34:15
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seeds
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 9
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-03 10:34:21, elapsed: 00:00:05
URL: url
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jul 05 12:11:44 CEST 2013
Modified time: Fri Jun 28 12:11:44 CEST 2013
Retries since fetch: 0
Retry interval: 604800 seconds (7 days)
Score: 0.0
Signature: ba29ef3e680323a6d0da74c156800e03
Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
{code}
If we now overwrite the record, nothing happens. With this patch installed it overwrites the record as it should and also logs the update and overwrite switches to the console:
{code}
Injector: starting at 2013-07-03 10:36:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: seeds
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 9
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2013-07-03 10:36:36, elapsed: 00:00:05
URL: url
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jul 03 10:36:30 CEST 2013
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 14000 seconds (0 days)
Score: 1.0
Signature: null
Metadata: fixedInterval: 14000.0
{code}
[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1602: -- Attachment: NUTCH-1602.patch improve the readability of metadata in readdb dump normal -- Key: NUTCH-1602 URL: https://issues.apache.org/jira/browse/NUTCH-1602 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 Attachments: NUTCH-1602.patch The dumped metadata format is not readable: {code:xml} $ bin/nutch readdb crawldb/ -dump dir http://www.baidu.com/ Version: 7 Status: 3 (db_gone) Fetch time: Sat Aug 17 22:35:37 CST 2013 Modified time: Thu Jan 01 08:00:00 CST 1970 Retries since fetch: 0 Retry interval: 3888000 seconds (45 days) Score: 1.0 Signature: null Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), lastModified=0m6: v6 {code} So I changed the metadata format to this: {code:xml} Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), lastModified=0;m6=v6; {code}
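The format change proposed in NUTCH-1602 can be illustrated with a minimal sketch. This is not the attached patch: the class name `MetadataDumpSketch`, the method `format`, and the use of a plain `Map<String, String>` are assumptions made only for illustration. It shows each entry written as `key=value;`, so neighbouring entries no longer run together.

```java
import java.util.Map;

public class MetadataDumpSketch {
    // Hypothetical helper: renders every metadata entry as "key=value;"
    // so that one entry can be told apart from the next in the dump.
    static String format(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("Metadata: ");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            sb.append(entry.getKey())
              .append('=')
              .append(entry.getValue())
              .append(';');
        }
        return sb.toString();
    }
}
```

With an insertion-ordered map holding `m1 -> v22` and `m3 -> v3`, the helper produces a line in the same shape as the improved dump quoted in the issue.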
[jira] [Created] (NUTCH-1602) improve the readability of metadata in readdb dump normal
lufeng created NUTCH-1602: - Summary: improve the readability of metadata in readdb dump normal Key: NUTCH-1602 URL: https://issues.apache.org/jira/browse/NUTCH-1602 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 The dumped metadata format is not readable: {code:xml} $ bin/nutch readdb crawldb/ -dump dir http://www.baidu.com/ Version: 7 Status: 3 (db_gone) Fetch time: Sat Aug 17 22:35:37 CST 2013 Modified time: Thu Jan 01 08:00:00 CST 1970 Retries since fetch: 0 Retry interval: 3888000 seconds (45 days) Score: 1.0 Signature: null Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), lastModified=0m6: v6 {code} So I changed the metadata format to this: {code:xml} Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), lastModified=0;m6=v6; {code}
[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798 ] lufeng commented on NUTCH-1594: --- Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis. count variable is never changed in ParseUtil class -- Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1594.patch In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
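The effect described in NUTCH-1594 can be reproduced with a small sketch. The class and method names below are hypothetical and the loop body is reduced to a plain copy; it is not the ParseUtil code itself, and the loop condition is written on the assumption that the comparison operators were lost when the mail was archived. It only shows why a counter that is tested but never incremented makes the cap ineffective.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {
    // Shape of the reported bug: count is checked in the loop condition
    // but never incremented, so maxOutlinks does not limit anything
    // (for any positive maxOutlinks).
    static List<String> collectBuggy(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            kept.add(outlinks[i]);
        }
        return kept;
    }

    // Assumed shape of a fix: increment count for every outlink kept.
    static List<String> collectFixed(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            kept.add(outlinks[i]);
            count++;
        }
        return kept;
    }
}
```

With five outlinks and a limit of two, the first variant keeps all five while the second keeps two, which matches the report that db.max.outlinks.per.page had no effect.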
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854 ] lufeng commented on NUTCH-1327: --- Hi Markus, I tested your patch. Did you forget to add the deploy and test targets to src/plugin/build.xml? +1 QueryStringNormalizer - Key: NUTCH-1327 URL: https://issues.apache.org/jira/browse/NUTCH-1327 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.9 Attachments: NUTCH-1327-1.8-1.patch A normalizer for dealing with query strings. Sorting query strings is helpful in preventing duplicates for some (bad) websites.
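To illustrate why sorting query strings helps against duplicates, here is a minimal sketch of the idea behind NUTCH-1327. It is not the plugin from the attached patch: the class `QuerySortSketch` and the method `sortQuery` are hypothetical, and the sketch ignores details such as percent-encoding. It only shows that two URLs differing in parameter order normalize to the same string.

```java
import java.util.Arrays;

public class QuerySortSketch {
    // Hypothetical normalizer: sorts the "&"-separated parameters of the
    // query string and leaves the rest of the URL (and any fragment) as is.
    static String sortQuery(String url) {
        int hash = url.indexOf('#');
        String fragment = hash >= 0 ? url.substring(hash) : "";
        String rest = hash >= 0 ? url.substring(0, hash) : url;

        int q = rest.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to sort
        }
        String[] params = rest.substring(q + 1).split("&");
        Arrays.sort(params);
        return rest.substring(0, q + 1) + String.join("&", params) + fragment;
    }
}
```

Under this sketch, `http://example.com/p?b=2&a=1` and `http://example.com/p?a=1&b=2` both become `http://example.com/p?a=1&b=2`, so a crawler would treat them as one URL.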
[jira] [Created] (NUTCH-1594) count variable is never in ParseUtil
lufeng created NUTCH-1594: - Summary: count variable is never in ParseUtil Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Priority: Minor Fix For: 2.3 In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Description: In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect. was: In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) Summary: count variable is never changed in ParseUtil class (was: count variable is never in ParseUtil ) count variable is never changed in ParseUtil class -- Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Priority: Minor Fix For: 2.3 In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Patch Info: Patch Available count variable is never changed in ParseUtil class -- Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1594.patch In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Attachment: NUTCH-1594.patch count variable is never changed in ParseUtil class -- Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1594.patch In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
[jira] [Assigned] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng reassigned NUTCH-1594: - Assignee: lufeng count variable is never changed in ParseUtil class -- Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1594.patch In the ParseUtil class the count variable is never changed. The code looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) So even if you define the db.max.outlinks.per.page parameter, it will not take effect.
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686830#comment-13686830 ] lufeng commented on NUTCH-1527: --- Thanks Markus, I tried the patch and could index the document successfully. +1 for commit. Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: Markus Jelsma Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685661#comment-13685661 ] lufeng commented on NUTCH-1527: --- Hi Markus, I have already tested the newest patch on my laptop. Very cool. +1 for commit. {code:xml} lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ bin/nutch index crawldb/ segmetns/20130617225826/ Indexer: starting at 2013-06-17 23:46:47 Indexer: deleting gone documents: false Indexer: URL filtering: false Indexer: URL normalizing: false Active IndexWriters : ElasticIndexWriter elastic.cluster : elastic prefix cluster elastic.index : elastic index command elastic.max.bulk.docs : elastic bulk index doc counts. (default 500) elastic.max.bulk.size : elastic bulk index length. (default 5001001 ~5MB) Processing remaining requests [docs = 1, length = 7528, total docs = 1] Processing to finalize last execute Previous took in ms 27, including wait 21 Indexer: finished at 2013-06-17 23:46:57, elapsed: 00:00:10 {code} One question, though: should we add the elastic.cluster and elastic.index properties to the nutch-default.xml file? Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: Markus Jelsma Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682380#comment-13682380 ] lufeng commented on NUTCH-1527: --- Hi Markus 1. Elasticsearch loads its configuration file first, so you need to add config/elasticsearch.yml to your runtime/local/config. But I can't find any method to load the configuration file from the configuration. 2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? Or did you add this {code:xml} + <dependency org="org.elasticsearch" name="elasticsearch" rev="0.90.1" + conf="*->default"/> {code} code to the ivy/ivy.xml file? Maybe Elasticsearch cannot load classes in the Nutch plugin system. Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: Markus Jelsma Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Closed] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1575. - support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1575.patch Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1545: -- Fix Version/s: (was: 2.3) 2.2 capture batchId and remove references to segments in 2.x crawl script. -- Key: NUTCH-1545 URL: https://issues.apache.org/jira/browse/NUTCH-1545 Project: Nutch Issue Type: Task Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch The concept of segment is replaced by batchId in 2.x. I'm currently getting rid of segment references in 2.x. This issue was flagged up and is separate from NUTCH-1532, which I am working on.
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670376#comment-13670376 ] lufeng commented on NUTCH-1545: --- Committed for Nutch 2.2, revision 1487875, by Feng. Thanks Tejas and Lewis. capture batchId and remove references to segments in 2.x crawl script. -- Key: NUTCH-1545 URL: https://issues.apache.org/jira/browse/NUTCH-1545 Project: Nutch Issue Type: Task Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.3 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch The concept of segment is replaced by batchId in 2.x. I'm currently getting rid of segment references in 2.x. This issue was flagged up and is separate from NUTCH-1532, which I am working on.
[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1545. --- Resolution: Fixed capture batchId and remove references to segments in 2.x crawl script. -- Key: NUTCH-1545 URL: https://issues.apache.org/jira/browse/NUTCH-1545 Project: Nutch Issue Type: Task Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch The concept of segment is replaced by batchId in 2.x. I'm currently getting rid of segment references in 2.x. This issue was flagged up and is separate from NUTCH-1532, which I am working on.
[jira] [Resolved] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1563. --- Resolution: Fixed FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to get some fields of WebPage, it always returns null.
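The problem in NUTCH-1563 can be pictured with a simplified sketch. Everything here is hypothetical: the nested `Schedule` interface stands in for FetchSchedule, field names are plain strings rather than WebPage fields, and `fieldsToLoad` is not a method of the real generator job. The point is only that the job has to merge the schedule's requested fields into the set it loads; otherwise a custom schedule reads fields that were never fetched from the store.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class ScheduleFieldsSketch {
    // Stand-in for a fetch schedule that declares which page fields it needs.
    interface Schedule {
        Collection<String> getFields();
    }

    // Assumed shape of the fix: the job's own fields plus whatever the
    // configured schedule asks for.
    static Set<String> fieldsToLoad(Collection<String> jobFields, Schedule schedule) {
        Set<String> fields = new HashSet<>(jobFields);
        Collection<String> extra = schedule.getFields();
        if (extra != null) {
            fields.addAll(extra);
        }
        return fields;
    }
}
```

If the job ignored `schedule.getFields()`, a schedule asking for, say, a `modifiedTime` field would find it unset on every page, which is the behaviour the issue describes.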
[jira] [Closed] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1563. - FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to get some fields of WebPage, it always returns null.
[jira] [Resolved] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1575. --- Resolution: Fixed support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1575.patch Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669351#comment-13669351 ] lufeng commented on NUTCH-1575: --- Committed for 2.2, revision 1487521, by Feng. Thanks Lewis. support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1575.patch Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667766#comment-13667766 ] lufeng commented on NUTCH-1527: --- Hi Luca, sorry for my delayed reply. Yes, you can improve this patch following your suggestion. Can I assign this issue to you? I am willing to test it. Thanks. Luca. -- Don't Grow Old, Grow Up... :-) Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1527: -- Assignee: (was: lufeng) Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667775#comment-13667775 ] lufeng commented on NUTCH-1527: --- Hi Luca, now you can click "Assign to me" and then attach your improvement patch. Thanks, Luca. Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Priority: Minor Fix For: 2.4 Attachments: NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be inline with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1563: -- Fix Version/s: (was: 2.3) 2.2 FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to get some fields of WebPage, it always returns null.
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665161#comment-13665161 ] lufeng commented on NUTCH-1563: --- Hi Tejas, yes, I pushed this patch to 2.x: https://svn.apache.org/repos/asf/nutch/branches/2.x Do you mean that I pushed it to the wrong place? FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to get some fields of WebPage, it always returns null.
[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x
lufeng created NUTCH-1575: - Summary: support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1575 started by lufeng. support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1575: -- Attachment: NUTCH-1575.patch add solr authentication support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1575.patch Support Solr authentication in Nutch 2.x, as in 1.x.
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662003#comment-13662003 ] lufeng commented on NUTCH-1563: --- Committed for 2.2, revision 1484482, by Feng. Thanks Canan and Lewis. FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to get some fields of WebPage, it always returns null.
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662057#comment-13662057 ] lufeng commented on NUTCH-1545: --- Hi Tejas, yes, the patch just moves the random batchId generator from the code to the crawl script. Users don't have to bother with this. capture batchId and remove references to segments in 2.x crawl script. -- Key: NUTCH-1545 URL: https://issues.apache.org/jira/browse/NUTCH-1545 Project: Nutch Issue Type: Task Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch The concept of segment is replaced by batchId in 2.x. I'm currently getting rid of segment references in 2.x. This issue was flagged up and is separate from NUTCH-1532, which I am working on.
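As a rough illustration of the kind of random batchId generator the comment on NUTCH-1545 refers to, the sketch below builds an id from a timestamp and a random number. The `seconds-random` format, the class `BatchIdSketch`, and the method `newBatchId` are assumptions for illustration only; the thread does not show the actual generator or the script line that replaced it.

```java
import java.util.Random;

public class BatchIdSketch {
    // Hypothetical generator: "<epoch seconds>-<non-negative random int>".
    // A caller would create the id once and pass the same value to every
    // step of one crawl cycle.
    static String newBatchId(long epochSeconds, Random rng) {
        return epochSeconds + "-" + rng.nextInt(Integer.MAX_VALUE);
    }
}
```

Generating the id in one place and handing it to each step is what lets the steps of one cycle agree on which batch they operate on, without the user choosing an id by hand.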
[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1
[ https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651936#comment-13651936 ] lufeng commented on NUTCH-1486: --- Hi Lewis, the dependency version of solr-solrj in pom.xml is still 3.1.0. Should we upgrade it to 4.2.1? Upgrade to Solr 4.2.1 - Key: NUTCH-1486 URL: https://issues.apache.org/jira/browse/NUTCH-1486 Project: Nutch Issue Type: Bug Affects Versions: 1.6, 2.1 Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT Probably 2.2-SNAPSHOT Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: 1.7, 2.2 Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch When attempting to configure a 4 multicore 4.0 instance with Nutch schema-solr4.xml file, I get the following exceptions. This has been discussed previously. As I see it we have two options 1. Keep maintaining both schema options 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml Thoughts? 
{code} SEVERE: Unable to create core: collection4 org.apache.solr.common.SolrException: Unable to use updateLog: _version_field must exist in schema, using indexed=true stored=true and multiValued=false (_version_ does not exist) at org.apache.solr.core.SolrCore.init(SolrCore.java:721) at org.apache.solr.core.SolrCore.init(SolrCore.java:566) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107) at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36) at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183) at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491) at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138) at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142) at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53) at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604) at 
org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535) at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398) at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552) at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53) at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91) at org.eclipse.jetty.server.Server.doStart(Server.java:263) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215) at java.security.AccessController.doPrivileged(Native Method) at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Assigned] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng reassigned NUTCH-1527: - Assignee: lufeng Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.3, 1.8 The source repos for this can be found here [0]. This issue should be in line with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1527: -- Attachment: NUTCH-1527.patch Ported the elasticsearch indexer plugin to Nutch trunk. Before applying this patch, you need to apply the patch from https://issues.apache.org/jira/browse/NUTCH-1486 first, because I use the newest version of elasticsearch, 0.90.0, which uses Lucene 4.2.1. This patch needs more testing; I am a newbie to elasticsearch. Any comments on this patch are welcome. Thanks, Lewis. Port nutch-elasticsearch-indexer to Nutch - Key: NUTCH-1527 URL: https://issues.apache.org/jira/browse/NUTCH-1527 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.3, 1.8 Attachments: NUTCH-1527.patch The source repos for this can be found here [0]. This issue should be in line with the work already done by Julien and others over at NUTCH-1047. [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1555: -- Attachment: NUTCH-1555-v1.patch Lewis: 1. Fixed the fetch NPE bug. 2. Fixed the bug where update did not work. Should we move every tool to commons-cli? I found 47 files that would need to be moved. [~wastl-nagel]: 1. Used eclipse-codeformat.xml to format the source code. Thanks, Lewis and Sebastian. Move to commons-cli for command line parsing - Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Fix For: 2.2 Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641869#comment-13641869 ] lufeng edited comment on NUTCH-1555 at 4/25/13 2:48 PM: Lewis: 1. Fixed the fetch NPE bug. 2. Fixed the bug where update did not work. Should we move every tool to commons-cli? I found 47 files that would need to be moved. Sebastian: 1. Used eclipse-codeformat.xml to format the source code. Thanks, Lewis and Sebastian. was (Author: amuseme.lu): Lewis: 1. Fixed the fetch NPE bug. 2. Fixed the bug where update did not work. Should we move every tool to commons-cli? I found 47 files that would need to be moved. [~wastl-nagel]: 1. Used eclipse-codeformat.xml to format the source code. Thanks, Lewis and Sebastian. Move to commons-cli for command line parsing - Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Fix For: 2.2 Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639131#comment-13639131 ] lufeng edited comment on NUTCH-1555 at 4/23/13 2:58 PM: Already moved the command line parsing of the following files to commons-cli, because they are used by the bin/nutch command line.
{code:java}
src/java/org/apache/nutch/api/NutchServer.java
src/java/org/apache/nutch/crawl/DbUpdaterJob.java
src/java/org/apache/nutch/crawl/GeneratorJob.java
src/java/org/apache/nutch/crawl/InjectorJob.java
src/java/org/apache/nutch/crawl/WebTableReader.java
src/java/org/apache/nutch/fetcher/FetcherJob.java
src/java/org/apache/nutch/host/HostDbReader.java
src/java/org/apache/nutch/host/HostDbUpdateJob.java
src/java/org/apache/nutch/host/HostInjectorJob.java
src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java
src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
src/java/org/apache/nutch/parse/ParserChecker.java
src/java/org/apache/nutch/parse/ParserJob.java
src/java/org/apache/nutch/plugin/PluginRepository.java
{code}
was (Author: amuseme.lu): Already moved the command line parsing to commons-cli, because they are used by the bin/nutch command line.
{code:java}
src/java/org/apache/nutch/api/NutchServer.java
src/java/org/apache/nutch/crawl/DbUpdaterJob.java
src/java/org/apache/nutch/crawl/GeneratorJob.java
src/java/org/apache/nutch/crawl/InjectorJob.java
src/java/org/apache/nutch/crawl/WebTableReader.java
src/java/org/apache/nutch/fetcher/FetcherJob.java
src/java/org/apache/nutch/host/HostDbReader.java
src/java/org/apache/nutch/host/HostDbUpdateJob.java
src/java/org/apache/nutch/host/HostInjectorJob.java
src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java
src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
src/java/org/apache/nutch/parse/ParserChecker.java
src/java/org/apache/nutch/parse/ParserJob.java
src/java/org/apache/nutch/plugin/PluginRepository.java
{code}
Move to commons-cli for command line parsing - Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Fix For: 2.2 Attachments: NUTCH-1555.patch I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
[jira] [Commented] (NUTCH-1562) Order of execution for scoring filters
[ https://issues.apache.org/jira/browse/NUTCH-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637247#comment-13637247 ] lufeng commented on NUTCH-1562: --- Hi Julien, if someone defines scoring.filter.order with filters such as opic and depth, but those filters are not included in the plugin.includes property (perhaps by oversight), it will throw an exception like this:
{code:java}
java.lang.NullPointerException
  at org.apache.nutch.scoring.ScoringFilters.injectedScore(ScoringFilters.java:112)
  at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:164)
  at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:63)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-04-20 21:19:10,983 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
  at org.apache.nutch.crawl.Injector.run(Injector.java:318)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.Injector.main(Injector.java:308)
{code}
Should we handle this situation or not? Order of execution for scoring filters -- Key: NUTCH-1562 URL: https://issues.apache.org/jira/browse/NUTCH-1562 Project: Nutch Issue Type: Bug Components: documentation Affects Versions: 1.6, 2.1 Reporter: Julien Nioche Fix For: 1.7, 2.2 Attachments: NUTCH-1562-trunk.patch The documentation in nutch-default.xml states that:
{quote}
<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied. This may be left empty (in which case all available scoring filters will be applied in the order defined in plugin-includes and plugin-excludes), or a space separated list of implementation classes.
  </description>
</property>
{quote}
However, if no order is specified, the filters are ordered randomly and not in the order defined in plugin-includes. The other *order parameters (e.g. urlfilter.order) have different documentation and are loaded and applied in system defined order, which corresponds to what the code does. The attached patch is for 1.x and brings the code into line with the documentation by ordering the filters according to the order of the plugins, which gives users more control without having to specify the classes explicitly in scoring.filter.order. We could extend the same idea to the other *order params.
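The ordering behaviour discussed in NUTCH-1562, together with the missing-filter case raised in the comment, can be sketched as follows. This is a minimal illustration and not the code in ScoringFilters or in the attached patch: the class `ScoringFilterOrder`, the method `resolve`, and the use of a plain map from filter name to filter instance are all assumptions made for the example. It assumes the loaded filters are held in plugin order and that the configured order is a whitespace-separated list, as the documentation quoted above describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: resolves the effective filter order.
class ScoringFilterOrder {

    /**
     * Returns the filters in the configured order.
     *
     * @param loaded filters that were actually loaded, keyed by name; the map's
     *               iteration order is assumed to be the plugin order
     * @param order  whitespace-separated filter names, or null/empty for "no order"
     */
    public static <T> List<T> resolve(Map<String, T> loaded, String order) {
        if (order == null || order.trim().isEmpty()) {
            // No explicit order: fall back to the order in which plugins were loaded.
            return new ArrayList<>(loaded.values());
        }
        List<T> result = new ArrayList<>();
        for (String name : order.trim().split("\\s+")) {
            T filter = loaded.get(name);
            if (filter == null) {
                // Fail with a descriptive message instead of a later NullPointerException.
                throw new IllegalStateException("scoring.filter.order names '" + name
                        + "', which is not among the loaded scoring filters;"
                        + " check plugin.includes");
            }
            result.add(filter);
        }
        return result;
    }
}
```

Passing a `LinkedHashMap` keeps the fallback order deterministic, which is the property the issue description asks for. Whether a missing filter should raise an error or merely be skipped with a warning is a design choice the thread leaves open.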
[jira] [Assigned] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng reassigned NUTCH-1563: - Assignee: lufeng FetchSchedule#getFields is never used by GeneraterJob - Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1563.patch The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to read some fields of WebPage, those fields always come back null.
[jira] [Created] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
lufeng created NUTCH-1563: - Summary: FetchSchedule#getFields is never used by GeneraterJob Key: NUTCH-1563 URL: https://issues.apache.org/jira/browse/NUTCH-1563 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 2.1 Reporter: lufeng Priority: Minor Fix For: 2.2 The getFields method in FetchSchedule is never used, so if a user extends FetchSchedule and wants to read some fields of WebPage, those fields always come back null.
[jira] [Assigned] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng reassigned NUTCH-1555: - Assignee: lufeng Move to commons-cli for command line parsing - Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Fix For: 2.2 I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13627917#comment-13627917 ] lufeng commented on NUTCH-1555: --- Hi Lewis, yes, as you said, we can choose an established CLI framework to enforce more checking. With a CLI framework, the command and its output might look like this:
{code:java}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse -batchId updatedb
ParserJob: starting
ParserJob: resuming:false
ParserJob: forced reparse: false
ParserJob: batchId: updatedb
ParserJob: success
{code}
We still cannot guarantee that all user-supplied parameter values are correct. Alternatively, the quickest way to fix this bug may be to add a -batchId option to the parse command. But using a CLI framework is a good idea, since it makes parsing command line options easier. I am +1 to porting all command line parsing to a CLI framework. bug in 2.x ParserJob command line parsing -- Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
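The stricter checking proposed in the comment above can be sketched without any framework. The example below is a hand-rolled illustration using only the Java standard library; it is not commons-cli and not the code in the attached patches. The class `StrictArgs` and its `parse` method are hypothetical, and the flag names `-resume` and `-force` are illustrative assumptions; only `-batchId` is taken from the comment. The point is that a bare positional token such as `updatedb` is rejected instead of being silently accepted as a batchId.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch or commons-cli.
class StrictArgs {

    /**
     * Parses args, accepting only declared options.
     *
     * @param valueOptions options that must be followed by a value, e.g. "-batchId"
     * @param flags        options that take no value
     * @return map from option name to its value ("true" for flags)
     * @throws IllegalArgumentException on unknown tokens or missing values
     */
    public static Map<String, String> parse(String[] args,
                                            Set<String> valueOptions,
                                            Set<String> flags) {
        Map<String, String> parsed = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (flags.contains(arg)) {
                parsed.put(arg, "true");
            } else if (valueOptions.contains(arg)) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for " + arg);
                }
                parsed.put(arg, args[++i]); // consume the option's value
            } else {
                // A stray token such as "updatedb" ends up here.
                throw new IllegalArgumentException("Unrecognized argument: " + arg);
            }
        }
        return parsed;
    }
}
```

As the comment notes, this kind of check cannot validate the value itself: `-batchId updatedb` still parses, because any string is a syntactically valid batchId.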
[jira] [Commented] (NUTCH-1555) bug in 2.x ParserJob command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625432#comment-13625432 ] lufeng commented on NUTCH-1555: --- Hi Lewis, as you said, FetcherJob has this bug too. Running the command gives the following result:
{code:java}
lemo@debian:~/Workspace/java/apache-workspace/nutch2.x-svn/runtime/local$ bin/nutch fetch updatedb
FetcherJob: starting
FetcherJob: batchId: updatedb
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
{code}
This is because batchId is a string. bug in 2.x ParserJob command line parsing -- Key: NUTCH-1555 URL: https://issues.apache.org/jira/browse/NUTCH-1555 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.1 Reporter: Lewis John McGibbney Fix For: 2.2 I just accidentally passed in the following argument to parser job
{code}
law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse:false
ParserJob: batchId: updatedb
ParserJob: success
{code}
This is a bug for sure
[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1545: -- Attachment: NUTCH-1545-v2.patch 1. Removed any concept of crawldb and segments from the bin/crawl script. 2. Fixed batchId capture in the bin/crawl script by adding an argument to the GeneratorJob class. It will get a batchId if necessary. Any comments are welcome. capture batchId and remove references to segments in 2.x crawl script. -- Key: NUTCH-1545 URL: https://issues.apache.org/jira/browse/NUTCH-1545 Project: Nutch Issue Type: Task Affects Versions: 2.1 Reporter: Lewis John McGibbney Assignee: lufeng Priority: Minor Fix For: 2.2 Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch The concept of segment is replaced by batchId in 2.x. I'm currently getting rid of segment references in 2.x. This issue was flagged up and is separate from NUTCH-1532, which I am working on.
[jira] [Resolved] (NUTCH-1547) BasicIndexingFilter - Problem to index full title
[ https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1547. --- Resolution: Fixed BasicIndexingFilter - Problem to index full title - Key: NUTCH-1547 URL: https://issues.apache.org/jira/browse/NUTCH-1547 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Gustavo Rauber Assignee: lufeng Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch Original Estimate: 1h Remaining Estimate: 1h I faced this issue when trying to index the entire title, just like the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think the behavior should be the same as for the content. If you would like to fix it, just replace line number 90:
if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
by this one:
if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
Stack Trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1937)
  at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
  at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Cheers.
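The guard described in NUTCH-1547 can be tried in isolation. The sketch below is a standalone illustration rather than the BasicIndexingFilter source: the class `TitleTruncation` and the method `truncate` are names invented for the example, and the limit is passed as a parameter where the filter reads it from `indexer.max.title.length`. It shows why a limit of -1 must be excluded before calling `substring`, which otherwise throws the StringIndexOutOfBoundsException reported in the issue.

```java
// Hypothetical standalone class, not part of Nutch.
class TitleTruncation {

    /**
     * Truncates the title to at most maxLength characters.
     * A negative maxLength (e.g. -1) is treated as "no limit".
     */
    public static String truncate(String title, int maxLength) {
        if (maxLength > -1 && title.length() > maxLength) {
            // Only reached for a non-negative limit, so substring is safe here.
            return title.substring(0, maxLength);
        }
        return title;
    }

    public static void main(String[] args) {
        System.out.println(truncate("Apache Nutch", 6));  // truncated to the limit
        System.out.println(truncate("Apache Nutch", -1)); // left untouched
    }
}
```

Without the `maxLength > -1` check, the second call would evaluate `"Apache Nutch".substring(0, -1)` and fail.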
[jira] [Commented] (NUTCH-1547) BasicIndexingFilter - Problem to index full title
[ https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616227#comment-13616227 ] lufeng commented on NUTCH-1547: --- Feng: Committed revision 1462078 to trunk and revision 1462079 to 2.x. BasicIndexingFilter - Problem to index full title - Key: NUTCH-1547 URL: https://issues.apache.org/jira/browse/NUTCH-1547 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Gustavo Rauber Assignee: lufeng Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch Original Estimate: 1h Remaining Estimate: 1h I faced this issue when trying to index the entire title, just like the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think the behavior should be the same as for the content. If you would like to fix it, just replace line number 90:
if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
by this one:
if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
Stack Trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1937)
  at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
  at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Cheers.
[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up
[ https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616250#comment-13616250 ] lufeng commented on NUTCH-1538: --- Yes. However, we cannot guarantee that plugins written by users will not need the corresponding field values in the WebPage class. tuning of loaded fields during fetcherJob start-up -- Key: NUTCH-1538 URL: https://issues.apache.org/jira/browse/NUTCH-1538 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 2.1 Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / gora-core 0.2.1 running fetch with parse=true Reporter: Roland von Herget Attachments: NUTCH-1538-FetcherJob-v1.patch The main problem is that nutch loads nearly every row column from the DB during start-up of a fetcherJob when fetcher.parse=true. A parserJob needs e.g. the CONTENT field from the DB in order to parse. The fetcherJob adds all fields of the parserJob to its needed fields if running with fetcher.parse=true. [FetcherJob.getFields()] If the nutch configuration saves all fetched data to the DB (fetcher.store.content=true), you'll end up loading GBs of unused content during fetcherJob start-up.
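The field selection described in NUTCH-1538 can be modelled in a few lines. Everything in this sketch is an assumption made for illustration: the class `FetchFields`, the `Field` enum and its constants, and the `overwrittenByFetch` parameter are hypothetical and are not taken from FetcherJob.getFields() or from the attached patch. It shows how the union with the parser's fields pulls CONTENT into the start-up query, and one possible tuning that drops fields the fetch is about to rewrite.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the field selection, not Nutch code.
class FetchFields {

    // Illustrative field names only.
    enum Field { STATUS, FETCH_TIME, MARKERS, CONTENT, TEXT }

    /**
     * Computes the fields to load when a fetch job starts.
     *
     * @param fetcherFields      fields the fetcher itself needs
     * @param parserFields       fields the parser needs
     * @param parse              whether parsing runs inside the fetch job
     * @param overwrittenByFetch fields the fetch will rewrite, so their old
     *                           values need not be loaded when parsing inline
     */
    public static Set<Field> fieldsToLoad(Set<Field> fetcherFields,
                                          Set<Field> parserFields,
                                          boolean parse,
                                          Set<Field> overwrittenByFetch) {
        EnumSet<Field> fields = EnumSet.noneOf(Field.class);
        fields.addAll(fetcherFields);
        if (parse) {
            // The union is what makes CONTENT part of the start-up query.
            fields.addAll(parserFields);
            // Possible tuning: skip values that are replaced during the fetch.
            fields.removeAll(overwrittenByFetch);
        }
        return fields;
    }
}
```

The reservation raised in the comment applies directly to the last step: a user plugin may still read one of the excluded fields, so any exclusion list would need to be configurable or conservative.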
[jira] [Updated] (NUTCH-1547) BasicIndexingFilter - Problem to index full title
[ https://issues.apache.org/jira/browse/NUTCH-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1547: -- Attachment: NUTCH-1547-2x.patch Added patch for Nutch 2.x. BasicIndexingFilter - Problem to index full title - Key: NUTCH-1547 URL: https://issues.apache.org/jira/browse/NUTCH-1547 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.6, 2.1 Reporter: Gustavo Rauber Assignee: lufeng Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1547-2x.patch, NUTCH-1547.patch Original Estimate: 1h Remaining Estimate: 1h I faced this issue when trying to index the entire title, just like the content, by setting indexer.max.title.length to -1 in nutch-default.xml. I think the behavior should be the same as for the content. If you would like to fix it, just replace line number 90:
if (title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
by this one:
if (MAX_TITLE_LENGTH > -1 && title.length() > MAX_TITLE_LENGTH) { // truncate title if needed
Stack Trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
  at java.lang.String.substring(String.java:1937)
  at org.apache.nutch.indexer.basic.BasicIndexingFilter.filter(BasicIndexingFilter.java:91)
  at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:109)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:272)
  at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
  at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Cheers.
[jira] [Commented] (NUTCH-1389) parsechecker and indexchecker to report truncated content
[ https://issues.apache.org/jira/browse/NUTCH-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615360#comment-13615360 ] lufeng commented on NUTCH-1389: --- +1 Sebastian parsechecker and indexchecker to report truncated content - Key: NUTCH-1389 URL: https://issues.apache.org/jira/browse/NUTCH-1389 Project: Nutch Issue Type: Improvement Components: indexer, parser Affects Versions: nutchgora, 1.5 Reporter: Sebastian Nagel Priority: Minor Fix For: 1.7, 2.2 Attachments: NUTCH-1389-2x.patch, NUTCH-1389-trunk.patch ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit. Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats. A hint that truncation (and not a broken plugin) is the possible reason would be useful. See NUTCH-965 and {{ParseSegment.isTruncated(content)}}.