[jira] [Commented] (NUTCH-1854) ./bin/crawl fails with a parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485299#comment-14485299 ] lufeng commented on NUTCH-1854: --- If we set "fetcher.store.content=false" and "fetcher.parse=false", then the "bin/nutch parse" command will throw an exception when it checks whether the input content directory exists. I think the reason we need this parameter is that sometimes we set "fetcher.parse" to true and don't want to store the content because of a slow disk or limited disk space. So I think we can remove the "fetcher.store.content" parameter and, if "fetcher.parse=true", simply not store the page content. > ./bin/crawl fails with a parsing fetcher > > > Key: NUTCH-1854 > URL: https://issues.apache.org/jira/browse/NUTCH-1854 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.9 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.11 > > Attachments: NUTCH-1854ver1.patch > > > If you run ./bin/crawl with a parsing fetcher e.g. > <property> > <name>fetcher.parse</name> > <value>false</value> > <description>If true, fetcher will parse content. Default is false, which means that a separate parsing step is required after fetching is finished.</description> > </property> > we get a horrible message as follows > Exception in thread "main" java.io.IOException: Segment already parsed! > We could improve this by making logging more complete and by adding a trigger > to the crawl script which would check for crawl_parse for a given segment and > then skip parsing if this is present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374 ] lufeng edited comment on NUTCH-1939 at 2/11/15 2:16 AM: I think that's correct. +1 was (Author: amuseme.lu): Hi Sebastian, one question: how do you use the FetchItem returned by the "queueRedirect" method? I can't find any code that uses the returned object. I think the "queueRedirect" method has already added the redirect URL back to the fetch queue. > Fetcher fails to follow redirects > - > > Key: NUTCH-1939 > URL: https://issues.apache.org/jira/browse/NUTCH-1939 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.9 >Reporter: Sebastian Nagel > Fix For: 1.10 > > Attachments: NUTCH-1939.patch > > > As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with > http.redirect.max > 0 Fetcher does not follow redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315374#comment-14315374 ] lufeng commented on NUTCH-1939: --- Hi Sebastian, one question: how do you use the FetchItem returned by the "queueRedirect" method? I can't find any code that uses the returned object. I think the "queueRedirect" method has already added the redirect URL back to the fetch queue. > Fetcher fails to follow redirects > - > > Key: NUTCH-1939 > URL: https://issues.apache.org/jira/browse/NUTCH-1939 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.9 >Reporter: Sebastian Nagel > Fix For: 1.10 > > Attachments: NUTCH-1939.patch > > > As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with > http.redirect.max > 0 Fetcher does not follow redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1829) Generator : unable to distinguish real errors
[ https://issues.apache.org/jira/browse/NUTCH-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110193#comment-14110193 ] lufeng commented on NUTCH-1829: --- Yes, I think we should distinguish the different results by using different return codes, so that we can determine the next action according to the return code. > Generator : unable to distinguish real errors > - > > Key: NUTCH-1829 > URL: https://issues.apache.org/jira/browse/NUTCH-1829 > Project: Nutch > Issue Type: Bug > Components: nutchNewbie >Affects Versions: 1.9, 2.2.1 > Environment: Ubuntu Server 14.04, OpenJDK 7 >Reporter: Mathieu Bouchard > > The bin/nutch generate command is returning the same error code (-1) if there > is an error or no new segment to process, so there is no way to tell if the > error is real or not from a shell script. This problem is related to > NUTCH-1828. > The problem can be fixed by modifying the following Java source file: > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?revision=1619934&view=markup > At line 711, if there is no new segment, the generator returns -1, which is > the same return code returned at line 714 if there was an error. -- This message was sent by Atlassian JIRA (v6.2#6252)
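One way to make the two outcomes in NUTCH-1829 distinguishable is to give "no new segment" its own return code. The following is only a minimal sketch of that idea, not the actual Generator code: the class name, the GenerateStep interface, the segment path and the concrete code values (0, 1, -1) are all assumptions made for illustration.

```java
import java.util.Optional;

/** Hypothetical sketch: separate exit codes for "nothing to do" and "real error". */
public class GenerateExitCodes {
    // Assumed values for illustration only; a real patch may pick different codes.
    static final int SUCCESS = 0;
    static final int NO_NEW_SEGMENT = 1;
    static final int ERROR = -1;

    /** Stand-in for a generate step; an empty result means no segment was produced. */
    interface GenerateStep {
        Optional<String> run() throws Exception;
    }

    /** Maps the outcome of a generate step to one of the three codes above. */
    static int exitCodeFor(GenerateStep step) {
        try {
            return step.run().isPresent() ? SUCCESS : NO_NEW_SEGMENT;
        } catch (Exception e) {
            // Only genuine failures end up here.
            return ERROR;
        }
    }

    public static void main(String[] args) {
        // The segment path is a made-up example value.
        System.out.println(exitCodeFor(() -> Optional.of("segments/20140826")));
        System.out.println(exitCodeFor(Optional::empty));
        System.out.println(exitCodeFor(() -> { throw new IllegalStateException("boom"); }));
    }
}
```

With codes like these, a calling script could stop the crawl loop cleanly on "no new segment" and abort only on a real error.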
[jira] [Commented] (NUTCH-1828) bin/crawl : incorrect handling of nutch errors
[ https://issues.apache.org/jira/browse/NUTCH-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14110177#comment-14110177 ] lufeng commented on NUTCH-1828: --- Can you provide a patch for Nutch 2.x? I found that this issue also affects Nutch 2.x. Thanks Mathieu. > bin/crawl : incorrect handling of nutch errors > -- > > Key: NUTCH-1828 > URL: https://issues.apache.org/jira/browse/NUTCH-1828 > Project: Nutch > Issue Type: Bug > Components: nutchNewbie >Affects Versions: 1.9, 2.2.1 > Environment: Ubuntu Server 14.04, OpenJDK 7 >Reporter: Mathieu Bouchard > Attachments: apache-nutch-1.9-crawl-fix-retcode.patch > > > We are using Solr with Nutch to provide a complete search engine for our > website. > I created a cron job that would use Nutch to crawl and update the Solr index > each night. This cron job is trying to automatically correct some errors that > could result in a corrupt crawldb. However, it seems that the bin/crawl > command doesn't correctly propagate errors coming from bin/nutch. > Here is an example from the bin/crawl script : > $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR > if [ $? -ne 0 ] > then exit $? > fi > Even if there is an error in the nutch inject command, the crawl script > always returns 0. The way I understand it, the exit code returned is the > result of the shell test and not the result of the nutch inject command. > To correct this, we would need to modify the script with something like : > $bin/nutch inject $CRAWL_PATH/crawldb $SEEDDIR > RETCODE=$? > if [ $RETCODE -ne 0 ] > then exit $RETCODE > fi -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045525#comment-14045525 ] lufeng commented on NUTCH-385: -- Hi Julien, in the description of "fetcher.threads.per.queue" we could add that setting "fetcher.threads.per.queue" to a value > 1 will also cause "fetcher.server.delay" to be ignored. Another issue: I think the property name "fetcher.max.crawl.delay" is not consistent with "fetcher.server.delay" and "fetcher.server.min.delay". Would renaming it to "fetcher.server.max.delay" be more suitable? > Improve description of thread related configuration for Fetcher > --- > > Key: NUTCH-385 > URL: https://issues.apache.org/jira/browse/NUTCH-385 > Project: Nutch > Issue Type: Bug > Components: documentation, fetcher >Reporter: Chris Schneider >Assignee: Julien Nioche > Fix For: 1.9 > > Attachments: NUTCH-385.patch > > > For some time I've been puzzled by the interaction between two parameters that > control how often the fetcher can access a particular host: > 1) The server delay, which comes back from the remote server during our > processing of the robots.txt file, and which can be limited by > fetcher.max.crawl.delay. > 2) The fetcher.threads.per.host value, particularly when this is greater than > the default of 1. > According to my (limited) understanding of the code in HttpBase.java: > Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher > ends up keeping either 1 or 2 fetcher threads pointing at a particular host > continuously. In other words, it never tries to point 3 at the host, and it > always points a second thread at the host before the first thread finishes > accessing it. Since HttpBase.unblockAddr never gets called with > (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts > System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the > host. Thus, the server delay will never be used at all. 
The fetcher will be > continuously retrieving pages from the host, often with 2 fetchers accessing > the host simultaneously. > Suppose instead that the fetcher finally does allow the last thread to > complete before it gets around to pointing another thread at the target host. > When the last fetcher thread calls HttpBase.unblockAddr, it will now put > System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the > host. This, in turn, will prevent any threads from accessing this host until > the delay is complete, even though zero threads are currently accessing the > host. > I see this behavior as inconsistent. More importantly, the current > implementation certainly doesn't seem to answer my original question about > appropriate definitions for what appear to be conflicting parameters. > In a nutshell, how could we possibly honor the server delay if we allow more > than one fetcher thread to simultaneously access the host? > It would be one thing if whenever (fetcher.threads.per.host > 1), this > trumped the server delay, causing the latter to be ignored completely. That > is certainly not the case in the current implementation, as it will wait for > server delay whenever the number of threads accessing a given host drops to > zero. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010889#comment-14010889 ] lufeng commented on NUTCH-1785: --- +1, tested OK with Elasticsearch 1.2.0. One question: why convert the content byte[] to a String? If one segment contains both HTML and PDF or MP3 content, how should the base64 parameter be set? > Ability to index raw content > > > Key: NUTCH-1785 > URL: https://issues.apache.org/jira/browse/NUTCH-1785 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.9 > > Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, > NUTCH-1785-trunk.patch > > > Some use-cases require Nutch to actually write the raw content to a configured > indexing back-end. Since Content is never read, a plugin is out of the > question and therefore we need to force IndexJob to process Content as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (NUTCH-1521) CrawlDbFilter pass null url to urlNormailzers
[ https://issues.apache.org/jira/browse/NUTCH-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1521. - Resolution: Fixed Fix Version/s: (was: 2.4) 1.9 > CrawlDbFilter pass null url to urlNormailzers > - > > Key: NUTCH-1521 > URL: https://issues.apache.org/jira/browse/NUTCH-1521 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Trivial > Fix For: 1.9 > > Attachments: CrawlDbFilter_v1.patch, NUTCH-1521-trunk.patch, > TestCrawlDbFilter.java > > > urlNormalizers will get a null url if we set CRAWLDB_PURGE_404, and it will > throw a NullPointerException, and the WARN log will output something like this > "Skipping null NullPointerException". -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969601#comment-13969601 ] lufeng commented on NUTCH-1726: --- Hi all, is someone free to check this patch? Thanks. > HeadingsFilter does not find nested nodes > - > > Key: NUTCH-1726 > URL: https://issues.apache.org/jira/browse/NUTCH-1726 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.9 > > Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, > NUTCH-1726-trunk.patch > > > Filter won't find: > {code} > apache nutch > {code} > The getNodeValue() tries to read data from children but should traverse nodes > instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
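The NUTCH-1726 description says the filter should traverse nodes rather than read data from direct children only. A minimal sketch of such a traversal with the standard org.w3c.dom API might look like the following. It is not the code from the attached patches; the class name, method names and the sample markup are assumptions for illustration (the markup in the issue text itself was lost in this archive).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: collect heading text from arbitrarily nested child nodes. */
public class NestedHeadingText {

    /** Depth-first walk that appends the value of every text node under the given node. */
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    /** Parses a small well-formed snippet and returns the text below its root element. */
    static String headingText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder();
            collectText(doc.getDocumentElement(), sb);
            return sb.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Made-up nested heading; the text sits two levels below the h1 element.
        System.out.println(headingText("<h1><a href=\"#\">apache <b>nutch</b></a></h1>"));
    }
}
```

Reading only the direct children of the heading element would miss the text here, because the h1 element has no text node of its own.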
[jira] [Commented] (NUTCH-1752) cache robots.txt rules per protocol:host:port
[ https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964219#comment-13964219 ] lufeng commented on NUTCH-1752: --- Do you mean that a different port with the same protocol and host may have a different robots.txt file? +1 > cache robots.txt rules per protocol:host:port > - > > Key: NUTCH-1752 > URL: https://issues.apache.org/jira/browse/NUTCH-1752 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.8, 2.2.1 >Reporter: Sebastian Nagel > Fix For: 2.3, 1.9 > > Attachments: NUTCH-1752-v1.patch > > > HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" > (before NUTCH-1031 caching was per "host" only). The caching should be per > "protocol:host:port". In doubt, a request to a different port may deliver a > different {{robots.txt}}. > Applying robots.txt rules to a combination of host, protocol, and port is > common practice: > [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not > mention this explicitly (could be derived from examples) but others do: > * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and > port needs its own robots.txt file" > * [Google > webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]: > "The directives listed in the robots.txt file apply only to the host, > protocol and port number where the file is hosted." -- This message was sent by Atlassian JIRA (v6.2#6252)
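The "protocol:host:port" cache key proposed in NUTCH-1752 can be sketched with java.net.URL alone. This is an illustration under assumptions, not the code of NUTCH-1752-v1.patch: the class and method names are made up, and normalising a missing port to the protocol's default port is a design choice of this sketch.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

/** Hypothetical sketch: build a robots.txt cache key of the form protocol:host:port. */
public class RobotsCacheKey {

    static String cacheKey(URL url) {
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        String host = url.getHost().toLowerCase(Locale.ROOT);
        int port = url.getPort();        // -1 when the URL carries no explicit port
        if (port == -1) {
            port = url.getDefaultPort(); // 80 for http, 443 for https
        }
        return protocol + ":" + host + ":" + port;
    }

    /** Convenience overload that turns a malformed URL into an unchecked exception. */
    static String cacheKey(String url) {
        try {
            return cacheKey(new URL(url));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cacheKey("http://example.com/robots.txt"));      // http:example.com:80
        System.out.println(cacheKey("http://example.com:8080/robots.txt")); // http:example.com:8080
        System.out.println(cacheKey("https://example.com/robots.txt"));     // https:example.com:443
    }
}
```

Falling back to the default port means that "http://example.com/" and "http://example.com:80/" share one cache entry, while port 8080 gets its own.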
[jira] [Commented] (NUTCH-1733) parse-html to support HTML5 charset definitions
[ https://issues.apache.org/jira/browse/NUTCH-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938867#comment-13938867 ] lufeng commented on NUTCH-1733: --- +1 pass all tests > parse-html to support HTML5 charset definitions > --- > > Key: NUTCH-1733 > URL: https://issues.apache.org/jira/browse/NUTCH-1733 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.8, 2.2.1 >Reporter: Sebastian Nagel > Fix For: 2.3, 1.9 > > Attachments: NUTCH-1733-trunk.patch, charset_bom_html5.html, > charset_html5.html > > > HTML 5 allows to specify the character encoding of a page per > * {{}} > * Unicode Byte Order Mark (BOM) > These are allowed in addition to previous HTTP/http-equiv Content-Type, see > [[1|http://www.w3.org/TR/2011/WD-html5-diff-20110405/#character-encoding]]. > Parse-html ignores both meta charset and BOM, falls back to the default > encoding (cp1252). Parse-tika sets the encoding appropriately. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked
[ https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937426#comment-13937426 ] lufeng commented on NUTCH-1736: --- Hi ysc, you can check the content size to fix this issue, like this: {code:java} if (http.getMaxContent() >= 0 && (contentBytesRead + chunkLen) > http.getMaxContent()) chunkLen = http.getMaxContent() - contentBytesRead; {code} > Can't fetch page if http response header contains Transfer-Encoding:chunked > --- > > Key: NUTCH-1736 > URL: https://issues.apache.org/jira/browse/NUTCH-1736 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1 >Reporter: ysc >Priority: Critical > Fix For: 2.3, 1.9 > > Attachments: nutch-2.2.1.patch, nutch1.7.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > fetching: > http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html > Fetch failed with protocol status: EXCEPTION: java.io.IOException: > unzipBestEffort returned null -- This message was sent by Atlassian JIRA (v6.2#6252)
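The size check suggested in the NUTCH-1736 comment can be isolated into a small helper to see how it behaves at the limits. The sketch below is self-contained and does not use Nutch classes; the class and method names are assumptions, and treating a negative limit as "unlimited" simply follows the ">= 0" test in the quoted snippet.

```java
/** Hypothetical sketch: clamp a chunk length so the total never exceeds a content limit. */
public class ChunkClamp {

    /**
     * @param maxContent       content limit in bytes; a negative value means "no limit" here
     * @param contentBytesRead bytes already read from earlier chunks
     * @param chunkLen         length of the chunk about to be read
     * @return the number of bytes of this chunk that may still be read
     */
    static int clampChunkLen(int maxContent, int contentBytesRead, int chunkLen) {
        // Widen to long so the addition cannot overflow for very large chunks.
        if (maxContent >= 0 && (long) contentBytesRead + chunkLen > maxContent) {
            // Never return a negative length, even if the limit was already exceeded.
            return Math.max(0, maxContent - contentBytesRead);
        }
        return chunkLen;
    }

    public static void main(String[] args) {
        System.out.println(clampChunkLen(100, 90, 20)); // 10: only 10 bytes left under the limit
        System.out.println(clampChunkLen(100, 10, 20)); // 20: chunk fits completely
        System.out.println(clampChunkLen(-1, 90, 20));  // 20: no limit configured
    }
}
```

The Math.max guard and the widening to long are additions of this sketch; the snippet in the comment performs the plain subtraction.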
[jira] [Commented] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked
[ https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937418#comment-13937418 ] lufeng commented on NUTCH-1736: --- Hi Sebastian, I think this patch is not related to NUTCH-1647; maybe they just end in the same exception. NUTCH-1647 is about a URL redirection issue. > Can't fetch page if http response header contains Transfer-Encoding:chunked > --- > > Key: NUTCH-1736 > URL: https://issues.apache.org/jira/browse/NUTCH-1736 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1 >Reporter: ysc >Priority: Critical > Fix For: 2.3, 1.9 > > Attachments: nutch-2.2.1.patch, nutch1.7.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > fetching: > http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html > Fetch failed with protocol status: EXCEPTION: java.io.IOException: > unzipBestEffort returned null -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ] lufeng edited comment on NUTCH-1726 at 2/24/14 2:41 PM: Hi Markus, it seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch: {code:java} > svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2 > cd nutch-svn2 > patch -p0 < NUTCH-1726-trunk.patch > ant > cd src/plugin/headings/ > ant test {code} Everything seems OK. Yes, you are right, maybe someone wants to ignore long headers. But do we need a way to disable this check, e.g. by setting the headings.maxlength option to -1? Maybe someone wants to disable this feature. Feng was (Author: amuseme.lu): Hi Markus, it seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch: {code:bash} > svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2 > cd nutch-svn2 > patch -p0 < NUTCH-1726-trunk.patch > ant > cd src/plugin/headings/ > ant test {code} Everything seems OK. Yes, you are right, maybe someone wants to ignore long headers. But do we need a way to disable this check, e.g. by setting the headings.maxlength option to -1? Maybe someone wants to disable this feature. Feng > HeadingsFilter does not find nested nodes > - > > Key: NUTCH-1726 > URL: https://issues.apache.org/jira/browse/NUTCH-1726 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, > NUTCH-1726-trunk.patch > > > Filter won't find: > {code} > apache nutch > {code} > The getNodeValue() tries to read data from children but should traverse nodes > instead. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910355#comment-13910355 ] lufeng commented on NUTCH-1726: --- Hi Markus, it seems that HeadingsFilter does not find nested nodes in my test code, but I cannot reproduce your test result when I use the following process to test our patch: {code:bash} > svn checkout https://svn.apache.org/repos/asf/nutch/trunk nutch-svn2 > cd nutch-svn2 > patch -p0 < NUTCH-1726-trunk.patch > ant > cd src/plugin/headings/ > ant test {code} Everything seems OK. Yes, you are right, maybe someone wants to ignore long headers. But do we need a way to disable this check, e.g. by setting the headings.maxlength option to -1? Maybe someone wants to disable this feature. Feng > HeadingsFilter does not find nested nodes > - > > Key: NUTCH-1726 > URL: https://issues.apache.org/jira/browse/NUTCH-1726 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, > NUTCH-1726-trunk.patch > > > Filter won't find: > {code} > apache nutch > {code} > The getNodeValue() tries to read data from children but should traverse nodes > instead. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900432#comment-13900432 ] lufeng commented on NUTCH-1726: --- Hi Markus. But I didn't find any error using your newest patch. {code:xml} test: [echo] Testing plugin: headings [junit] Running org.apache.nutch.parse.headings.TestHeadingsParseFilter [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.142 sec BUILD SUCCESSFUL Total time: 3 seconds {code} * Maybe you can truncate long headers if their size is larger than the value of the maxlength option, so the headings.truncate option can be removed. > HeadingsFilter does not find nested nodes > - > > Key: NUTCH-1726 > URL: https://issues.apache.org/jira/browse/NUTCH-1726 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch, > NUTCH-1726-trunk.patch > > > Filter won't find: > {code} > apache nutch > {code} > The getNodeValue() tries to read data from children but should traverse nodes > instead. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1726) HeadingsFilter does not find nested nodes
[ https://issues.apache.org/jira/browse/NUTCH-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1726: -- Attachment: NUTCH-1726-trunk-v2.patch add a test case to check HeadingsFilter patch. :) > HeadingsFilter does not find nested nodes > - > > Key: NUTCH-1726 > URL: https://issues.apache.org/jira/browse/NUTCH-1726 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1726-trunk-v2.patch, NUTCH-1726-trunk.patch > > > Filter won't find: > {code} > apache nutch > {code} > The getNodeValue() tries to read data from children but should traverse nodes > instead. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861502#comment-13861502 ] lufeng commented on NUTCH-1691: --- As in the urlfilter-prefix plugin, we can move the WARN code to keep the code consistent. :) > DomainBlacklist url filter does not allow -D filter file override > - > > Key: NUTCH-1691 > URL: https://issues.apache.org/jira/browse/NUTCH-1691 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.8, 2.4 > > Attachments: NUTCH-1691-trunk.patch > > > This filter does not accept -Durlfilter.domainblacklist.file= overrides. The > plugin's file attribute is always used. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages
[ https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861491#comment-13861491 ] lufeng commented on NUTCH-1647: --- Yes, but we can check this property in the protocol plugins like this: {code:java} Response response; if (conf.getInt("http.redirect.max", 3) > 0) response = getResponse(u, datum, true); // make a request and follow redirects else response = getResponse(u, datum, false); {code} So if this property is set to a value > 0, the protocol plugins will follow redirects; otherwise they will not. > protocol-http throws unzipBestEffort returned null for some pages > - > > Key: NUTCH-1647 > URL: https://issues.apache.org/jira/browse/NUTCH-1647 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.7 >Reporter: Markus Jelsma > Fix For: 1.8 > > > bin/nutch indexchecker > http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale > Fetch failed with protocol status: exception(16), lastModified=0: > java.io.IOException: unzipBestEffort returned null > {code} > 2013-10-01 13:44:55,612 INFO http.Http - http.proxy.host = null > 2013-10-01 13:44:55,612 INFO http.Http - http.proxy.port = 8080 > 2013-10-01 13:44:55,612 INFO http.Http - http.timeout = 12000 > 2013-10-01 13:44:55,612 INFO http.Http - http.content.limit = 5242880 > 2013-10-01 13:44:55,612 INFO http.Http - http.agent = Mozilla/5.0 > (compatible; OpenindexSpider; > +http://www.openindex.io/en/webmasters/spider.html) > 2013-10-01 13:44:55,612 INFO http.Http - http.accept.language = > en-us,en-gb,en;q=0.7,*;q=0.3 > 2013-10-01 13:44:55,613 INFO http.Http - http.accept = > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 > 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output > java.io.IOException: unzipBestEffort returned null > at > org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317) > at > org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164) 
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140) > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:86) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:150) > {code} > Haven't got a clue yet as to what the exact issue is. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages
[ https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859880#comment-13859880 ] lufeng commented on NUTCH-1647: --- This is caused by the returned content length being 0, so the unzipBestEffort method returns null. {code:java} content = GZIPUtils.unzipBestEffort(compressed); {code} {code:bash} lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ wget --verbose --server-response http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale --2014-01-01 21:47:06-- http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale Resolving www.provinciegroningen.nl (www.provinciegroningen.nl)... 194.13.8.20 Connecting to www.provinciegroningen.nl (www.provinciegroningen.nl)|194.13.8.20|:80... connected. HTTP request sent, awaiting response... HTTP/1.1 301 TYPO3 RealURL redirect Date: Wed, 01 Jan 2014 13:47:22 GMT Server: Apache Location: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/ Cache-Control: max-age=3600 Expires: Wed, 01 Jan 2014 14:47:22 GMT Vary: Accept-Encoding Content-Length: 0 Content-Type: text/html; charset=UTF-8 Connection: Keep-Alive Set-Cookie: fe_typo_user=56acbb2f413742a928a94ebf51a51bcd; path=/ Age: 0 Location: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/ [following] --2014-01-01 21:47:13-- http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale/ Reusing existing connection to www.provinciegroningen.nl:80. HTTP request sent, awaiting response... 
HTTP/1.1 200 OK Date: Wed, 01 Jan 2014 13:47:22 GMT Server: Apache Cache-Control: max-age=3600 Expires: Wed, 01 Jan 2014 14:47:22 GMT Vary: Accept-Encoding Transfer-Encoding: chunked Content-Type: text/html; charset=utf-8 Connection: Keep-Alive Set-Cookie: fe_typo_user=29b705c75f2c6ff9cf495577efd727dd; path=/ Age: 0 Length: unspecified [text/html] Saving to: `rwe-centrale.2' [ <=> ] 51,728 2.92K/s in 34s 2014-01-01 21:47:48 (1.49 KB/s) - `rwe-centrale.2' saved [51728] {code} If you use the httpclient protocol plugin and enable the follow-redirects option, it downloads the page correctly. {code:java} lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ bin/nutch indexchecker http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale fetching: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale parsing: http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale contentType: application/xhtml+xml content : Provincie Groningen: RWE-centrale Provincie Groningen > Actueel > Dossiers > RWE-centrale RWE-c title : Provincie Groningen: RWE-centrale host : www.provinciegroningen.nl tstamp :Wed Jan 01 22:03:40 CST 2014 url : http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale {code} But this option is always set to false in the HttpBase class. {code:java} public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) { String urlString = url.toString(); try { URL u = new URL(urlString); Response response = getResponse(u, datum, false); // make a request {code} So the current solution is: 1. read that option from the configuration file and use it in the getProtocolOutput interface; but for the protocol-http plugin, we need to write some code to handle URL redirects. 
> protocol-http throws unzipBestEffort returned null for some pages > - > > Key: NUTCH-1647 > URL: https://issues.apache.org/jira/browse/NUTCH-1647 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 1.7 >Reporter: Markus Jelsma > Fix For: 1.8 > > > bin/nutch indexchecker > http://www.provinciegroningen.nl/actueel/dossiers/rwe-centrale > Fetch failed with protocol status: exception(16), lastModified=0: > java.io.IOException: unzipBestEffort returned null > {code} > 2013-10-01 13:44:55,612 INFO http.Http - http.proxy.host = null > 2013-10-01 13:44:55,612 INFO http.Http - http.proxy.port = 8080 > 2013-10-01 13:44:55,612 INFO http.Http - http.timeout = 12000 > 2013-10-01 13:44:55,612 INFO http.Http - http.content.limit = 5242880 > 2013-10-01 13:44:55,612 INFO http.Http - http.agent = Mozilla/5.0 > (compatible; OpenindexSpider; > +http://www.openindex.io/en/webmasters/spider.html) > 2013-10-01 13:44:55,612 INFO http.Http - http.accept.language = > en-us,en-gb,en;q=0.7,*;q=0.3 > 2013-10-01 13:44:55,613 INFO http.Http - http.accept = > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 > 2013-10-01 13:44:55,925 ERROR http.Http - Failed to get protocol output > java.io.IOException: unzipBestEffort returned null > at > org.apache.nutch.protocol.http.api.HttpB
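The cause named in the NUTCH-1647 comment, a redirect response with "Content-Length: 0" being passed to the gzip decoder, can be reproduced with the JDK alone. The sketch below is not Nutch code: it uses java.util.zip directly instead of GZIPUtils, and the class name, method names and the "return an empty body" guard are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: guard against zero-length bodies before gunzipping. */
public class EmptyGzipGuard {

    /** Returns an empty array for a missing or zero-length body instead of decoding it. */
    static byte[] unzipOrEmpty(byte[] compressed) {
        if (compressed == null || compressed.length == 0) {
            return new byte[0]; // e.g. a 301 response with Content-Length: 0
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Shows why the guard is needed: an empty stream is not a valid gzip stream. */
    static boolean rawGunzipFailsOnEmpty() {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(new byte[0]))) {
            return false;
        } catch (EOFException e) {
            return true; // the gzip header cannot be read from zero bytes
        } catch (IOException e) {
            return true;
        }
    }

    /** Helper used only to build test input. */
    static byte[] gzip(byte[] plain) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(plain);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A guard like this only avoids the exception; the page content is still only obtained once the redirect itself is followed, which is what the comments above discuss.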
[jira] [Commented] (NUTCH-1671) indexchecker to add digest field
[ https://issues.apache.org/jira/browse/NUTCH-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830530#comment-13830530 ] lufeng commented on NUTCH-1671: --- Yes, this field can be used by indexing filters. +1. Another question: should we add a check after parsing the content, like this? {code:java} ParseResult parseResult = new ParseUtil(conf).parse(content); if (parseResult == null) { LOG.error("Problem with parse - check log"); return (-1); } {code} > indexchecker to add digest field > > > Key: NUTCH-1671 > URL: https://issues.apache.org/jira/browse/NUTCH-1671 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.7, 2.2.1 >Reporter: Sebastian Nagel >Priority: Trivial > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1671-2x.patch, NUTCH-1671-trunk.patch > > > IndexingFiltersChecker does not add field "digest" as done by > IndexerMapReduce. Digest/signature could be also used by indexing filters > which then may fail. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (NUTCH-1667) Updatedb always ignore batchId
[ https://issues.apache.org/jira/browse/NUTCH-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830525#comment-13830525 ] lufeng commented on NUTCH-1667: --- Yes, you are right. +1 > Updatedb always ignore batchId > -- > > Key: NUTCH-1667 > URL: https://issues.apache.org/jira/browse/NUTCH-1667 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.3 >Reporter: Nguyen Manh Tien >Priority: Minor > Attachments: NUTCH-1556-batchId.patch > > > batchId is not set in currentJob because we set batchId after creating > currentJob, so in UpdateDbMapper batchId is null and will be assigned to -all. > I changed it to set batchId before creating currentJob.
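The ordering problem described in NUTCH-1667 can be illustrated with a minimal, self-contained sketch. It assumes, as the Hadoop Job documentation states for Job.getInstance(Configuration), that a job takes a copy of the configuration when it is created, so values set afterwards are not visible to the job. The class SnapshotJob, the key "batch.id" and both helper methods are hypothetical stand-ins for illustration, not Nutch or Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

class BatchIdOrderSketch {

    /** Hypothetical stand-in for a job that copies its configuration on creation. */
    static final class SnapshotJob {
        private final Map<String, String> conf;

        SnapshotJob(Map<String, String> conf) {
            // Copy the configuration: later changes to the caller's map are not seen.
            this.conf = new HashMap<>(conf);
        }

        /** Mirrors the described fallback: a missing batchId is treated as "-all". */
        String batchIdSeenByMapper() {
            return conf.getOrDefault("batch.id", "-all");
        }
    }

    /** Buggy order: the job is created first, the batchId is set afterwards. */
    static String setAfterCreate(String batchId) {
        Map<String, String> conf = new HashMap<>();
        SnapshotJob job = new SnapshotJob(conf);
        conf.put("batch.id", batchId); // too late, the job already holds its copy
        return job.batchIdSeenByMapper();
    }

    /** Fixed order: the batchId is set before the job is created. */
    static String setBeforeCreate(String batchId) {
        Map<String, String> conf = new HashMap<>();
        conf.put("batch.id", batchId);
        SnapshotJob job = new SnapshotJob(conf);
        return job.batchIdSeenByMapper();
    }

    public static void main(String[] args) {
        System.out.println(setAfterCreate("1234-5678"));  // falls back to -all
        System.out.println(setBeforeCreate("1234-5678")); // sees the batchId
    }
}
```

Whether the real job class behaves exactly like this copy is an assumption here; the sketch only shows why the order of the two statements matters.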
[jira] [Work started] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1670 started by lufeng. > set same crawldb directory in mergedb parameter > --- > > Key: NUTCH-1670 > URL: https://issues.apache.org/jira/browse/NUTCH-1670 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1670.patch > > > When two crawldbs are merged and the same crawldb directory is used in the bin/nutch mergedb > parameters, a data not found exception is thrown. > bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 > bin/nutch generate crawldb_t1 segment
[jira] [Updated] (NUTCH-1670) set same crawldb directory in mergedb parameter
[ https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1670: -- Attachment: NUTCH-1670.patch > set same crawldb directory in mergedb parameter > --- > > Key: NUTCH-1670 > URL: https://issues.apache.org/jira/browse/NUTCH-1670 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1670.patch > > > When two crawldbs are merged and the same crawldb directory is used in the bin/nutch mergedb > parameters, a data not found exception is thrown. > bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 > bin/nutch generate crawldb_t1 segment
[jira] [Created] (NUTCH-1670) set same crawldb directory in mergedb parameter
lufeng created NUTCH-1670: - Summary: set same crawldb directory in mergedb parameter Key: NUTCH-1670 URL: https://issues.apache.org/jira/browse/NUTCH-1670 Project: Nutch Issue Type: Bug Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 When two crawldbs are merged and the same crawldb directory is used in the bin/nutch mergedb parameters, a data not found exception is thrown. bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2 bin/nutch generate crawldb_t1 segment
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812840#comment-13812840 ] lufeng commented on NUTCH-1651: --- Hi Lewis, yes, the patch is OK, and this is a way to set ModifiedTime. +1 > modifiedTime and prevmodifiedTime never set > > > Key: NUTCH-1651 > URL: https://issues.apache.org/jira/browse/NUTCH-1651 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.2.1 >Reporter: Talat UYARER > Fix For: 2.3 > > Attachments: NUTCH-1651.patch > > > modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is > always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime > is set only once in the beginning by zero-control of AdaptiveFetchScheduler. > But this is not sufficient since modifiedTime needs to be updated whenever > last modified time is available. We corrected this with a patch. > Also we noticed that prevModifiedTime is not written to database and we > corrected that too. > With this patch, whenever lastModifiedTime is available, we do two things. > First we set modifiedTime in the Page object to prevModifiedTime. After that > we set lastModifiedTime to modifiedTime.
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809081#comment-13809081 ] lufeng commented on NUTCH-1651: --- Hi Talat, yes, you are right, lastModified is a fetch parameter, but it can also be set by parser plugins, because this attribute can also be defined by parsers. It is an attribute of WebPage. I cannot find any code in Nutch 2.x that sets the ModifiedTime in WebPage, nor in Nutch 1.x. Very strange. > modifiedTime and prevmodifiedTime never set > > > Key: NUTCH-1651 > URL: https://issues.apache.org/jira/browse/NUTCH-1651 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.2.1 >Reporter: Talat UYARER > Fix For: 2.3 > > Attachments: NUTCH-1651.patch > > > modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is > always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime > is set only once in the beginning by zero-control of AdaptiveFetchScheduler. > But this is not sufficient since modifiedTime needs to be updated whenever > last modified time is available. We corrected this with a patch. > Also we noticed that prevModifiedTime is not written to database and we > corrected that too. > With this patch, whenever lastModifiedTime is available, we do two things. > First we set modifiedTime in the Page object to prevModifiedTime. After that > we set lastModifiedTime to modifiedTime.
[jira] [Commented] (NUTCH-1564) AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
[ https://issues.apache.org/jira/browse/NUTCH-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808091#comment-13808091 ]

lufeng commented on NUTCH-1564:
---

Yes, this problem is caused by the range of the interval value. Maybe delta also needs to be limited by a maximum value, such as MAX_INTERVAL:

{code:java}
if (SYNC_DELTA) {
  // try to synchronize with the time of change
  long delta = (fetchTime - modifiedTime) / 1000L;
  if (delta > interval) interval = delta;
  if (delta < MIN_INTERVAL) {
    delta = MIN_INTERVAL;
  } else if (delta > MAX_INTERVAL) {
    delta = MAX_INTERVAL;
  }
  refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
}
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
...
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
{code}

So the final fetch time is fetchTime + fetchInterval - delta * SYNC_DELTA_RATE = fetchTime + 4.9 days.

Or can we limit the interval after calling the setFetchTime method?

{code:java}
if (SYNC_DELTA) {
  // try to synchronize with the time of change
  long delta = (fetchTime - modifiedTime) / 1000L;
  if (delta > interval) interval = delta;
  refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
}
datum.setFetchTime(refTime + Math.round(interval * 1000.0));
if (interval < MIN_INTERVAL) {
  interval = MIN_INTERVAL;
} else if (interval > MAX_INTERVAL) {
  interval = MAX_INTERVAL;
}
datum.setFetchInterval(interval);
datum.setModifiedTime(modifiedTime);
{code}

> AdaptiveFetchSchedule: sync_delta forces immediate refetch for documents not modified
> ---
>
> Key: NUTCH-1564
> URL: https://issues.apache.org/jira/browse/NUTCH-1564
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.6, 2.1
> Reporter: Sebastian Nagel
> Priority: Critical
>
> In a continuous crawl with adaptive fetch scheduling documents not modified for a longer time may be fetched in every cycle.
> A continuous crawl is run daily with 3 cycles and the following scheduling intervals (freshness matters):
> {code}
> db.fetch.schedule.class = org.apache.nutch.crawl.AdaptiveFetchSchedule
> db.fetch.schedule.adaptive.sync_delta = true (default)
> db.fetch.schedule.adaptive.sync_delta_rate = 0.3 (default)
> db.fetch.interval.default = 172800 (2 days)
> db.fetch.schedule.adaptive.min_interval = 86400 (1 day)
> db.fetch.schedule.adaptive.max_interval = 604800 (7 days)
> db.fetch.interval.max = 604800 (7 days)
> {code}
> On Apr 18 a URL is generated and fetched (from segment dump):
> {code}
> Crawl Generate::
> Status: 2 (db_fetched)
> Fetch time: Mon Apr 15 19:43:22 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> Crawl Fetch::
> Status: 33 (fetch_success)
> Fetch time: Thu Apr 18 01:23:51 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> Running CrawlDb update results in a next fetch time in the past (which forces an immediate refetch in the next cycle):
> {code}
> Status: 6 (db_notmodified)
> Fetch time: Tue Apr 16 01:37:00 CEST 2013
> Modified time: Tue Mar 19 01:07:42 CET 2013
> Retries since fetch: 0
> Retry interval: 604800 seconds (7 days)
> {code}
> This behavior is caused by the sync_delta calculation in AdaptiveFetchSchedule:
> {code}
> if (SYNC_DELTA) {
>   // try to synchronize with the time of change
>   long delta = (fetchTime - modifiedTime) / 1000L;
>   if (delta > interval) interval = delta;
>   refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE * 1000);
> }
> if (interval < MIN_INTERVAL) {
>   interval = MIN_INTERVAL;
> } else if (interval > MAX_INTERVAL) {
>   interval = MAX_INTERVAL;
> }
> ...
> datum.setFetchTime(refTime + Math.round(interval * 1000.0));
> {code}
> {{delta}} is 30 days (Apr 18 - Mar 19). {{refTime}} is then 9 days in the past ({{delta}} * 0.3). After adding {{interval}} (adjusted to {{MAX_INTERVAL}} = 7 days) to {{refTime}} the next fetch "should" take place 2 days in the past (Apr 16).
> According to the [javadoc|http://nutch.apache.org/apidocs-2.1/org/apache/nutch/crawl/AdaptiveFetchSchedule.html] (if understood right), there are two aims of the sync_delta if we know that a document hasn't been modified for long:
> * increase the fetch interval immediately (not step by step)
> * because we expect the document to be changed within the adaptive interval (but it hasn't), we shift the "reference time", i.e
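The arithmetic discussed in NUTCH-1564 can be checked with a small stand-alone sketch. It recomputes the next fetch time relative to fetchTime, in seconds, for the configuration quoted in the issue (min_interval 1 day, max_interval 7 days, sync_delta_rate 0.3), once as in the quoted code and once with delta clamped to MAX_INTERVAL as lufeng suggests. The class and method names are hypothetical, delta is rounded to exactly 30 days as in the issue text, and this is not the Nutch implementation.

```java
class SyncDeltaSketch {
    static final long DAY = 86_400L;            // seconds per day
    static final long MIN_INTERVAL = 1 * DAY;   // db.fetch.schedule.adaptive.min_interval
    static final long MAX_INTERVAL = 7 * DAY;   // db.fetch.schedule.adaptive.max_interval
    static final double SYNC_DELTA_RATE = 0.3;  // db.fetch.schedule.adaptive.sync_delta_rate

    /**
     * Returns the next fetch time as an offset from fetchTime, in seconds.
     * A negative value means the next fetch is scheduled in the past.
     */
    static long nextFetchOffset(long deltaSeconds, long intervalSeconds, boolean clampDelta) {
        long delta = deltaSeconds;
        long interval = intervalSeconds;
        if (delta > interval) {
            interval = delta;                   // grow the interval immediately
        }
        if (clampDelta) {
            // Proposed change: keep delta within [MIN_INTERVAL, MAX_INTERVAL].
            delta = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, delta));
        }
        // refTime - fetchTime: shift the reference time into the past.
        long refOffset = -Math.round(delta * SYNC_DELTA_RATE);
        // Clamp the interval as in the quoted code.
        interval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
        return refOffset + interval;
    }

    public static void main(String[] args) {
        long delta = 30 * DAY;    // document unmodified for about 30 days
        long interval = 7 * DAY;  // current retry interval
        System.out.println(nextFetchOffset(delta, interval, false)); // -172800 s, 2 days in the past
        System.out.println(nextFetchOffset(delta, interval, true));  // 423360 s, 4.9 days ahead
    }
}
```

With delta unclamped, the 9-day shift exceeds the 7-day maximum interval, which is why the next fetch lands in the past; clamping delta limits the shift to 2.1 days.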
[jira] [Commented] (NUTCH-1651) modifiedTime and prevmodifiedTime never set
[ https://issues.apache.org/jira/browse/NUTCH-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808045#comment-13808045 ]

lufeng commented on NUTCH-1651:
---

Hi Talat,

I think reading the last-modified time from the header is not appropriate here. A user may want to detect the modification of an HTML page in a parser plugin from the content of that URL, not from the metadata in the headers, even when the value of "Last-Modified" in the headers has changed.

{code:java}
+Utf8 lastModified = page.getFromHeaders(new Utf8("Last-Modified"));
+if ( lastModified != null ){
+  try {
+    modifiedTime = HttpDateFormat.toLong(lastModified.toString());
+    prevModifiedTime = page.getModifiedTime();
+  } catch (Exception e) {
+  }
+}
{code}

Maybe the appropriate way is to let a user-defined parser plugin set the modified time, rather than setting it in the DbUpdateReducer class.

> modifiedTime and prevmodifiedTime never set
> ---
>
> Key: NUTCH-1651
> URL: https://issues.apache.org/jira/browse/NUTCH-1651
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.2.1
> Reporter: Talat UYARER
> Fix For: 2.3
>
> Attachments: NUTCH-1651.patch
>
>
> modifiedTime is never set. If you use DefaultFetchScheduler, modifiedTime is always zero as default. But if you use AdaptiveFetchScheduler, modifiedTime is set only once in the beginning by zero-control of AdaptiveFetchScheduler. But this is not sufficient since modifiedTime needs to be updated whenever last modified time is available. We corrected this with a patch.
> Also we noticed that prevModifiedTime is not written to database and we corrected that too.
> With this patch, whenever lastModifiedTime is available, we do two things. First we set modifiedTime in the Page object to prevModifiedTime. After that we set lastModifiedTime to modifiedTime.
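The two-step update described in NUTCH-1651 (move the current modifiedTime to prevModifiedTime, then store the new last-modified value) can be sketched without any Nutch classes. This is only an illustration under stated assumptions: the class ModifiedTimeSketch and its fields are hypothetical, and the header is parsed with the JDK's DateTimeFormatter.RFC_1123_DATE_TIME instead of the HttpDateFormat.toLong call used in the quoted patch.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class ModifiedTimeSketch {
    long modifiedTime;      // epoch millis, 0 means "not known yet"
    long prevModifiedTime;  // epoch millis of the previous known modification

    /** Parses an HTTP date such as "Tue, 19 Mar 2013 00:07:42 GMT"; returns -1 if unparsable. */
    static long parseHttpDate(String header) {
        try {
            return ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            return -1L;
        }
    }

    /** Applies a Last-Modified header value, if one is available and parsable. */
    void update(String lastModifiedHeader) {
        if (lastModifiedHeader == null) {
            return; // header missing: keep the current values
        }
        long parsed = parseHttpDate(lastModifiedHeader);
        if (parsed < 0) {
            return; // malformed header: keep the current values
        }
        prevModifiedTime = modifiedTime; // step 1: remember the old value
        modifiedTime = parsed;           // step 2: store the new value
    }

    public static void main(String[] args) {
        ModifiedTimeSketch page = new ModifiedTimeSketch();
        page.update("Tue, 19 Mar 2013 00:07:42 GMT");
        page.update("Thu, 18 Apr 2013 00:00:00 GMT");
        System.out.println(page.prevModifiedTime + " -> " + page.modifiedTime);
    }
}
```

The sketch leaves open the question raised in the comment, namely whether this value should come from the HTTP header at all or be set by a parser plugin.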
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1645: -- Attachment: NUTCH-1645-v3.patch 1. Handle the case where an implementation that reaches a lower number of misses would cause the test to fail. 2. Improve the code style. Yes, you are right, this unit test only checks for the equality of some "key statistics", as you said. But the problem is how to write a test case that verifies the correctness of algorithms in Nutch such as the AdaptiveFetchSchedule class and finds the bug that you pointed out in NUTCH-1564. Could you give me some suggestions? I will check NUTCH-1564 and hope to find a solution to this issue. Thanks Sebastian > Junit Test Case for Adaptive Fetch Schedule class > - > > Key: NUTCH-1645 > URL: https://issues.apache.org/jira/browse/NUTCH-1645 > Project: Nutch > Issue Type: Test >Affects Versions: 2.2.1 >Reporter: Talat UYARER >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch, > NUTCH-1645-v3.patch > > > Currently there is no test case for Adaptive Fetch Schedule. A JUnit test is > written for it.
[jira] [Commented] (NUTCH-1650) Adaptive Fetch Scheduler interval Wrong Set
[ https://issues.apache.org/jira/browse/NUTCH-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787664#comment-13787664 ] lufeng commented on NUTCH-1650: --- Yes, this code in Nutch 1.x is correct. +1 > Adaptive Fetch Scheduler interval Wrong Set > --- > > Key: NUTCH-1650 > URL: https://issues.apache.org/jira/browse/NUTCH-1650 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.2.1 >Reporter: Talat UYARER >Priority: Minor > Labels: scheduler > Fix For: 2.3 > > Attachments: NUTCH-1650.patch > > > After the interval is calculated, it is set without checking it against the max and > min values. Moreover, if sync_delta is true, the interval is set before it is changed. > This patch fixes this.
[jira] [Updated] (NUTCH-1645) Junit Test Case for Adaptive Fetch Schedule class
[ https://issues.apache.org/jira/browse/NUTCH-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1645: -- Attachment: NUTCH-1645-v2.patch Added two test cases: one uses the default parameters and the other runs without sync delta enabled. Thanks Yasin, you can add another test case with some parameters changed. > Junit Test Case for Adaptive Fetch Schedule class > - > > Key: NUTCH-1645 > URL: https://issues.apache.org/jira/browse/NUTCH-1645 > Project: Nutch > Issue Type: Test >Affects Versions: 2.2.1 >Reporter: Talat UYARER >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1645.patch, NUTCH-1645-v2.patch > > > Currently there is no test case for Adaptive Fetch Schedule. A JUnit test is > written for it.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410 ] lufeng commented on NUTCH-1556: --- oh, I'm so sorry, I already fixed this problem. commit revision 1522566 in 2.x HEAD. thanks Julien. > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1636) Indexer to normalize and filter repr URL
[ https://issues.apache.org/jira/browse/NUTCH-1636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761888#comment-13761888 ] lufeng commented on NUTCH-1636: --- yes, this patch can solve the issue reported by lain. +1 > Indexer to normalize and filter repr URL > > > Key: NUTCH-1636 > URL: https://issues.apache.org/jira/browse/NUTCH-1636 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 1.7 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1636-1.patch > > > Indexer if used with option -normalize and/or -filter (cf. NUTCH-1300) should > also normalize and filter representation URLs. Otherwise URLs which are > target of a redirect, and have repr URL set (see URLUtil.chooseRepr) may show > up in index with an undesirable URL. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1556. --- Resolution: Fixed > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759123#comment-13759123 ] lufeng commented on NUTCH-1556: --- Committed revision 1520332 in 2.x HEAD Thanks kaveh. > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756080#comment-13756080 ] lufeng commented on NUTCH-1556: --- I will commit this unless there are objections > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752432#comment-13752432 ] lufeng commented on NUTCH-1556: --- thanks kaveh. +1 > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, > NUTCH-1556-v3.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1556: -- Attachment: NUTCH-1556-v2.patch new patch merged with issue 1632 > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ] -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750804#comment-13750804 ] lufeng commented on NUTCH-1632: --- Hi kaveh, I'm sorry and I will close this issue and merge the two patch into one. thanks. > add batchId argument for DbUpdaterJob > - > > Key: NUTCH-1632 > URL: https://issues.apache.org/jira/browse/NUTCH-1632 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 2.2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1632.patch > > > add batchId argument for DbUpdaterJob, you can put the batchId to > DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803 ] lufeng commented on NUTCH-1556: --- Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two patches into one; can you check it out? Thanks. > enabling updatedb to accept batchId > > > Key: NUTCH-1556 > URL: https://issues.apache.org/jira/browse/NUTCH-1556 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: kaveh minooie > Fix For: 2.3 > > Attachments: NUTCH-1556.patch > > > So the idea here is to be able to run updatedb and fetch for different > batchId simultaneously. I put together a patch. it seems to be working ( it > does skip the rows that do not match the batchId), but I am worried if and > how it might affect the sorting in the reduce part. anyway check it out. > it also change the command line usage to this: > Usage: DbUpdaterJob ( | -all) [-crawlId ]
[jira] [Closed] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1632. - Resolution: Duplicate > add batchId argument for DbUpdaterJob > - > > Key: NUTCH-1632 > URL: https://issues.apache.org/jira/browse/NUTCH-1632 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 2.2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1632.patch > > > add batchId argument for DbUpdaterJob, you can put the batchId to > DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1632) add batchId argument for DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1632: -- Attachment: NUTCH-1632.patch > add batchId argument for DbUpdaterJob > - > > Key: NUTCH-1632 > URL: https://issues.apache.org/jira/browse/NUTCH-1632 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 2.2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1632.patch > > > add batchId argument for DbUpdaterJob, you can put the batchId to > DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1632) add batchId argument for DbUpdaterJob
lufeng created NUTCH-1632: - Summary: add batchId argument for DbUpdaterJob Key: NUTCH-1632 URL: https://issues.apache.org/jira/browse/NUTCH-1632 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 2.2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.3 add batchId argument for DbUpdaterJob, you can put the batchId to DbUpdaterJob. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749663#comment-13749663 ] lufeng commented on NUTCH-1619: --- Hi Julien, I have already fixed the compilation bug, and I will pay more attention next time. Thanks for the reminder. > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1 >Reporter: Yasin Kılınç >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch > > > We need the Dmoz information of fetched URLs to be written to the database, so that > this information can be used as a snippet by the indexer of the search engine we are > working on.
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749419#comment-13749419 ] lufeng commented on NUTCH-1619: --- I'm so sorry, DataStore may not throw IOException. It has already been fixed. Committed @revision 1517155 in 2.x HEAD > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1 >Reporter: Yasin Kılınç >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch > > > We need Dmoz information of fetched URLs can be written to database. So these > information can be used like snipppet by indexer of the search engine we are > working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1619. --- Resolution: Fixed > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1 >Reporter: Yasin Kılınç >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch > > > We need Dmoz information of fetched URLs can be written to database. So these > information can be used like snipppet by indexer of the search engine we are > working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749409#comment-13749409 ] lufeng commented on NUTCH-1619: --- Committed @revision 1517147 in 2.x HEAD Thank you very much Talat for the patch. > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1 >Reporter: Yasin Kılınç >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch > > > We need Dmoz information of fetched URLs can be written to database. So these > information can be used like snipppet by indexer of the search engine we are > working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1631) Display Document Count Added To Solr Server
[ https://issues.apache.org/jira/browse/NUTCH-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748595#comment-13748595 ] lufeng commented on NUTCH-1631: --- Good statistical methods. +1 > Display Document Count Added To Solr Server > --- > > Key: NUTCH-1631 > URL: https://issues.apache.org/jira/browse/NUTCH-1631 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1, 2.2, 2.2.1 >Reporter: Furkan KAMACI >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1631.patch > > > Currently you can not see how many documents are added to Solr Server from > Nutch. One should be able to see how many documents are added to Solr Server > simultaneously (as a hadoop counter) and also total document count should be > logged too after all documents are added to Solr Server. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747558#comment-13747558 ] lufeng commented on NUTCH-1619: --- Thanks Talat. +1 for commit. > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.1 >Reporter: Yasin Kılınç >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1619.patch, NUTCH-DMOZ-Snippet.patch > > > We need Dmoz information of fetched URLs can be written to database. So these > information can be used like snipppet by indexer of the search engine we are > working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743621#comment-13743621 ] lufeng commented on NUTCH-1619: --- Hi Yasin, did you forget to close the data store? Good. > Writes Dmoz Description and Title information to db with snippet argument > - > > Key: NUTCH-1619 > URL: https://issues.apache.org/jira/browse/NUTCH-1619 > Project: Nutch > Issue Type: Improvement > Affects Versions: 2.1 > Reporter: Yasin Kılınç > Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-DMOZ-Snippet.patch > > > We need the Dmoz information of fetched URLs to be written to the database, so > that this information can be used as a snippet by the indexer of the search engine we are > working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.
[ https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739731#comment-13739731 ] lufeng commented on NUTCH-1294: --- Hi Lewis. My pleasure. But what should I do for README.txt? Do you mean I should also change something in https://svn.apache.org/repos/asf/nutch/branches/2.x/README.txt? :) > IndexClean job with solr implementation. > > > Key: NUTCH-1294 > URL: https://issues.apache.org/jira/browse/NUTCH-1294 > Project: Nutch > Issue Type: Improvement > Affects Versions: nutchgora > Reporter: Dan Rosher > Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, > NUTCH-1294-v3.patch > > > I started by copying/altering the trunk version of SolrClean, though it was > inadequate for our needs. We needed to mark particular pages as gone even > though they still might be visible on the web. This implementation abstracts > the index cleaning process, has a Solr implementation, and adds a clean index > plugin extension that allows others to tailor how pages might be removed from > their store. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1294) IndexClean job with solr implementation.
[ https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1294. --- Resolution: Fixed > IndexClean job with solr implementation. > > > Key: NUTCH-1294 > URL: https://issues.apache.org/jira/browse/NUTCH-1294 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora >Reporter: Dan Rosher >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, > NUTCH-1294-v3.patch > > > I started by copying/altering the trunk version of SolrClean, though is was > inadequate for our needs. We needed to mark particular pages as gone even > though they still might be visible on the web, this implementation abstracts > the index cleaning process, has a Solr implementation, and adds a clean index > plugin extension that allows others to tailor how pages might be removed from > their store. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.
[ https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738361#comment-13738361 ] lufeng commented on NUTCH-1294: --- Committed @revision 1513549 in 2.x HEAD > IndexClean job with solr implementation. > > > Key: NUTCH-1294 > URL: https://issues.apache.org/jira/browse/NUTCH-1294 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora >Reporter: Dan Rosher >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, > NUTCH-1294-v3.patch > > > I started by copying/altering the trunk version of SolrClean, though is was > inadequate for our needs. We needed to mark particular pages as gone even > though they still might be visible on the web, this implementation abstracts > the index cleaning process, has a Solr implementation, and adds a clean index > plugin extension that allows others to tailor how pages might be removed from > their store. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1294) IndexClean job with solr implementation.
[ https://issues.apache.org/jira/browse/NUTCH-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736978#comment-13736978 ] lufeng commented on NUTCH-1294: --- Passed testing with Solr 4.2.1. +1 for commit. > IndexClean job with solr implementation. > > > Key: NUTCH-1294 > URL: https://issues.apache.org/jira/browse/NUTCH-1294 > Project: Nutch > Issue Type: Improvement > Affects Versions: nutchgora > Reporter: Dan Rosher > Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1294.patch, NUTCH-1294-v2.patch, > NUTCH-1294-v3.patch > > > I started by copying/altering the trunk version of SolrClean, though it was > inadequate for our needs. We needed to mark particular pages as gone even > though they still might be visible on the web. This implementation abstracts > the index cleaning process, has a Solr implementation, and adds a clean index > plugin extension that allows others to tailor how pages might be removed from > their store. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714701#comment-13714701 ] lufeng commented on NUTCH-1613: --- OK. Will this cookie affect other URLs that don't need any cookie, so that they receive a "Bad Request" error when using httpclient? It does not seem very general, so could we add a filter that specifies a different cookie for each host? > Timeouts in protocol-httpclient when crawling same host with >2 threads and > added cookie strings for both http protocols > > > Key: NUTCH-1613 > URL: https://issues.apache.org/jira/browse/NUTCH-1613 > Project: Nutch > Issue Type: Bug > Components: protocol > Affects Versions: 2.2.1 > Reporter: Brian > Priority: Minor > Labels: patch > Fix For: 2.3 > > Attachments: NUTCH-1613.patch > > > 1.) When using protocol-httpclient to crawl a single website (the same host) > I would always get a bunch of timeout errors during fetching and the pages > with errors would not be fetched.
E.g.: > 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www > failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: > Timeout waiting for connection > 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www > (queue crawl delay=0ms) > 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following > error: > org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting > for connection > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497) > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) > at > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174) > at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518) > This is because by default the connection pool manager only allows 2 > connections per host so if more than 2 threads are used the others will tend > to time out waiting to get a connection. The code previously set max > connections correctly but not connections per host. > 2.) I also added at the same time simple modifications to both protocol-http > and protocol-httpclient to allow specifying a cookie string in the conf file > to include in request headers.
> I use this to crawl site content requiring authentication - it is better for > me to specify the cookie string for the authentication than go through the > whole authentication process and specifying login info. > The nutch-site.xml property is the following: > <property> > <name>http.cookie_string</name> > <value>XX_AL=authorization_value_goes_here</value> > <description>String to use as the cookie value for HTTP > requests</description> > </property> > Although I use it for authentication it can be used to specify any single > cookie string for the crawl (httpclient does support different cookies for > different hosts but I did not get into that). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
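The per-host limit described in the quoted issue can be illustrated with a minimal sketch. It does not use the commons-httpclient API; a java.util.concurrent.Semaphore stands in for the connection pool, and the class name PerHostPoolSketch and the method acquired are hypothetical names chosen for illustration only.

```java
import java.util.concurrent.Semaphore;

public class PerHostPoolSketch {
    // Hypothetical model: each fetcher thread needs one connection to the same
    // host and holds it (never releases it), like a long-running fetch.
    // Returns how many of the fetchers obtain a connection without waiting.
    static int acquired(int maxPerHost, int fetchers) {
        Semaphore pool = new Semaphore(maxPerHost);
        int ok = 0;
        for (int i = 0; i < fetchers; i++) {
            // tryAcquire() fails immediately when the pool is exhausted; a real
            // pool would wait and then report a timeout instead.
            if (pool.tryAcquire()) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        // With a per-host limit of 2, only 2 of 5 fetchers get a connection.
        System.out.println(acquired(2, 5));
        // Raising the per-host limit to the thread count serves all of them.
        System.out.println(acquired(5, 5));
    }
}
```

Under this model, matching the per-host limit to the number of fetcher threads is what removes the starvation, which seems to be the idea behind the first part of the patch.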
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711150#comment-13711150 ] lufeng commented on NUTCH-1613: --- Will this specified cookie string affect all crawled URLs? > Timeouts in protocol-httpclient when crawling same host with >2 threads and > added cookie strings for both http protocols > > > Key: NUTCH-1613 > URL: https://issues.apache.org/jira/browse/NUTCH-1613 > Project: Nutch > Issue Type: Bug > Components: protocol > Affects Versions: 2.2.1 > Reporter: Brian > Priority: Minor > Labels: patch > Attachments: NUTCH-1613.patch > > > 1.) When using protocol-httpclient to crawl a single website (the same host) > I would always get a bunch of timeout errors during fetching and the pages > with errors would not be fetched. E.g.: > 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www > failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: > Timeout waiting for connection > 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www > (queue crawl delay=0ms) > 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following > error: > org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting > for connection > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497) > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) > at > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174) > at >
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518) > This is because by default the connection pool manager only allows 2 > connections per host so if more than 2 threads are used the others will tend > to time out waiting to get a connection. The code previously set max > connections correctly but not connections per host. > 2.) I also added at the same time simple modifications to both protocol-http > and protocol-httpclient to allow specifying a cookie string in the conf file > to include in request headers. > I use this to crawl site content requiring authentication - it is better for > me to specify the cookie string for the authentication than go through the > whole authentication process and specifying login info. > The nutch-site.xml property is the following: > <property> > <name>http.cookie_string</name> > <value>XX_AL=authorization_value_goes_here</value> > <description>String to use as the cookie value for HTTP > requests</description> > </property> > Although I use it for authentication it can be used to specify any single > cookie string for the crawl (httpclient does support different cookies for > different hosts but I did not get into that). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700120#comment-13700120 ] lufeng commented on NUTCH-1602: --- Committed in trunk for rev. 1499779. Thanks Markus. > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch > > > the dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > so I improve the Metadata format to this > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1602. --- Resolution: Fixed > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch > > > the dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > so I improve the Metadata format to this > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1602: -- Attachment: NUTCH-1602-2.patch > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602-2.patch, NUTCH-1602.patch > > > the dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > so I improve the Metadata format to this > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700082#comment-13700082 ] lufeng commented on NUTCH-1602: --- Hi Markus, this format is only used in the *normal* output, not in the CSV output; there are two different CrawlDatum output formats. The normal output now looks like this, which is better than the previous one. {code:xml} http://www.baidu.com/ Version: 7 Status: 3 (db_gone) Fetch time: Sat Aug 17 22:35:37 CST 2013 Modified time: Thu Jan 01 08:00:00 CST 1970 Retries since fetch: 0 Retry interval: 3888000 seconds (45 days) Score: 1.0 Signature: null Metadata: m1=v22 m3=v3 m2=v2 m5=v5 m4=m4 _pst_=robots_denied(18), lastModified=0 m6=v6 {code} Thanks Julien and Tejas. > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb > Affects Versions: 1.7 > Reporter: lufeng > Assignee: lufeng > Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602.patch > > > The dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > So I improved the Metadata format to this: > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1602) improve the readability of metadata in readdb dump normal
lufeng created NUTCH-1602: - Summary: improve the readability of metadata in readdb dump normal Key: NUTCH-1602 URL: https://issues.apache.org/jira/browse/NUTCH-1602 Project: Nutch Issue Type: Improvement Components: crawldb Affects Versions: 1.7 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 1.8 the dumped metadata format is not readable. {code:xml} $bin/nutch readdb crawldb/ -dump dir http://www.baidu.com/ Version: 7 Status: 3 (db_gone) Fetch time: Sat Aug 17 22:35:37 CST 2013 Modified time: Thu Jan 01 08:00:00 CST 1970 Retries since fetch: 0 Retry interval: 3888000 seconds (45 days) Score: 1.0 Signature: null Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), lastModified=0m6: v6 {code} so I improve the Metadata format to this {code:xml} Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), lastModified=0;m6=v6; {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
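The proposed "key=value;" layout from the issue description can be sketched as follows. This is not the code from NUTCH-1602.patch; the class name MetadataDumpSketch and the method formatMetadata are hypothetical, and a plain java.util.Map is assumed in place of the CrawlDatum metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataDumpSketch {
    // Hypothetical formatter: writes each entry as key=value followed by ';'
    // so that adjacent entries no longer run together in the dump.
    static String formatMetadata(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("Metadata: ");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            sb.append(entry.getKey())
              .append('=')
              .append(entry.getValue())
              .append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("m1", "v22");
        metadata.put("m3", "v3");
        System.out.println(formatMetadata(metadata)); // Metadata: m1=v22;m3=v3;
    }
}
```

An explicit separator after every entry is what makes the old output ("m1: v22m3: v3") unambiguous to read.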
[jira] [Updated] (NUTCH-1602) improve the readability of metadata in readdb dump normal
[ https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1602: -- Attachment: NUTCH-1602.patch > improve the readability of metadata in readdb dump normal > -- > > Key: NUTCH-1602 > URL: https://issues.apache.org/jira/browse/NUTCH-1602 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.7 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1602.patch > > > the dumped metadata format is not readable. > {code:xml} > $bin/nutch readdb crawldb/ -dump dir > http://www.baidu.com/ Version: 7 > Status: 3 (db_gone) > Fetch time: Sat Aug 17 22:35:37 CST 2013 > Modified time: Thu Jan 01 08:00:00 CST 1970 > Retries since fetch: 0 > Retry interval: 3888000 seconds (45 days) > Score: 1.0 > Signature: null > Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), > lastModified=0m6: v6 > {code} > so I improve the Metadata format to this > {code:xml} > Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), > lastModified=0;m6=v6; > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1600) Injector overwrite does not always work properly
[ https://issues.apache.org/jira/browse/NUTCH-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699034#comment-13699034 ] lufeng commented on NUTCH-1600: --- Tests work fine. +1 > Injector overwrite does not always work properly > > > Key: NUTCH-1600 > URL: https://issues.apache.org/jira/browse/NUTCH-1600 > Project: Nutch > Issue Type: Bug > Components: injector > Affects Versions: 1.7 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Fix For: 1.8 > > Attachments: NUTCH-1600-1.8.patch > > > db.injector.update works as it should but db.injector.overwrite doesn't > always seem to properly overwrite the record. This issue has existed for some time > and we've already fixed it in our dist of Nutch. > This record has just been updated (interval). > {code} > Injector: starting at 2013-07-03 10:34:15 > Injector: crawlDb: crawl/crawldb > Injector: urlDir: seeds > Injector: Converting injected urls to crawl db entries. > Injector: total number of urls rejected by filters: 0 > Injector: total number of urls injected after normalization and filtering: 9 > Injector: Merging injected urls into crawl db. > Injector: finished at 2013-07-03 10:34:21, elapsed: 00:00:05 > URL: url > Version: 7 > Status: 2 (db_fetched) > Fetch time: Fri Jul 05 12:11:44 CEST 2013 > Modified time: Fri Jun 28 12:11:44 CEST 2013 > Retries since fetch: 0 > Retry interval: 604800 seconds (7 days) > Score: 0.0 > Signature: ba29ef3e680323a6d0da74c156800e03 > Metadata: Content-Type: text/html_pst_: success(1), lastModified=0 > {code} > If we now overwrite the record, nothing happens. With this patch installed it > overwrites the record as it should and also logs update & overwrite switches > to console: > {code} > Injector: starting at 2013-07-03 10:36:30 > Injector: crawlDb: crawl/crawldb > Injector: urlDir: seeds > Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0 > Injector: total number of urls injected after normalization and filtering: 9 > Injector: Merging injected urls into crawl db. > Injector: overwrite: true > Injector: update: false > Injector: finished at 2013-07-03 10:36:36, elapsed: 00:00:05 > URL: url > Version: 7 > Status: 1 (db_unfetched) > Fetch time: Wed Jul 03 10:36:30 CEST 2013 > Modified time: Thu Jan 01 01:00:00 CET 1970 > Retries since fetch: 0 > Retry interval: 14000 seconds (0 days) > Score: 1.0 > Signature: null > Metadata: fixedInterval: 14000.0 > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata
[ https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696865#comment-13696865 ] lufeng commented on NUTCH-1581: --- I have tested it with Nutch 1.x and it works fine. +1 > CrawlDB csv output to include metadata > -- > > Key: NUTCH-1581 > URL: https://issues.apache.org/jira/browse/NUTCH-1581 > Project: Nutch > Issue Type: Improvement > Components: crawldb > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.8 > > Attachments: NUTCH-1581-1.8.patch > > > Dumping the CrawlDB to CSV should include the CrawlDatum's metadata. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1327) QueryStringNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854 ] lufeng commented on NUTCH-1327: --- Hi Markus, I tested your patch. Did you forget to add the deploy and test targets to src/plugin/build.xml? +1 > QueryStringNormalizer > - > > Key: NUTCH-1327 > URL: https://issues.apache.org/jira/browse/NUTCH-1327 > Project: Nutch > Issue Type: New Feature > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Fix For: 1.9 > > Attachments: NUTCH-1327-1.8-1.patch > > > A normalizer for dealing with query strings. Sorting query strings is helpful > in preventing duplicates for some (bad) websites. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798 ] lufeng commented on NUTCH-1594: --- Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis. > count variable is never changed in ParseUtil class > -- > > Key: NUTCH-1594 > URL: https://issues.apache.org/jira/browse/NUTCH-1594 > Project: Nutch > Issue Type: Bug > Components: parser > Affects Versions: 2.2 > Reporter: lufeng > Assignee: lufeng > Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1594.patch > > > In the ParseUtil class the count variable is never changed. The code is like this: > for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) > so even if you define the "db.max.outlinks.per.page" parameter, it will not > take effect. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
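The loop condition quoted in the issue only limits anything if count is incremented inside the loop body. A minimal, self-contained sketch of that idea follows; it is not the actual ParseUtil code or the committed patch, and the class name OutlinkLimitSketch, the method limitOutlinks, and the null-skipping rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkLimitSketch {
    // Hypothetical helper: keeps at most maxOutlinks usable entries.
    static List<String> limitOutlinks(String[] outlinks, int maxOutlinks) {
        List<String> kept = new ArrayList<>();
        int count = 0;
        for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
            if (outlinks[i] == null) {
                continue; // skipped entries do not count toward the limit
            }
            kept.add(outlinks[i]);
            count++; // without this increment the limit never takes effect
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] links = {"a", null, "b", "c", "d"};
        System.out.println(limitOutlinks(links, 2)); // [a, b]
    }
}
```

A separate counter, rather than the index i, is needed because not every array entry is kept, so the index and the number of accepted outlinks can differ.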
[jira] [Assigned] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng reassigned NUTCH-1594: - Assignee: lufeng > count variable is never changed in ParseUtil class > -- > > Key: NUTCH-1594 > URL: https://issues.apache.org/jira/browse/NUTCH-1594 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 2.2 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1594.patch > > > in ParseUtil class the count variable is never change. the code is like this > for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) > so even if you define the "db.max.outlinks.per.page" parameter, it will not > take effect. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Attachment: NUTCH-1594.patch > count variable is never changed in ParseUtil class > -- > > Key: NUTCH-1594 > URL: https://issues.apache.org/jira/browse/NUTCH-1594 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 2.2 >Reporter: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1594.patch > > > in ParseUtil class the count variable is never change. the code is like this > for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) > so even if you define the "db.max.outlinks.per.page" parameter, it will not > take effect. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Patch Info: Patch Available > count variable is never changed in ParseUtil class > -- > > Key: NUTCH-1594 > URL: https://issues.apache.org/jira/browse/NUTCH-1594 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 2.2 >Reporter: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1594.patch > > > In the ParseUtil class the count variable is never changed. The loop looks like this: > for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) > so even if you set the "db.max.outlinks.per.page" parameter, it does not > take effect.
[jira] [Updated] (NUTCH-1594) count variable is never changed in ParseUtil class
[ https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1594: -- Description: In the ParseUtil class the count variable is never changed. The loop looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) so even if you set the "db.max.outlinks.per.page" parameter, it does not take effect. was: In the ParseUtil class the count variable is never changed. The loop looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) Summary: count variable is never changed in ParseUtil class (was: count variable is never in ParseUtil ) > count variable is never changed in ParseUtil class > -- > > Key: NUTCH-1594 > URL: https://issues.apache.org/jira/browse/NUTCH-1594 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 2.2 >Reporter: lufeng >Priority: Minor > Fix For: 2.3 > > > In the ParseUtil class the count variable is never changed. The loop looks like this: > for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) > so even if you set the "db.max.outlinks.per.page" parameter, it does not > take effect.
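The loop quoted in this issue can be illustrated with a minimal, self-contained sketch. This is not the attached NUTCH-1594.patch and not the actual ParseUtil code: the class name OutlinkLimitSketch, the method limit, and the rule that empty links are skipped are all assumptions made for illustration. It only shows why the loop condition needs count to be incremented for each accepted outlink before a limit such as "db.max.outlinks.per.page" can take effect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and the skip rule are illustrative assumptions,
// not the real ParseUtil implementation.
public class OutlinkLimitSketch {

  /** Returns at most maxOutlinks non-empty outlinks, in their original order. */
  public static List<String> limit(String[] outlinks, int maxOutlinks) {
    List<String> kept = new ArrayList<>();
    int count = 0;
    // Same loop header as quoted in the issue description.
    for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
      String url = outlinks[i];
      if (url == null || url.isEmpty()) {
        continue; // skipped links do not count against the limit
      }
      kept.add(url);
      count++; // without this increment the limit is never reached
    }
    return kept;
  }

  public static void main(String[] args) {
    String[] outlinks = {
      "http://a.example/", "", "http://b.example/", "http://c.example/"
    };
    // With a limit of 2, only the first two non-empty links are kept.
    System.out.println(limit(outlinks, 2));
  }
}
```

If the count++ line is removed, count stays at 0, the condition count < maxOutlinks is true for any positive limit, and every outlink is kept regardless of the configured value, which matches the behaviour described in the report.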
[jira] [Created] (NUTCH-1594) count variable is never in ParseUtil
lufeng created NUTCH-1594: - Summary: count variable is never in ParseUtil Key: NUTCH-1594 URL: https://issues.apache.org/jira/browse/NUTCH-1594 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2 Reporter: lufeng Priority: Minor Fix For: 2.3 In the ParseUtil class the count variable is never changed. The loop looks like this: for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686830#comment-13686830 ] lufeng commented on NUTCH-1527: --- Thanks Markus, I tried the patch and could index the document successfully. +1 for commit. > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: Markus Jelsma >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, > NUTCH-1527.patch, NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13685661#comment-13685661 ] lufeng commented on NUTCH-1527: --- Hi Markus, I have tested the newest patch on my laptop. Very cool. +1 for commit. {code:xml} lemo@debian:~/Workspace/java/apache-workspace/nutch-svn/runtime/local$ bin/nutch index crawldb/ segmetns/20130617225826/ Indexer: starting at 2013-06-17 23:46:47 Indexer: deleting gone documents: false Indexer: URL filtering: false Indexer: URL normalizing: false Active IndexWriters : ElasticIndexWriter elastic.cluster : elastic prefix cluster elastic.index : elastic index command elastic.max.bulk.docs : elastic bulk index doc counts. (default 500) elastic.max.bulk.size : elastic bulk index length. (default 5001001 ~5MB) Processing remaining requests [docs = 1, length = 7528, total docs = 1] Processing to finalize last execute Previous took in ms 27, including wait 21 Indexer: finished at 2013-06-17 23:46:57, elapsed: 00:00:10 {code} One question: should we add the elastic.cluster and elastic.index properties to the nutch-default.xml file? > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: Markus Jelsma >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch, > NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682380#comment-13682380 ] lufeng commented on NUTCH-1527: --- Hi Markus, 1. Elasticsearch loads its configuration file first, so you need to add config/elasticsearch.yml in your runtime/local/config. But I have not found any method to load the configuration file via the configuration. 2. Do you still have lucene-core-3.4.jar in your runtime/local/lib directory? Or did you add this {code:xml} + {code} code to the ivy/ivy.xml file? Maybe Elasticsearch cannot load classes in the Nutch plugin system. > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: Markus Jelsma >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch, NUTCH-1527.patch, NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Closed] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1575. - > support solr authentication in nutch 2.x > > > Key: NUTCH-1575 > URL: https://issues.apache.org/jira/browse/NUTCH-1575 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1575.patch > > > can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1545. --- Resolution: Fixed > capture batchId and remove references to segments in 2.x crawl script. > -- > > Key: NUTCH-1545 > URL: https://issues.apache.org/jira/browse/NUTCH-1545 > Project: Nutch > Issue Type: Task >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch > > > The concept of segment is replaced by batchId in 2.x > I'm currently getting rid of segments references in 2.x > This issue was flagged up and separate from NUTCH-1532 which I am working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1545: -- Fix Version/s: (was: 2.3) 2.2 > capture batchId and remove references to segments in 2.x crawl script. > -- > > Key: NUTCH-1545 > URL: https://issues.apache.org/jira/browse/NUTCH-1545 > Project: Nutch > Issue Type: Task >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch > > > The concept of segment is replaced by batchId in 2.x > I'm currently getting rid of segments references in 2.x > This issue was flagged up and separate from NUTCH-1532 which I am working on. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13670376#comment-13670376 ] lufeng commented on NUTCH-1545: --- Committed for Nutch 2.2, revision 1487875, by Feng. Thanks Tejas and Lewis. > capture batchId and remove references to segments in 2.x crawl script. > -- > > Key: NUTCH-1545 > URL: https://issues.apache.org/jira/browse/NUTCH-1545 > Project: Nutch > Issue Type: Task >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.3 > > Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch > > > The concept of segment is replaced by batchId in 2.x > I'm currently getting rid of segments references in 2.x > This issue was flagged up and separate from NUTCH-1532 which I am working on.
[jira] [Resolved] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1575. --- Resolution: Fixed > support solr authentication in nutch 2.x > > > Key: NUTCH-1575 > URL: https://issues.apache.org/jira/browse/NUTCH-1575 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1575.patch > > > can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669351#comment-13669351 ] lufeng commented on NUTCH-1575: --- Committed for 2.2 revision 1487521 by Feng. Thanks Lewis > support solr authentication in nutch 2.x > > > Key: NUTCH-1575 > URL: https://issues.apache.org/jira/browse/NUTCH-1575 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1575.patch > > > can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng resolved NUTCH-1563. --- Resolution: Fixed > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The getFields method in FetchSchedule is never used, so if a user extends > FetchSchedule and wants to get some fields of WebPage, it always returns > null.
[jira] [Closed] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng closed NUTCH-1563. - > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The method of getFields in FetchSchedule if never used, so if user extends > the FetchSchedule and want to get some fields of WebPage, it always return > null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667775#comment-13667775 ] lufeng commented on NUTCH-1527: --- Hi Luca, you can now click "Assign to me" and then attach your improvement patch. Thanks, Luca. > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1527: -- Assignee: (was: lufeng) > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667766#comment-13667766 ] lufeng commented on NUTCH-1527: --- Hi Luca, sorry for my delayed reply. Yes, you can improve this patch following your suggestion. Can I assign this issue to you? I am willing to test it. Thanks, Luca. -- Don't Grow Old, Grow Up... :-) > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.4 > > Attachments: NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1563: -- Fix Version/s: (was: 2.3) 2.2 > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The method of getFields in FetchSchedule if never used, so if user extends > the FetchSchedule and want to get some fields of WebPage, it always return > null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665161#comment-13665161 ] lufeng commented on NUTCH-1563: --- Hi Tejas, yes, I pushed this patch to 2.x: https://svn.apache.org/repos/asf/nutch/branches/2.x Do you mean that I pushed it to the wrong place? > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The getFields method in FetchSchedule is never used, so if a user extends > FetchSchedule and wants to get some fields of WebPage, it always returns > null.
[jira] [Updated] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1575: -- Attachment: NUTCH-1575.patch add solr authentication > support solr authentication in nutch 2.x > > > Key: NUTCH-1575 > URL: https://issues.apache.org/jira/browse/NUTCH-1575 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1575.patch > > > can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Work started] (NUTCH-1575) support solr authentication in nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1575 started by lufeng. > support solr authentication in nutch 2.x > > > Key: NUTCH-1575 > URL: https://issues.apache.org/jira/browse/NUTCH-1575 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > > can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1575) support solr authentication in nutch 2.x
lufeng created NUTCH-1575: - Summary: support solr authentication in nutch 2.x Key: NUTCH-1575 URL: https://issues.apache.org/jira/browse/NUTCH-1575 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.1 Reporter: lufeng Assignee: lufeng Priority: Minor Fix For: 2.2 can solr authentication in nutch 2.x like 1.x -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.
[ https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662057#comment-13662057 ] lufeng commented on NUTCH-1545: --- Hi Tejas, yes, the patch just moves the random batchId generator from the code into the crawl script. Users don't have to bother with this. > capture batchId and remove references to segments in 2.x crawl script. > -- > > Key: NUTCH-1545 > URL: https://issues.apache.org/jira/browse/NUTCH-1545 > Project: Nutch > Issue Type: Task >Affects Versions: 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch > > > The concept of segment is replaced by batchId in 2.x > I'm currently getting rid of segments references in 2.x > This issue was flagged up and separate from NUTCH-1532 which I am working on.
[jira] [Updated] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1563: -- Fix Version/s: (was: 2.3) 2.2 > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The method of getFields in FetchSchedule if never used, so if user extends > the FetchSchedule and want to get some fields of WebPage, it always return > null. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob
[ https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662003#comment-13662003 ] lufeng commented on NUTCH-1563: --- Committed for 2.2 revision 1484482 by Feng. Thanks Canan and Lewis. > FetchSchedule#getFields is never used by GeneraterJob > - > > Key: NUTCH-1563 > URL: https://issues.apache.org/jira/browse/NUTCH-1563 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 2.1 >Reporter: lufeng >Assignee: lufeng >Priority: Minor > Fix For: 2.2 > > Attachments: NUTCH-1563.patch > > > The getFields method in FetchSchedule is never used, so if a user extends > FetchSchedule and wants to get some fields of WebPage, it always returns > null.
[jira] [Updated] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufeng updated NUTCH-1527: -- Attachment: NUTCH-1527.patch Port the Elasticsearch indexer plugin to Nutch trunk. Before you apply this patch, you need to apply the one from https://issues.apache.org/jira/browse/NUTCH-1486 first. I use the newest version of Elasticsearch, 0.90.0, which uses Lucene 4.2.1. This patch needs more testing; I am a newbie to Elasticsearch. Any comments on this patch are welcome. Thanks Lewis. > Port nutch-elasticsearch-indexer to Nutch > - > > Key: NUTCH-1527 > URL: https://issues.apache.org/jira/browse/NUTCH-1527 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.6, 2.1 >Reporter: Lewis John McGibbney >Assignee: lufeng >Priority: Minor > Fix For: 2.3, 1.8 > > Attachments: NUTCH-1527.patch > > > The source repos for this can be found here [0]. > This issue should be inline with the work already done by Julien and others > over at NUTCH-1047. > [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer
[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1
[ https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651966#comment-13651966 ] lufeng commented on NUTCH-1486: --- Also, the version of lucene-core and solr-solrj in plugin.xml in the indexer-solr directory is still 3.4.0. > Upgrade to Solr 4.2.1 > - > > Key: NUTCH-1486 > URL: https://issues.apache.org/jira/browse/NUTCH-1486 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.6, 2.1 > Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT & Probably 2.2-SNAPHOT >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, > NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch > > > When attempting to configure a 4 multicore 4.0 instance with Nutch > schema-solr4.xml file, I get the following exceptions. > This has been discussed previously. As I see it we have two options > 1. Keep maintaining both schema options > 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml > Thoughts? 
> {code} > SEVERE: Unable to create core: collection4 > org.apache.solr.common.SolrException: Unable to use updateLog: _version_field > must exist in schema, using indexed="true" stored="true" and > multiValued="false" (_version_ does not exist) > at org.apache.solr.core.SolrCore.(SolrCore.java:721) > at org.apache.solr.core.SolrCore.(SolrCore.java:566) > at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850) > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534) > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356) > at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308) > at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107) > at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754) > at > org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258) > at > org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221) > at > org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699) > at > org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36) > at > org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183) > at > org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491) > at > org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138) > at > org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142) > at > org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53) > at 
org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604) > at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535) > at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398) > at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552) > at > org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63) > at > org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53) > at > org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91) > at org.eclipse.jetty.server.Server.doStart(Server.java:263) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59) > at > org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215) > at java.security.AccessController.doPrivileged(Native Method) > at > org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java
[jira] [Assigned] (NUTCH-1527) Port nutch-elasticsearch-indexer to Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng reassigned NUTCH-1527:
-----------------------------

    Assignee: lufeng

> Port nutch-elasticsearch-indexer to Nutch
> -----------------------------------------
>
> Key: NUTCH-1527
> URL: https://issues.apache.org/jira/browse/NUTCH-1527
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.6, 2.1
> Reporter: Lewis John McGibbney
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.3, 1.8
>
> The source repos for this can be found here [0].
> This issue should be in line with the work already done by Julien and others over at NUTCH-1047.
> [0] https://github.com/ctjmorgan/nutch-elasticsearch-indexer

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1486) Upgrade to Solr 4.2.1
[ https://issues.apache.org/jira/browse/NUTCH-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651936#comment-13651936 ]

lufeng commented on NUTCH-1486:
-------------------------------

Hi Lewis,
The solr-solrj dependency version in pom.xml is still 3.1.0. Should we upgrade it to 4.2.1?

> Upgrade to Solr 4.2.1
> ---------------------
>
> Key: NUTCH-1486
> URL: https://issues.apache.org/jira/browse/NUTCH-1486
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.6, 2.1
> Environment: Solr 4.0, Nutch trunk 1.6-SNAPSHOT & Probably 2.2-SNAPSHOT
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
> Attachments: NUTCH-1486-2.x.patch, NUTCH-1486-2.x.v2.patch, NUTCH-1486-nutchgora.patch, NUTCH-1486-trunk.patch, NUTCH-1486-trunk.v2.patch
>
> When attempting to configure a 4 multicore 4.0 instance with Nutch schema-solr4.xml file, I get the following exceptions.
> This has been discussed previously. As I see it we have two options
> 1. Keep maintaining both schema options
> 2. Ditch the more complex schema-solr4.xml in favour of vanilla schema.xml
> Thoughts?
> {code}
> SEVERE: Unable to create core: collection4
> org.apache.solr.common.SolrException: Unable to use updateLog: _version_field must exist in schema, using indexed="true" stored="true" and multiValued="false" (_version_ does not exist)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:721)
>   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:566)
>   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:850)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:534)
>   at org.apache.solr.core.CoreContainer.load(CoreContainer.java:356)
>   at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:308)
>   at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:107)
>   at org.eclipse.jetty.servlet.FilterHolder.doStart(FilterHolder.java:114)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:754)
>   at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:258)
>   at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1221)
>   at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:699)
>   at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:454)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.deploy.bindings.StandardStarter.processBinding(StandardStarter.java:36)
>   at org.eclipse.jetty.deploy.AppLifeCycle.runBindings(AppLifeCycle.java:183)
>   at org.eclipse.jetty.deploy.DeploymentManager.requestAppGoal(DeploymentManager.java:491)
>   at org.eclipse.jetty.deploy.DeploymentManager.addApp(DeploymentManager.java:138)
>   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.fileAdded(ScanningAppProvider.java:142)
>   at org.eclipse.jetty.deploy.providers.ScanningAppProvider$1.fileAdded(ScanningAppProvider.java:53)
>   at org.eclipse.jetty.util.Scanner.reportAddition(Scanner.java:604)
>   at org.eclipse.jetty.util.Scanner.reportDifferences(Scanner.java:535)
>   at org.eclipse.jetty.util.Scanner.scan(Scanner.java:398)
>   at org.eclipse.jetty.util.Scanner.doStart(Scanner.java:332)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.deploy.providers.ScanningAppProvider.doStart(ScanningAppProvider.java:118)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.deploy.DeploymentManager.startAppProvider(DeploymentManager.java:552)
>   at org.eclipse.jetty.deploy.DeploymentManager.doStart(DeploymentManager.java:227)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:63)
>   at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:53)
>   at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:91)
>   at org.eclipse.jetty.server.Server.doStart(Server.java:263)
>   at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:59)
>   at org.eclipse.jetty.xml.XmlConfiguration$1.run(XmlConfiguration.java:1215)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.jav
[jira] [Comment Edited] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641869#comment-13641869 ]

lufeng edited comment on NUTCH-1555 at 4/25/13 2:48 PM:
-------------------------------------------------------

Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.
Should we switch every tool to commons-cli? I find that 47 files would need to be moved over.
Sebastian:
1. Used eclipse-codeformat.xml to format the source code.
Thanks Lewis and Sebastian.

was (Author: amuseme.lu):
Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.
Should we switch every tool to commons-cli? I find that 47 files would need to be moved over.
[~wastl-nagel]
1. Used eclipse-codeformat.xml to format the source code.
Thanks Lewis and Sebastian.

> Move to commons-cli for command line parsing
> --------------------------------------------
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Reporter: Lewis John McGibbney
> Assignee: lufeng
> Fix For: 2.2
> Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
> ParserJob: starting
> ParserJob: resuming: false
> ParserJob: forced reparse:false
> ParserJob: batchId: updatedb
> ParserJob: success
> {code}
> This is a bug for sure
[jira] [Updated] (NUTCH-1555) Move to commons-cli for command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1555:
--------------------------

    Attachment: NUTCH-1555-v1.patch

Lewis:
1. Fixed the fetch NPE bug.
2. Fixed the bug where update did not work.
Should we switch every tool to commons-cli? I find that 47 files would need to be moved over.
[~wastl-nagel]
1. Used eclipse-codeformat.xml to format the source code.
Thanks Lewis and Sebastian.

> Move to commons-cli for command line parsing
> --------------------------------------------
>
> Key: NUTCH-1555
> URL: https://issues.apache.org/jira/browse/NUTCH-1555
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.1
> Reporter: Lewis John McGibbney
> Assignee: lufeng
> Fix For: 2.2
> Attachments: NUTCH-1555.patch, NUTCH-1555-v1.patch
>
> I just accidentally passed in the following argument to parser job
> {code}
> law@CEE279Law3-Linux:~/Downloads/asf/2.x/runtime/local$ ./bin/nutch parse updatedb
> ParserJob: starting
> ParserJob: resuming: false
> ParserJob: forced reparse:false
> ParserJob: batchId: updatedb
> ParserJob: success
> {code}
> This is a bug for sure
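The behaviour reported in NUTCH-1555, where `./bin/nutch parse updatedb` is accepted and `updatedb` is silently treated as the batch id, might be avoided by a parser that rejects anything it does not recognize. The following is only a minimal sketch of that idea using the plain JDK rather than commons-cli; the class name `StrictParseArgs` and the `-batchId` / `-all` option names are assumptions made for illustration and are not taken from Nutch's ParserJob or from the attached patches.

```java
import java.util.Optional;

// Hypothetical helper, not part of Nutch: strict parsing of "parse" job arguments.
final class StrictParseArgs {

    /**
     * Returns the batch id to parse, or an empty Optional when -all was given.
     * Throws IllegalArgumentException for unknown or missing arguments, so a
     * stray word such as "updatedb" is rejected instead of becoming a batch id.
     */
    public static Optional<String> parse(String[] args) {
        String batchId = null;
        boolean all = false;
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-batchId": // assumed option name
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("-batchId requires a value");
                    }
                    batchId = args[++i];
                    break;
                case "-all": // assumed option name
                    all = true;
                    break;
                default:
                    throw new IllegalArgumentException("Unrecognized argument: " + args[i]);
            }
        }
        // Exactly one of the two modes must be selected.
        if (all == (batchId != null)) {
            throw new IllegalArgumentException("Specify exactly one of -batchId <id> or -all");
        }
        return Optional.ofNullable(batchId);
    }

    public static void main(String[] args) {
        try {
            Optional<String> batch = parse(args);
            System.out.println("ParserJob: batchId: " + batch.orElse("(all)"));
        } catch (IllegalArgumentException e) {
            System.err.println("Usage error: " + e.getMessage());
            System.exit(1);
        }
    }
}
```

With this sketch, passing a bare `updatedb` ends with a usage error and a non-zero exit code instead of a reported success. A commons-cli based version would presumably declare the same options and let the library report unrecognized ones.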