[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892075#comment-13892075 ]

Koen Smets commented on NUTCH-1556:
-----------------------------------

This should be reconsidered, as it causes a lot of already fetched pages to be refetched, as indicated by NUTCH-1679.

enabling updatedb to accept batchId
-----------------------------------

Key: NUTCH-1556
URL: https://issues.apache.org/jira/browse/NUTCH-1556
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
Fix For: 2.3
Attachments: NUTCH-1556-batchId.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch, NUTCH-1556.patch

The idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it skips the rows that do not match the batchId), but I am worried about whether and how it might affect the sorting in the reduce phase. Anyway, check it out. It also changes the command-line usage to this:

Usage: DbUpdaterJob (batchId | -all) [-crawlId id]

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
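The row skipping mentioned in the issue description can be pictured with a small, self-contained sketch. This is not the code from the attached patches or from DbUpdateMapper.java; the class `BatchFilterSketch`, its methods, and the sample URLs and batch ids are hypothetical, and the only behaviour taken from the thread is that rows whose batchId does not match are skipped and that `-all` disables the filter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the actual Nutch DbUpdateMapper. */
final class BatchFilterSketch {

    /** Stands in for the "-all" value from the usage string. */
    static final String ALL = "-all";

    /** Returns true if a row marked with rowBatchId should be processed. */
    static boolean shouldProcess(String rowBatchId, String requestedBatchId) {
        if (ALL.equals(requestedBatchId)) {
            return true; // "-all": no filtering, every row is processed
        }
        // Assumption: rows with no mark, or a mark from another batch, are skipped.
        return rowBatchId != null && rowBatchId.equals(requestedBatchId);
    }

    /** Returns the URLs of the rows that pass the filter, in input order. */
    static List<String> select(Map<String, String> batchIdByUrl, String requestedBatchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> row : batchIdByUrl.entrySet()) {
            if (shouldProcess(row.getValue(), requestedBatchId)) {
                selected.add(row.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://example.org/a", "batch-1");
        rows.put("http://example.org/b", "batch-2");
        rows.put("http://example.org/c", null); // never marked by a fetch
        System.out.println(select(rows, "batch-1"));
        System.out.println(select(rows, ALL));
    }
}
```

With a concrete batchId only the rows of that batch reach the update step, which is what allows updatedb for one batch to run while another batch is still being fetched.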
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892364#comment-13892364 ]

Lewis John McGibbney commented on NUTCH-1556:
---------------------------------------------

Hi [~ksmets], this issue has been addressed and resolved. Would you care to submit your comments or a patch for NUTCH-1679? Also, can you explain in more detail what you mean by

bq. this causes a lot of refetched pages

Do you mean that this causes a lot of fetched pages to be overwritten and refetched?
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892423#comment-13892423 ]

Koen Smets commented on NUTCH-1556:
-----------------------------------

Hi [~lewismc], I confirmed NUTCH-1679 on the Cassandra store. Although the `bin/crawl` script changed in [NUTCH-1556] differs from the one in 2.2.1 only by the added `$batchId`, this changes behaviour drastically. A lot of pages get refetched sooner than indicated by db.default.fetch.interval, and they outnumber the unfetched pages. Although I noticed a remarkable speedup when focusing only on the pages from the current batch, I changed `$batchId` to `-all` in order to give preference to the pages that are truly unfetched.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839240#comment-13839240 ]

Otis Gospodnetic commented on NUTCH-1556:
-----------------------------------------

[~tiennm] it looks like you added a patch to this issue, but the issue is already marked Resolved and Fixed. [~amuseme.lu], do you want to commit this new patch before the 2.3 release?
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765243#comment-13765243 ]

Julien Nioche commented on NUTCH-1556:
--------------------------------------

Guys, you have broken the crawl script for 2.x. The usage is

Usage: DbUpdaterJob (batchId | -all) [-crawlId id]

but you are passing '-batchId $batchId'. Can you please fix this? Thanks

Attachments: NUTCH-1556.patch, NUTCH-1556-v2.patch, NUTCH-1556-v3.patch
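To make the mismatch concrete, here is a minimal sketch of a parser that accepts exactly the usage string `DbUpdaterJob (batchId | -all) [-crawlId id]`. It is not the argument handling in DbUpdaterJob.java; the class `UpdaterArgsSketch` is hypothetical, and rejecting an unknown leading flag is an assumption made for illustration (the real job may react differently to `-batchId`). The batch id value used in `main` is invented.

```java
/** Hypothetical illustration only; not the actual Nutch DbUpdaterJob. */
final class UpdaterArgsSketch {

    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    final String batchId; // a concrete batch id, or the literal "-all"
    final String crawlId; // null when -crawlId is not given

    private UpdaterArgsSketch(String batchId, String crawlId) {
        this.batchId = batchId;
        this.crawlId = crawlId;
    }

    /** Parses arguments according to the usage string quoted in the thread. */
    static UpdaterArgsSketch parse(String... args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(USAGE);
        }
        String first = args[0];
        // The batch id is positional. Assumption: any flag other than "-all"
        // in the first position is treated as a usage error.
        if (first.startsWith("-") && !first.equals("-all")) {
            throw new IllegalArgumentException(USAGE);
        }
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i]; // consume the value that follows the flag
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new UpdaterArgsSketch(first, crawlId);
    }

    public static void main(String[] args) {
        UpdaterArgsSketch ok = parse("1378900000-12345", "-crawlId", "demo");
        System.out.println(ok.batchId + " / " + ok.crawlId);
        try {
            parse("-batchId", "1378900000-12345"); // the form the script was passing
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the script has to pass the batch id as a bare first argument rather than as the value of a `-batchId` flag.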
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410 ]

lufeng commented on NUTCH-1556:
-------------------------------

Oh, I'm so sorry. I have fixed this problem: committed revision 1522566 in 2.x HEAD. Thanks, Julien.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765413#comment-13765413 ]

Julien Nioche commented on NUTCH-1556:
--------------------------------------

No probs Lufeng. Thanks!
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765426#comment-13765426 ]

Hudson commented on NUTCH-1556:
-------------------------------

FAILURE: Integrated in Nutch-nutchgora #754 (See [https://builds.apache.org/job/Nutch-nutchgora/754/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1522566)
* /nutch/branches/2.x/src/bin/crawl
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759123#comment-13759123 ]

lufeng commented on NUTCH-1556:
-------------------------------

Committed revision 1520332 in 2.x HEAD. Thanks, kaveh.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759168#comment-13759168 ]

Hudson commented on NUTCH-1556:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #746 (See [https://builds.apache.org/job/Nutch-nutchgora/746/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1520332)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdaterJob.java
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756080#comment-13756080 ]

lufeng commented on NUTCH-1556:
-------------------------------

I will commit this unless there are objections.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752432#comment-13752432 ]

lufeng commented on NUTCH-1556:
-------------------------------

Thanks, kaveh. +1
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750394#comment-13750394 ]

Lewis John McGibbney commented on NUTCH-1556:
---------------------------------------------

It would be really nice to merge the proposals on NUTCH-1556 and NUTCH-1632.

Attachments: NUTCH-1556.patch
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803 ]

lufeng commented on NUTCH-1556:
-------------------------------

Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two patches into one. Could you check it out? Thanks.
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630657#comment-13630657 ]

Lewis John McGibbney commented on NUTCH-1556:
---------------------------------------------

Nice one Kaveh. I will check this out soon.

Fix For: 2.2
Attachments: NUTCH-1556.patch