[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2014-02-05 Thread Koen Smets (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892075#comment-13892075
 ] 

Koen Smets commented on NUTCH-1556:
---

This should be reconsidered, as it causes a lot of already fetched pages to be 
refetched, as indicated by NUTCH-1679.

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.3

 Attachments: NUTCH-1556-batchId.patch, NUTCH-1556-v2.patch, 
 NUTCH-1556-v3.patch, NUTCH-1556.patch


 So the idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about 
 whether and how it might affect the sorting in the reduce part. Anyway, check 
 it out. It also changes the command line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
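
 The row-skipping behaviour described above can be illustrated with a minimal
 sketch. This is not the code from the attached patches: the class name, the
 method names, and the use of a plain map in place of the web table are all
 hypothetical, and only the batchId / -all distinction is taken from the
 usage line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from DbUpdateMapper or DbUpdaterJob.
class BatchFilterSketch {
    // Assumed sentinel meaning "process every row", mirroring the -all option.
    static final String ALL_BATCHES = "-all";

    /** Returns true when a row marked with rowBatchId belongs to the requested batch. */
    static boolean shouldProcess(String requestedBatchId, String rowBatchId) {
        if (ALL_BATCHES.equals(requestedBatchId)) {
            return true; // -all: no row is skipped
        }
        // A row without a batch mark (null) never matches a concrete batch id.
        return requestedBatchId.equals(rowBatchId);
    }

    /** Keeps only the URLs whose batch mark matches the requested batch. */
    static List<String> select(String requestedBatchId, Map<String, String> urlToBatch) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> row : urlToBatch.entrySet()) {
            if (shouldProcess(requestedBatchId, row.getValue())) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://a.example/", "batch-1");
        rows.put("http://b.example/", "batch-2");
        rows.put("http://c.example/", null); // row with no batch mark
        System.out.println(select("batch-1", rows));
        System.out.println(select(ALL_BATCHES, rows));
    }
}
```

 In this sketch a concrete batch id narrows the update to rows from that
 batch, while -all leaves every row in scope, including rows that carry no
 batch mark.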



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2014-02-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892364#comment-13892364
 ] 

Lewis John McGibbney commented on NUTCH-1556:
-

Hi [~ksmets], this issue has been addressed and resolved. 
Do you care to submit your comments or a patch for NUTCH-1679?
Also can you explain more verbosely what you mean by 
bq. this causes a lot of refetched pages
Do you mean that this causes a lot of fetched pages to be overwritten and 
refetched?



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2014-02-05 Thread Koen Smets (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892423#comment-13892423
 ] 

Koen Smets commented on NUTCH-1556:
---

Hi [~lewismc], I confirmed NUTCH-1679 on Cassandra store.

Although the `bin/crawl` script from [NUTCH-1556] differs from the one in 2.2.1 
only by the added `$batchId`, this changes behaviour drastically. A lot of 
pages get refetched sooner than indicated by db.default.fetch.interval, and 
they outnumber the unfetched pages. 

Although I noticed the remarkable speedup when focusing only on the pages from 
the current batch, I changed `$batchId` to `-all` in order to give preference 
to the pages that are truly unfetched.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-12-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839240#comment-13839240
 ] 

Otis Gospodnetic commented on NUTCH-1556:
-

[~tiennm] it looks like you added a patch to this issue, but the issue is 
already marked Resolved and Fixed.

[~amuseme.lu], want to commit this new patch before 2.3 release?



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765243#comment-13765243
 ] 

Julien Nioche commented on NUTCH-1556:
--

Guys, 

you have broken the crawl script for 2.x. The usage is 

Usage: DbUpdaterJob (batchId | -all) [-crawlId id]

but you are passing '-batchId $batchId'. Can you please fix this?

Thanks
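
The mismatch between the usage line and the script can be made concrete with a
small argument parser. This is a sketch under assumptions, not the parsing
code of the real DbUpdaterJob: the class and method names are hypothetical,
and treating an unexpected flag as a usage error is an assumption, since the
thread does not say how the job reacted to '-batchId $batchId'.

```java
// Hypothetical parser for: DbUpdaterJob (batchId | -all) [-crawlId id]
class UpdaterArgsSketch {
    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    final String batchId; // a concrete batch id, or the literal "-all"
    final String crawlId; // null when -crawlId is not given

    UpdaterArgsSketch(String batchId, String crawlId) {
        this.batchId = batchId;
        this.crawlId = crawlId;
    }

    static UpdaterArgsSketch parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(USAGE);
        }
        // The first argument is positional. Any dash-prefixed token other
        // than "-all" (for example "-batchId") is rejected in this sketch.
        String batchId = args[0];
        if (batchId.startsWith("-") && !batchId.equals("-all")) {
            throw new IllegalArgumentException(USAGE);
        }
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i]; // consume the value that follows the flag
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new UpdaterArgsSketch(batchId, crawlId);
    }

    public static void main(String[] args) {
        UpdaterArgsSketch ok = parse(new String[] {"1378972800-12345", "-crawlId", "demo"});
        System.out.println(ok.batchId + " " + ok.crawlId);
        try {
            parse(new String[] {"-batchId", "1378972800-12345"});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the script has to pass the batch id as a bare first
argument ("$batchId"), or "-all", rather than behind a "-batchId" flag. The
batch id and crawl id values in main are made up for illustration.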


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765410#comment-13765410
 ] 

lufeng commented on NUTCH-1556:
---

Oh, I'm so sorry. I have already fixed this problem.

Committed revision 1522566 in 2.x HEAD.

Thanks, Julien.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765413#comment-13765413
 ] 

Julien Nioche commented on NUTCH-1556:
--

No probs Lufeng. Thanks!



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765426#comment-13765426
 ] 

Hudson commented on NUTCH-1556:
---

FAILURE: Integrated in Nutch-nutchgora #754 (See 
[https://builds.apache.org/job/Nutch-nutchgora/754/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1522566)
* /nutch/branches/2.x/src/bin/crawl




[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759123#comment-13759123
 ] 

lufeng commented on NUTCH-1556:
---

Committed revision 1520332 in 2.x HEAD
Thanks kaveh. 



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759168#comment-13759168
 ] 

Hudson commented on NUTCH-1556:
---

SUCCESS: Integrated in Nutch-nutchgora #746 (See 
[https://builds.apache.org/job/Nutch-nutchgora/746/])
NUTCH-1556 enabling updatedb to accept batchId (fenglu: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1520332)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
* /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdaterJob.java




[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-09-02 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756080#comment-13756080
 ] 

lufeng commented on NUTCH-1556:
---

I will commit this unless there are objections



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752432#comment-13752432
 ] 

lufeng commented on NUTCH-1556:
---

thanks kaveh. +1



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750394#comment-13750394
 ] 

Lewis John McGibbney commented on NUTCH-1556:
-

It would be really nice to merge the proposals from both NUTCH-1556 and NUTCH-1632.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-08-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750803#comment-13750803
 ] 

lufeng commented on NUTCH-1556:
---

Hi Lewis, I'm sorry, I created a duplicate issue. I will merge these two 
patches into one. Could you check it out? Thanks.



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630657#comment-13630657
 ] 

Lewis John McGibbney commented on NUTCH-1556:
-

Nice one Kaveh. I will check this out soon.

 enabling updatedb to accept batchId 
 

 Key: NUTCH-1556
 URL: https://issues.apache.org/jira/browse/NUTCH-1556
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
 Fix For: 2.2

 Attachments: NUTCH-1556.patch

