[Nutch-dev] [jira] Created: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list

2007-05-22 Thread Emmanuel Joke (JIRA)
Avoid parsing uneccessary links and get a more relevant outlink list Key: NUTCH-488 URL: https://issues.apache.org/jira/browse/NUTCH-488 Project: Nutch Issue Type:

[Nutch-dev] [jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-488: Attachment: DOMContentUtils.patch Avoid parsing uneccessary links and get a more relevant outlink

[Nutch-dev] [jira] Updated: (NUTCH-488) Avoid parsing uneccessary links and get a more relevant outlink list

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-488: Attachment: nutch-default.xml.patch Avoid parsing uneccessary links and get a more relevant

[Nutch-dev] [jira] Created: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread Emmanuel Joke (JIRA)
URLFilter-suffix management of the url path when the url contains some query parameters --- Key: NUTCH-489 URL: https://issues.apache.org/jira/browse/NUTCH-489

[Nutch-dev] [jira] Updated: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-489: Attachment: SuffixURLFilter.java.patch suffix-urlfilter.txt.patch URLFilter-suffix

[Nutch-dev] [jira] Updated: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-489: Attachment: SuffixURLFilter_v2.java.patch My mistake... I've added a new patch which is supposed

[Nutch-dev] [jira] Created: (NUTCH-500) Add hadoop masters configuration file into conf folder

2007-06-18 Thread Emmanuel Joke (JIRA)
Add hadoop masters configuration file into conf folder -- Key: NUTCH-500 URL: https://issues.apache.org/jira/browse/NUTCH-500 Project: Nutch Issue Type: Improvement Components:

[Nutch-dev] [jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-21 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922 ] Emmanuel Joke commented on NUTCH-503: - I just tried your patch and I'm afraid I still have the same issue.

[Nutch-dev] [jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469 ] Emmanuel Joke commented on NUTCH-503: - Sorry, my mistake. My compiled jar was not correctly included in my

[Nutch-dev] [jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039 ] Emmanuel Joke commented on NUTCH-503: - Results seem to be good. So I'm wondering if it is possible to commit this

[Nutch-dev] [jira] Created: (NUTCH-507) lib-lucene-analyzers jar defintion is wrong in plugin.xml

2007-07-07 Thread Emmanuel Joke (JIRA)
lib-lucene-analyzers jar defintion is wrong in plugin.xml - Key: NUTCH-507 URL: https://issues.apache.org/jira/browse/NUTCH-507 Project: Nutch Issue Type: Bug Environment:

[Nutch-dev] [jira] Created: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-08 Thread Emmanuel Joke (JIRA)
Update Crawldb: avoid to start a job if there is no valid segment - Key: NUTCH-509 URL: https://issues.apache.org/jira/browse/NUTCH-509 Project: Nutch Issue Type: Improvement

[Nutch-dev] [jira] Closed: (NUTCH-509) Update Crawldb: avoid to start a job if there is no valid segment

2007-07-09 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke closed NUTCH-509. --- Resolution: Won't Fix As explained by Doğacan, the Crawldb update behaves correctly. This patch is

[Nutch-dev] [jira] Created: (NUTCH-516) Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE

2007-07-17 Thread Emmanuel Joke (JIRA)
Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE Key: NUTCH-516 URL: https://issues.apache.org/jira/browse/NUTCH-516 Project: Nutch Issue Type: Bug

[Nutch-dev] [jira] Updated: (NUTCH-516) Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE

2007-07-18 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-516: Attachment: NUTCH-516.patch I fixed the issue by changing the FetchTime in

[Nutch-dev] [jira] Created: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Emmanuel Joke (JIRA)
Use URLValidator in the Injector Key: NUTCH-522 URL: https://issues.apache.org/jira/browse/NUTCH-522 Project: Nutch Issue Type: Improvement Components: injector Reporter: Emmanuel Joke

[Nutch-dev] [jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522.patch Patch provided Use URLValidator in the Injector

[Nutch-dev] [jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-19 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v2.patch Oops, my mistake. Please find an updated patch. Actually I've a

[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-20 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153 ] Emmanuel Joke commented on NUTCH-522: - Actually I tried to fetch the url

[Nutch-dev] [jira] Updated: (NUTCH-526) Use a combiner in LinDbMerger to improve the performance as in LinkDb

2007-07-24 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-526: Attachment: NUTCH-526.patch patch provided Use a combiner in LinDbMerger to improve the

[Nutch-dev] [jira] Created: (NUTCH-526) Use a combiner in LinDbMerger to improve the performance as in LinkDb

2007-07-24 Thread Emmanuel Joke (JIRA)
Use a combiner in LinDbMerger to improve the performance as in LinkDb - Key: NUTCH-526 URL: https://issues.apache.org/jira/browse/NUTCH-526 Project: Nutch Issue Type:

[Nutch-dev] [jira] Updated: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-07-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-528: Attachment: NUTCH-528.patch patch attached CrawlDbReader: add some new stats + dump into a csv

[Nutch-dev] [jira] Created: (NUTCH-528) CrawlDbReader: add some new stats + dump into a csv format

2007-07-26 Thread Emmanuel Joke (JIRA)
CrawlDbReader: add some new stats + dump into a csv format -- Key: NUTCH-528 URL: https://issues.apache.org/jira/browse/NUTCH-528 Project: Nutch Issue Type: Improvement

[Nutch-dev] [jira] Created: (NUTCH-529) NodeWalker.skipChildren don't wrok for more than 1 child.

2007-07-26 Thread Emmanuel Joke (JIRA)
NodeWalker.skipChildren don't wrok for more than 1 child. - Key: NUTCH-529 URL: https://issues.apache.org/jira/browse/NUTCH-529 Project: Nutch Issue Type: Bug Reporter:

[Nutch-dev] [jira] Updated: (NUTCH-529) NodeWalker.skipChildren don't wrok for more than 1 child.

2007-07-26 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-529: Attachment: NUTCH-529.patch patch attached NodeWalker.skipChildren don't wrok for more than 1

[Nutch-dev] [jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v3.patch Use URLValidator in the Injector

[Nutch-dev] [jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: NUTCH-522_v3.patch commons-validator's UrlValidator does not filter URLs with spaces.

[Nutch-dev] [jira] Updated: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-522: Attachment: (was: NUTCH-522_v3.patch) Use URLValidator in the Injector

[Nutch-dev] [jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-07-27 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138 ] Emmanuel Joke commented on NUTCH-522: - I tried with protocol-http and protocol-httpclient, and I got the same error

[Nutch-dev] [jira] Commented: (NUTCH-526) Use a combiner in LinDbMerger to improve the performance as in LinkDb

2007-07-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516209 ] Emmanuel Joke commented on NUTCH-526: - Actually I made a simple test on 2 small linkdbs, and I didn't see any

[Nutch-dev] [jira] Created: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-29 Thread Emmanuel Joke (JIRA)
Add a combiner to improve performance on updatedb - Key: NUTCH-530 URL: https://issues.apache.org/jira/browse/NUTCH-530 Project: Nutch Issue Type: Improvement Environment: java 1.6

[Nutch-dev] [jira] Updated: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-29 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-530: Attachment: NUTCH-530.patch Patch provided. It reduced the process time by 20%. Output from the

[Nutch-dev] [jira] Created: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
CrawlDbMerger: wrong computation of last fetch time --- Key: NUTCH-532 URL: https://issues.apache.org/jira/browse/NUTCH-532 Project: Nutch Issue Type: Bug Reporter: Emmanuel Joke

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch Patch provided. CrawlDbMerger: wrong computation of last fetch time

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: (was: NUTCH-532.patch) CrawlDbMerger: wrong computation of last fetch time

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532.patch CrawlDbMerger: wrong computation of last fetch time

[Nutch-dev] [jira] Created: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
LinkDbMerger: url normlaized is not updated in the key and inlinks list --- Key: NUTCH-533 URL: https://issues.apache.org/jira/browse/NUTCH-533 Project: Nutch Issue Type:

[Nutch-dev] [jira] Updated: (NUTCH-533) LinkDbMerger: url normlaized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-533: Attachment: NUTCH-533.patch Patch provided LinkDbMerger: url normlaized is not updated in the key

[Nutch-dev] [jira] Commented: (NUTCH-526) Use a combiner in LinDbMerger to improve the performance as in LinkDb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516358 ] Emmanuel Joke commented on NUTCH-526: - Could you please wait a few more days? I would like to wait for a

[Nutch-dev] [jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602 ] Emmanuel Joke commented on NUTCH-530: - I'm sure to follow your point regarding the outlinks number. I don't

[Nutch-dev] [jira] Updated: (NUTCH-533) LinkDbMerger: url normalized is not updated in the key and inlinks list

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-533: Attachment: NUTCH-533.patch Patch with typo fixed. LinkDbMerger: url normalized is not updated in

[Nutch-dev] [jira] Commented: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-07-30 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516618 ] Emmanuel Joke commented on NUTCH-532: - res.getFetchTime() - Math.round(res.getFetchInterval() * 1000d); always

[Nutch-dev] [jira] Updated: (NUTCH-534) SegmentMerger: add -normalize option

2007-07-31 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-534: Attachment: NUTCH-534.patch Patch provided SegmentMerger: add -normalize option

[Nutch-dev] [jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

2007-07-31 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675 ] Emmanuel Joke commented on NUTCH-530: - Actually I don't re-use CrawlDbReducer, I've defined a new class as

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v2.patch New patch provided * Add new method to CrawlDatum

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v3.patch My mistake, actually I'm not really familiar with the VERSION. I

[Nutch-dev] [jira] Updated: (NUTCH-532) CrawlDbMerger: wrong computation of last fetch time

2007-08-06 Thread Emmanuel Joke (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emmanuel Joke updated NUTCH-532: Attachment: NUTCH-532_v4.patch I updated the code following Andrzej's comments. I've also updated the