Avoid parsing unnecessary links and get a more relevant outlink list
Key: NUTCH-488
URL: https://issues.apache.org/jira/browse/NUTCH-488
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-488:
Attachment: DOMContentUtils.patch
Avoid parsing unnecessary links and get a more relevant outlink
[
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-488:
Attachment: nutch-default.xml.patch
Avoid parsing unnecessary links and get a more relevant
URLFilter-suffix management of the url path when the url contains some query
parameters
---
Key: NUTCH-489
URL: https://issues.apache.org/jira/browse/NUTCH-489
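The NUTCH-489 title suggests matching the suffix against the URL path rather than against the full URL with its query string. A minimal sketch of that idea follows, assuming a hypothetical helper class named PathSuffixCheck; it is not the code from the attached patches, and the case-insensitive comparison is an assumption.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not taken from SuffixURLFilter.
class PathSuffixCheck {

    /** Returns true when the URL path (query string excluded) ends with the suffix. */
    static boolean pathEndsWith(String url, String suffix) {
        try {
            // URI.getPath() drops the "?key=value" part, so only the path is compared.
            String path = new URI(url).getPath();
            if (path == null) {
                return false; // opaque URIs have no path
            }
            return path.toLowerCase().endsWith(suffix.toLowerCase());
        } catch (URISyntaxException e) {
            return false; // assumption: unparsable URLs never match
        }
    }
}
```

With this check, a URL such as http://example.com/img/photo.jpg?size=large matches the suffix .jpg, while http://example.com/page.html?file=photo.jpg does not.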
[
https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-489:
Attachment: SuffixURLFilter.java.patch
suffix-urlfilter.txt.patch
URLFilter-suffix
[
https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-489:
Attachment: SuffixURLFilter_v2.java.patch
My mistake...
I've added a new patch which is supposed
Add hadoop masters configuration file into conf folder
--
Key: NUTCH-500
URL: https://issues.apache.org/jira/browse/NUTCH-500
Project: Nutch
Issue Type: Improvement
Components:
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922
]
Emmanuel Joke commented on NUTCH-503:
-
I just tried your patch and I'm afraid I still have the same issue.
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469
]
Emmanuel Joke commented on NUTCH-503:
-
Sorry, my mistake.
My compiled jar was not correctly included in my
[
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039
]
Emmanuel Joke commented on NUTCH-503:
-
Results seem to be good. So I'm wondering if it is possible to commit this
lib-lucene-analyzers jar definition is wrong in plugin.xml
-
Key: NUTCH-507
URL: https://issues.apache.org/jira/browse/NUTCH-507
Project: Nutch
Issue Type: Bug
Environment:
Update CrawlDb: avoid starting a job if there is no valid segment
-
Key: NUTCH-509
URL: https://issues.apache.org/jira/browse/NUTCH-509
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke closed NUTCH-509.
---
Resolution: Won't Fix
As explained by Doğacan, the CrawlDb update behaves correctly. This patch is
Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE
Key: NUTCH-516
URL: https://issues.apache.org/jira/browse/NUTCH-516
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-516:
Attachment: NUTCH-516.patch
I fixed the issue by changing the FetchTime in
Use URLValidator in the Injector
Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522.patch
Patch provided
Use URLValidator in the Injector
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v2.patch
Oops, my mistake. Please find an updated patch.
Actually I've a
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514153
]
Emmanuel Joke commented on NUTCH-522:
-
Actually I tried to fetch the url
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-526:
Attachment: NUTCH-526.patch
patch provided
Use a combiner in LinkDbMerger to improve the
Use a combiner in LinkDbMerger to improve the performance as in LinkDb
-
Key: NUTCH-526
URL: https://issues.apache.org/jira/browse/NUTCH-526
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-528:
Attachment: NUTCH-528.patch
patch attached
CrawlDbReader: add some new stats + dump into a csv
CrawlDbReader: add some new stats + dump into a csv format
--
Key: NUTCH-528
URL: https://issues.apache.org/jira/browse/NUTCH-528
Project: Nutch
Issue Type: Improvement
NodeWalker.skipChildren doesn't work for more than 1 child.
-
Key: NUTCH-529
URL: https://issues.apache.org/jira/browse/NUTCH-529
Project: Nutch
Issue Type: Bug
Reporter:
[
https://issues.apache.org/jira/browse/NUTCH-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-529:
Attachment: NUTCH-529.patch
patch attached
NodeWalker.skipChildren doesn't work for more than 1
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v3.patch
Use URLValidator in the Injector
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: NUTCH-522_v3.patch
commons-validator's UrlValidator does not filter URLs with spaces.
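One way to cover the gap described above is a separate whitespace check run before the validator. The sketch below assumes a hypothetical helper named UrlSpaceCheck; it is not the approach taken in NUTCH-522_v3.patch, whose contents are not shown here.

```java
// Hypothetical pre-check for illustration only; not part of commons-validator or Nutch.
class UrlSpaceCheck {

    /** Returns true when the URL string contains any whitespace character. */
    static boolean containsWhitespace(String url) {
        for (int i = 0; i < url.length(); i++) {
            if (Character.isWhitespace(url.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```

A caller could reject any URL for which containsWhitespace returns true and pass the rest on to the validator.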
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-522:
Attachment: (was: NUTCH-522_v3.patch)
Use URLValidator in the Injector
[
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516138
]
Emmanuel Joke commented on NUTCH-522:
-
I tried with protocol-http and protocol-httpclient, I got the same error
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516209
]
Emmanuel Joke commented on NUTCH-526:
-
Actually I ran a simple test on 2 small LinkDbs, and I didn't see any
Add a combiner to improve performance on updatedb
-
Key: NUTCH-530
URL: https://issues.apache.org/jira/browse/NUTCH-530
Project: Nutch
Issue Type: Improvement
Environment: java 1.6
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-530:
Attachment: NUTCH-530.patch
Patch provided.
It reduced the process time by 20%.
Output from the
CrawlDbMerger: wrong computation of last fetch time
---
Key: NUTCH-532
URL: https://issues.apache.org/jira/browse/NUTCH-532
Project: Nutch
Issue Type: Bug
Reporter: Emmanuel Joke
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532.patch
Patch provided.
CrawlDbMerger: wrong computation of last fetch time
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: (was: NUTCH-532.patch)
CrawlDbMerger: wrong computation of last fetch time
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532.patch
CrawlDbMerger: wrong computation of last fetch time
LinkDbMerger: url normalized is not updated in the key and inlinks list
---
Key: NUTCH-533
URL: https://issues.apache.org/jira/browse/NUTCH-533
Project: Nutch
Issue Type:
[
https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-533:
Attachment: NUTCH-533.patch
Patch provided
LinkDbMerger: url normalized is not updated in the key
[
https://issues.apache.org/jira/browse/NUTCH-526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516358
]
Emmanuel Joke commented on NUTCH-526:
-
Could you please wait a few more days?
I would like to wait for a
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602
]
Emmanuel Joke commented on NUTCH-530:
-
I'm sure to follow your point regarding the outlinks number.
I don't
[
https://issues.apache.org/jira/browse/NUTCH-533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-533:
Attachment: NUTCH-533.patch
Patch with typo fixed.
LinkDbMerger: url normalized is not updated in
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516618
]
Emmanuel Joke commented on NUTCH-532:
-
res.getFetchTime() - Math.round(res.getFetchInterval() * 1000d); always
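The expression quoted above subtracts the fetch interval from the fetch time. The sketch below shows that arithmetic in isolation, assuming the interval is held in seconds (inferred from the 1000d factor) and the fetch time in milliseconds; the class and parameter names are hypothetical and this is not the CrawlDbMerger code.

```java
// Hypothetical stand-alone illustration of the quoted arithmetic.
class LastFetchTime {

    /**
     * Derives a last-fetch timestamp from the next scheduled fetch time.
     * Assumes nextFetchTimeMillis is in milliseconds and the interval in seconds.
     */
    static long lastFetchTimeMillis(long nextFetchTimeMillis, int fetchIntervalSeconds) {
        // Convert the interval to milliseconds, then step back from the next fetch time.
        return nextFetchTimeMillis - Math.round(fetchIntervalSeconds * 1000d);
    }
}
```

For a next fetch time of 10,000,000 ms and an interval of 3600 seconds, the result is 6,400,000 ms.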
[
https://issues.apache.org/jira/browse/NUTCH-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-534:
Attachment: NUTCH-534.patch
Patch provided
SegmentMerger: add -normalize option
[
https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675
]
Emmanuel Joke commented on NUTCH-530:
-
Actually I don't re-use CrawlDbReducer, I've defined a new class as
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v2.patch
New patch provided
* Add new method to CrawlDatum
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v3.patch
My mistake, actually I'm not really familiar with the VERSION.
I
[
https://issues.apache.org/jira/browse/NUTCH-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Emmanuel Joke updated NUTCH-532:
Attachment: NUTCH-532_v4.patch
I updated the code following Andrzej's comments. I've also updated the