[jira] [Created] (NUTCH-1822) Page outlinks clearance is not appropriate
Riyaz Shaik created NUTCH-1822:
--

Summary: Page outlinks clearance is not appropriate
Key: NUTCH-1822
URL: https://issues.apache.org/jira/browse/NUTCH-1822
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 2.1
Environment: Nutch-2.1 Hadoop-0.20.205 HBase-0.90.6 hbase-gora-0.2.1
Reporter: Riyaz Shaik

1. When a page is re-crawled and new outlink URLs are identified alongside the existing ones, the old outlinks are removed and only the new URLs are written to HBase.

Example: crawl cycle 1 for www.123.com identifies the outlinks
ol --> abc.com
ol --> pqr.com
Crawl cycle 2 of the same www.123.com identifies the outlinks (note that abc.com is gone and xyz.com is added)
ol --> pqr.com
ol --> xyz.com
At the end of crawl cycle 2, HBase holds only xyz.com as an outlink:
ol --> xyz.com
Expected:
ol --> pqr.com
ol --> xyz.com

2. If some outlinks of a page are removed and no new outlinks are added, re-crawling the page does not clear the obsolete/removed outlinks from HBase.

Example: cycle 1 crawls www.test.com and identifies the outlinks
ol --> link1
ol --> link2
ol --> link3
Cycle 2 re-crawls the same page (www.test.com) and identifies (note: only link2 was removed, no new links added)
ol --> link1
ol --> link3
but at the end of cycle 2 HBase still holds all three outlinks:
ol --> link1
ol --> link2
ol --> link3
Expected:
ol --> link1
ol --> link3

Per the code in ParseUtil.java, it seems to remove the old links and insert only the new ones:
{code}
if (page.getOutlinks() != null) {
  page.getOutlinks().clear();
}
{code}
http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html

Thanks
Riyaz

--
This message was sent by Atlassian JIRA (v6.2#6252)
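In both scenarios the expected result is that the stored outlinks end up as exactly the set found in the latest parse: links no longer present disappear, and links present in both crawls survive. A minimal, storage-agnostic sketch of that replace semantics, with plain Java maps standing in for the WebPage outlinks map (the Gora/HBase persistence layer, where the reported bug appears to lose data, is outside this sketch):

```java
import java.util.HashMap;
import java.util.Map;

class OutlinkUpdate {
    // Replace the stored outlinks with exactly the set found in the
    // latest parse: obsolete links disappear, retained links survive.
    static Map<String, String> replaceOutlinks(Map<String, String> stored,
                                               Map<String, String> parsed) {
        stored.clear();
        stored.putAll(parsed);
        return stored;
    }
}
```

Clearing and repopulating in memory is only correct if the storage layer then persists both the deletes and the puts; the behavior reported above suggests one side of that write is being dropped.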
[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919 ]

Riyaz Shaik edited comment on NUTCH-1614 at 6/12/14 9:13 AM:
-

We implemented a similar feature for crawling our sites about a year ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin approach like the existing filters/normalizers). We created a util class to handle reading different types of regex patterns, both include and exclude, as Nutch supports.

Nutch version: 2.1

* org.apache.nutch.util.RegexUtil (source code attached)

Added the following changes to the IndexerJob class:

* org.apache.nutch.indexer.IndexerJob (source code attached)

code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);

  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configurations:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }
    /*
     * Skip the matched url patterns from indexing.
     */
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url : " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    ...
  }
{code}

* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}

Sample patterns to exclude from indexing (crawl-donot-index-patterns.txt):
{code}
/news/$
/news/latest/$
/videos/$
/music/$
/photos/$
/movies/$
/ontv/$
{code}
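The RegexUtil source is attached to the ticket rather than quoted above. Purely as an illustration of the two calls the snippet relies on (loadRegexPatterns and findMatch), a minimal self-contained sketch might look like the following; the one-pattern-per-line format and "#" comment handling are assumptions here, not the attached implementation:

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexUtil {
    private static final List<Pattern> PATTERNS = new ArrayList<>();

    // Read one regex per line; '#' lines are comments, blanks are skipped.
    static void loadRegexPatterns(Reader reader) {
        new BufferedReader(reader).lines()
            .map(String::trim)
            .filter(line -> !line.isEmpty() && !line.startsWith("#"))
            .map(Pattern::compile)
            .forEach(PATTERNS::add);
    }

    // True if any loaded pattern is found within the URL.
    static boolean findMatch(String url) {
        for (Pattern p : PATTERNS) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With patterns such as /news/$, find() locates the pattern anywhere in the URL, so the trailing $ is what anchors the match to the end of the URL.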
[jira] [Updated] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Riyaz Shaik updated NUTCH-1614:
---
Attachment: IndexerJob.java
            RegexUtil.java

> Plugin to exclude URLs matching regex list from indexing - to enable crawl
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 2.2.1
> Reporter: Brian
> Priority: Minor
> Labels: plugin
> Attachments: IndexerJob.java, NUTCH-1614.patch, RegexUtil.java
>
> Some pages we need to crawl (such as some main pages and different views of a
> main page) to get all the other pages, but we don't want to index those pages
> themselves. Therefore we cannot use the url filter approach.
> This plugin uses a file containing regex strings (see included sample file).
> If one of the regex strings matches an entire URL, that URL will be
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>Holds the file name containing the regex strings. Any URL
>   matching one of these strings will be excluded from indexing.
>   "#" indicates a comment line and will be ignored.</description>
> </property>

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919 ]

Riyaz Shaik commented on NUTCH-1614:

We implemented a similar feature for crawling our sites about a year ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin approach like the existing filters/normalizers). We created a util class to handle reading different types of regex patterns, both include and exclude, as Nutch supports.

Nutch version: 2.1

* org.apache.nutch.util.RegexUtil (source code attached)

Added the following changes to the IndexerJob class:

* org.apache.nutch.indexer.IndexerJob (source code attached)

code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);

  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configurations:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }
    /*
     * Skip the matched url patterns from indexing.
     */
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url : " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    ...
  }
{code}

* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}

> Plugin to exclude URLs matching regex list from indexing - to enable crawl
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 2.2.1
> Reporter: Brian
> Priority: Minor
> Labels: plugin
> Attachments: NUTCH-1614.patch
>
> Some pages we need to crawl (such as some main pages and different views of a
> main page) to get all the other pages, but we don't want to index those pages
> themselves. Therefore we cannot use the url filter approach.
> This plugin uses a file containing regex strings (see included sample file).
> If one of the regex strings matches an entire URL, that URL will be
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>Holds the file name containing the regex strings. Any URL
>   matching one of these strings will be excluded from indexing.
>   "#" indicates a comment line and will be ignored.</description>
> </property>

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Riyaz Shaik updated NUTCH-1457: --- Attachment: NUTCH-1457(Nutch-2.2.1)-src.zip NUTCH-1457(Nutch-2.1)-src.zip NUTCH-1457(Nutch-2.2.1).patch NUTCH-1457(Nutch-2.1).patch > Nutch2 Refactor the update process so that fetched items are only processed > once > > > Key: NUTCH-1457 > URL: https://issues.apache.org/jira/browse/NUTCH-1457 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: 2.4 > > Attachments: CrawlStatus.java, DbUpdateReducer.java, > GeneratorMapper.java, GeneratorReducer.java, NUTCH-1457(Nutch-2.1).patch, > NUTCH-1457(Nutch-2.1)-src.zip, NUTCH-1457(Nutch-2.2.1).patch, > NUTCH-1457(Nutch-2.2.1)-src.zip > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718439#comment-13718439 ]

Riyaz Shaik commented on NUTCH-1457:

Hi Ferdy/Lewis,

It seems trunk has the Nutch-1.4 version code, per the SVN check-in logs and mail archives: http://www.mail-archive.com/dev@nutch.apache.org/msg04348.html

I have created patches for the branches *Nutch-2.1* and *Nutch-2.2.1*, and attached the modified source files as a zip along with the patches.

The patch contains the following fixes in addition to NUTCH-1457:

(+) org.apache.nutch.crawl.AbstractFetchSchedule
* Fix for resetting fetchTime to currentTime if *??fetchTime - currTime > maxInterval??*. The *"shouldFetch"* method returns false even after setting the new fetchTime on the page, so the new fetchTime is not available to GeneratorReducer to persist the change in HBase.

(+) org.apache.nutch.parse.ParseUtil
* Moved the page signature calculation code (a single line). The existing code calculates the page signature without the parsed plain text (e.g. from HTMLParser), which causes the signature to be computed over the entire page content even after enabling "org.apache.nutch.crawl.TextProfileSignature".

Can you please validate the changes?

Thanks
Riyaz

> Nutch2 Refactor the update process so that fetched items are only processed
> once
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
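To make the AbstractFetchSchedule fix above concrete: the idea is that when the stored fetchTime has drifted more than maxInterval past the current time, it should be reset so the page becomes eligible again, and that reset must be persisted even though shouldFetch returns false for the current cycle. A self-contained sketch of just the clamping arithmetic (the method and parameter names are illustrative; the real code operates on a WebPage inside shouldFetch):

```java
class FetchScheduleSketch {
    // If the stored fetchTime has drifted more than maxInterval past
    // "now", reset it so the page becomes eligible again. The caller
    // must persist the updated time even though shouldFetch returns
    // false this cycle; the reported bug was that this reset was lost.
    static long clampFetchTime(long fetchTime, long curTime, long maxIntervalMs) {
        if (fetchTime - curTime > maxIntervalMs) {
            return curTime; // re-schedule from now
        }
        return fetchTime;
    }
}
```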
[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510 ]

Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM:
-

Hi Ferdy,

The below mentioned scenario will not occur:

*although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.*

We do not put the *??GENERATE_MARK??* on URLs whose *??fetchtime > currentTime??* in GeneratorReducer, so those URLs are not processed by the Fetcher/Parser jobs.

One drawback of this solution (UNSCHEDULED status/mark in GeneratorMapper) could be that we update a few columns of all URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the change you suggested (use a SCHEDULED marker instead of an UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It is working fine and also overcomes the drawback of our earlier solution. Will attach the code changes.

Thanks Ferdy.. :)

> Nutch2 Refactor the update process so that fetched items are only processed
> once
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510 ]

Riyaz Shaik commented on NUTCH-1457:

Hi Ferdy,

The below mentioned scenario will not occur:

*although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.*

We do not put the *??GENERATE_MARK??* on URLs whose *??fetchtime > currentTime??* in GeneratorReducer, so those URLs are not processed by the Fetcher/Parser jobs.

One drawback of this solution could be that we update a few columns of all URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the change you suggested (use a SCHEDULED marker instead of an UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It is working fine and also overcomes the drawback of our earlier solution. Will attach the code changes.

Thanks Ferdy.. :)

> Nutch2 Refactor the update process so that fetched items are only processed
> once
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
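The SCHEDULED-marker approach discussed above can be pictured as simple map-based gating: the reducer stamps a batch marker on pages it schedules, and downstream jobs process only pages carrying the current batch's marker. A hypothetical sketch using a plain map (the marker key name here is illustrative; in Nutch 2.x the Mark enum manages the real marker keys on the WebPage):

```java
import java.util.HashMap;
import java.util.Map;

class MarkerSketch {
    // Illustrative marker key; not Nutch's actual key constant.
    static final String GENERATE_MARK = "GENERATE_MARK";

    // The reducer stamps scheduled pages with the current batch id.
    static void markScheduled(Map<String, String> markers, String batchId) {
        markers.put(GENERATE_MARK, batchId);
    }

    // Downstream jobs only process pages carrying this batch's marker.
    static boolean shouldProcess(Map<String, String> markers, String batchId) {
        return batchId.equals(markers.get(GENERATE_MARK));
    }
}
```

The design point in the comment above is that gating on a positive marker avoids having to write an UNSCHEDULED status back for every skipped URL.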
[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435 ]

Riyaz Shaik edited comment on NUTCH-1457 at 7/10/13 11:13 AM:
--

That logic may not work in the following scenario: (fetchtime > currentTime) may be true when the GeneratorJob is running, but it will return false when the DbUpdaterJob is running if fetch & parse take too much time. This again leads to the same issue.

Instead, I have made the following fix locally (on 2.1) and am testing it. It seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:
{code}
public static final byte STATUS_UNSCHEDULED = 0x20;

NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}

2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new UNSCHEDULED status and pass the page through to the reducer:
{code}
public void map(String reversedUrl, WebPage page, Context context)
    throws IOException, InterruptedException {
  ...
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url
          + "', fetchTime=" + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }
  entry.set(url, score);
  context.write(entry, page);
}
{code}

3. In GeneratorReducer.java, skip all other processing for UNSCHEDULED pages and persist the data to the webpage:
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values, Context context)
    throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }
      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }
    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}

4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is UNSCHEDULED; call a regular forceRefetch instead:
{code}
protected void reduce(UrlWithScore key, Iterable values, Context context)
    throws IOException, InterruptedException {
  ...
  byte status = (byte) page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED:
    // not scheduled for generate, due to fetchtime > currenttime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED:     // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP:  // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED: // successful fetch, not modified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
    }
    ...
{code}
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435 ]

Riyaz Shaik commented on NUTCH-1457:

That logic may not work in the following scenario. (fetchTime > currentTime) may be true while the GeneratorJob is running, but it can become false by the time the DbUpdaterJob runs, if fetch and parse take too long. That would lead to the same issue again. Instead, I have made the following fix locally and am testing it; it seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:

{code}
public static final byte STATUS_UNSCHEDULED = 0x20;
NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}

2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new UNSCHEDULED status and still pass the page on to the reducer:

{code}
public void map(String reversedUrl, WebPage page, Context context)
    throws IOException, InterruptedException {
  ….
  ..
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
          + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }
  entry.set(url, score);
  context.write(entry, page);
}
{code}

3. In GeneratorReducer.java, skip all other processing for pages with status UNSCHEDULED and persist them back to the WebPage store.
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values, Context context)
    throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }
      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }
    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}

4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is UNSCHEDULED; call a regular forceRefetch instead.

{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values, Context context)
    throws IOException, InterruptedException {
  ….
  ..
  byte status = (byte) page.getStatus();
  switch (status) {
    case CrawlStatus.STATUS_UNSCHEDULED: // not scheduled for generate, because fetchTime > currentTime
      if (maxInterval < page.getFetchInterval())
        schedule.forceRefetch(url, page, false);
      break;
    case CrawlStatus.STATUS_FETCHED:     // successful fetch
    case CrawlStatus.STATUS_REDIR_TEMP:  // successful fetch, redirected
    case CrawlStatus.STATUS_REDIR_PERM:
    case CrawlStatus.STATUS_NOTMODIFIED: // successful fetch, not modified
      int modified = FetchSchedule.STATUS_UNKNOWN;
      if (status == CrawlStatus.STATUS_NOTMODIFIED) {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      }
      …
}
{code}

> Nutch2 Refactor the update process so that fetched items are only processed once
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
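To make the race described above concrete, here is a minimal, self-contained model (hypothetical classes, not actual Nutch code) of why re-deriving the scheduling decision from (fetchTime > currentTime) inside the update job is unreliable, while the explicit UNSCHEDULED status recorded at generate time survives the delay:

```java
// Hypothetical, simplified model -- not actual Nutch classes.
// It demonstrates why checking (fetchTime > currentTime) again in the
// update job can give a different answer than it did at generate time,
// whereas an explicit status byte written at generate time cannot.
public class UnscheduledStatusSketch {

    // Simplified stand-in for a WebPage row.
    static class Page {
        static final byte STATUS_UNSCHEDULED = 0x20;
        long fetchTime; // time the page is next due for fetch
        byte status;    // explicit crawl status
    }

    // Generate phase: decide whether the page is due and, with the fix,
    // record the decision explicitly on the page.
    static void generate(Page page, long curTimeAtGenerate) {
        boolean shouldFetch = page.fetchTime <= curTimeAtGenerate;
        if (!shouldFetch) {
            page.status = Page.STATUS_UNSCHEDULED; // persist the decision
        }
    }

    // Naive update-phase check: re-derives the decision from the clock.
    static boolean naiveLooksUnscheduled(Page page, long curTimeAtUpdate) {
        return page.fetchTime > curTimeAtUpdate;
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.fetchTime = 1_000L;            // due at t=1000

        generate(page, 500L);               // generate runs at t=500: not due yet

        // Fetch + parse take a long time; update runs at t=2000.
        long curTimeAtUpdate = 2_000L;

        // The clock-based check now gives the wrong answer ...
        System.out.println(naiveLooksUnscheduled(page, curTimeAtUpdate)); // false
        // ... but the status written at generate time is still correct.
        System.out.println(page.status == Page.STATUS_UNSCHEDULED);       // true
    }
}
```

The time values and class names are illustrative only; the point is that the status byte carries the generate-time decision forward regardless of how long fetch and parse take.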
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700965#comment-13700965 ]

Riyaz Shaik commented on NUTCH-1457:

Hi, would it be possible to use a simple check like (fetchTime > currentTime) to avoid setting/modifying the FetchSchedule in DbUpdateReducer?

Thanks
Riyaz
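As an aside, the GeneratorReducer snippet quoted in the comments above also enforces a per-host (or per-domain) generate cap via hostCountMap. A standalone sketch of that counting idea, simplified for illustration: it skips individual URLs with continue (the real reducer returns, since all values for a key arrive together), and hostOf is a crude hypothetical stand-in for URLUtil.getHost / URLUtil.getDomainName:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostCapSketch {

    // Emit URLs in order, but at most maxCount per host.
    static List<String> capPerHost(List<String> urls, int maxCount) {
        Map<String, Integer> hostCount = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            String host = hostOf(url);
            int c = hostCount.getOrDefault(host, 0);
            if (c >= maxCount) {
                continue; // host already at its cap; skip this URL
            }
            hostCount.put(host, c + 1);
            out.add(url);
        }
        return out;
    }

    // Crude host extraction for the sketch; Nutch uses URLUtil instead.
    static String hostOf(String url) {
        String s = url.replaceFirst("^https?://", "");
        int slash = s.indexOf('/');
        return slash < 0 ? s : s.substring(0, slash);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.com/1", "http://a.com/2", "http://a.com/3",
                "http://b.com/1");
        System.out.println(capPerHost(urls, 2));
        // [http://a.com/1, http://a.com/2, http://b.com/1]
    }
}
```

Capping per host at generate time is what keeps a single large site from monopolizing a fetch batch; the byDomain flag in the real reducer just swaps the grouping key from host to domain.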