[jira] [Created] (NUTCH-1822) Page outlinks clearance is not appropriate

2014-07-22 Thread Riyaz Shaik (JIRA)
Riyaz Shaik created NUTCH-1822:
--

 Summary: Page outlinks clearance is not appropriate
 Key: NUTCH-1822
 URL: https://issues.apache.org/jira/browse/NUTCH-1822
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.1
 Environment: Nutch-2.1
Hadoop-0.20.205
HBase-0.90.6
hbase-gora-0.2.1
Reporter: Riyaz Shaik


1. When a page is re-crawled and new outlink URLs are identified along with the existing ones, the old outlinks are removed and only the new URLs are written to HBase.
Ex:
Crawl cycle 1 for www.123.com, identified outlinks are:
ol --> abc.com
ol --> pqr.com
Crawl cycle 2 of the same www.123.com, the outlinks are
(note that abc.com is removed and xyz.com is added):
ol --> pqr.com
ol --> xyz.com
At the end of crawl cycle 2, HBase has only xyz.com as an outlink:
ol --> xyz.com

Expected:
ol --> pqr.com 
ol --> xyz.com 

2. If some of the page's outlinks are removed and no new outlinks are added, then re-crawling the page does not clear the obsolete/removed outlinks from HBase.

Ex: Cycle 1 crawled page: www.test.com, identified outlinks are:
ol --> link1
ol --> link2
ol --> link3
Cycle 2, the same page (www.test.com) is re-crawled, identified outlinks are
(note: only link2 is removed, no new links are added):
ol --> link1
ol --> link3
but at the end of cycle 2, HBase still has all 3 outlinks:
ol --> link1
ol --> link2
ol --> link3

Expected:
ol --> link1
ol --> link3
Looking at the code in ParseUtil.java, it seems to remove the old links and insert only the new links:
{code}
if (page.getOutlinks() != null) { page.getOutlinks().clear(); }
{code}
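For clarity, the expected behaviour can be written down as a small, self-contained sketch (plain Java collections with hypothetical names, not the Nutch/Gora API): after each successful parse, the stored outlinks should end up being exactly the set of links found on the page in that cycle, which covers both cases above.
{code}
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only, not Nutch code. */
public class OutlinkSync {

  /** Replace the stored outlinks with the freshly parsed set (url -> anchor). */
  static Map<String, String> syncOutlinks(Map<String, String> stored,
                                          Map<String, String> freshlyParsed) {
    stored.clear();                 // drops obsolete links (case 2)
    stored.putAll(freshlyParsed);   // keeps every link still on the page (case 1)
    return stored;
  }

  public static void main(String[] args) {
    Map<String, String> stored = new HashMap<String, String>();
    stored.put("http://abc.com", "");
    stored.put("http://pqr.com", "");

    Map<String, String> cycle2 = new HashMap<String, String>();
    cycle2.put("http://pqr.com", "");
    cycle2.put("http://xyz.com", "");

    // Expected after cycle 2: pqr.com and xyz.com
    System.out.println(syncOutlinks(stored, cycle2).keySet());
  }
}
{code}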

http://lucene.472066.n3.nabble.com/Nutch-New-outlinks-removes-old-valid-outlinks-td4146676.html
Thanks
Riyaz







[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2014-06-12 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919
 ] 

Riyaz Shaik edited comment on NUTCH-1614 at 6/12/14 9:13 AM:
-

We implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

We created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}
Sample patterns to exclude from indexing (??crawl-donot-index-patterns.txt??):
{code}
/news/$
/news/latest/$
/videos/$
/music/$
/photos/$
/movies/$
/ontv/$
{code}
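The RegexUtil class itself is only attached to the issue. As a rough idea of the shape such a helper might take (a hypothetical sketch, not the attached source), it could look like:
{code}
package org.apache.nutch.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the attached RegexUtil helper. */
public class RegexUtil {

  private static final List<Pattern> PATTERNS = new ArrayList<Pattern>();

  /** Reads one regex per line; blank lines and lines starting with '#' are ignored. */
  public static void loadRegexPatterns(Reader reader) throws IOException {
    BufferedReader in = new BufferedReader(reader);
    String line;
    while ((line = in.readLine()) != null) {
      line = line.trim();
      if (line.length() == 0 || line.startsWith("#")) {
        continue;
      }
      PATTERNS.add(Pattern.compile(line));
    }
  }

  /** Returns true if the url matches any of the loaded exclude patterns. */
  public static boolean findMatch(String url) {
    for (Pattern p : PATTERNS) {
      if (p.matcher(url).find()) {
        return true;
      }
    }
    return false;
  }
}
{code}
With the sample patterns above, a URL such as http://www.example.com/news/ would be skipped at indexing time, while http://www.example.com/news/some-article.html would still be indexed.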


was (Author: riyaz):
We implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

We created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}



[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2014-06-12 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919
 ] 

Riyaz Shaik edited comment on NUTCH-1614 at 6/12/14 9:13 AM:
-

We implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

We created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}
Sample patterns to exclude from indexing (??crawl-donot-index-patterns.txt??):
{code}
/news/$
/news/latest/$
/videos/$
/music/$
/photos/$
/movies/$
/ontv/$
{code}


was (Author: riyaz):
We implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

We created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}

[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2014-06-12 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919
 ] 

Riyaz Shaik edited comment on NUTCH-1614 at 6/12/14 9:08 AM:
-

We implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

We created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}



was (Author: riyaz):
I implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

I created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}


> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> --

[jira] [Updated] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2014-06-12 Thread Riyaz Shaik (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riyaz Shaik updated NUTCH-1614:
---

Attachment: IndexerJob.java
RegexUtil.java

> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: IndexerJob.java, NUTCH-1614.patch, RegexUtil.java
>
>
> Some pages we need to crawl (such as some main pages and different views of a
> main page) to get all the other pages, but we don't want to index those pages
> themselves. Therefore we cannot use the url filter approach.
> This plugin uses a file containing regex strings (see included sample file).
> If one of the regex strings matches an entire URL, that URL will be
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>Holds the file name containing the regex strings. Any URL
>   matching one of these strings will be excluded from indexing.
>   "#" indicates a comment line and will be ignored.</description>
> </property>
> 





[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index

2014-06-12 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028919#comment-14028919
 ] 

Riyaz Shaik commented on NUTCH-1614:


I implemented a similar kind of feature for crawling our sites a year or so ago. Having come across this ticket, I thought I would share the implementation approach (it is not a plugin-based approach like the existing filters/normalizers).

I created a utility class for our customization that handles reading the different types of regex patterns (include and exclude) that Nutch supports.

(on) Nutch version: 2.1
* org.apache.nutch.util.RegexUtil (source code attached)
Added the following changes to the IndexerJob class:
* org.apache.nutch.indexer.IndexerJob (source code attached)
Code snippet:
{code}
package org.apache.nutch.indexer;

import org.apache.nutch.util.RegexUtil;
import org.apache.nutch.util.TableUtil;

public abstract class IndexerJob extends NutchTool implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(IndexerJob.class);
  public static final String INDEXING_EXCLUDE_URL_PATTERNS_FILE =
      "indexing.exclude.url.patterns.file";

  public void setup(Context context) throws IOException {
    // ...
    String regexPatternsFileName = conf.get(INDEXING_EXCLUDE_URL_PATTERNS_FILE);
    if (regexPatternsFileName != null) {
      LOG.info("Loading indexing exclude patterns from the nutch configuration:");
      RegexUtil.loadRegexPatterns(conf.getConfResourceAsReader(regexPatternsFileName));
    }
  }

  public void map(String key, WebPage page, Context context)
      throws IOException, InterruptedException {
    ParseStatus pstatus = page.getParseStatus();
    if (pstatus == null || !ParseStatusUtils.isSuccess(pstatus)
        || pstatus.getMinorCode() == ParseStatusCodes.SUCCESS_REDIRECT) {
      return; // filter urls not parsed
    }

    // Skip urls matching the indexing exclude patterns.
    String pageUrl = TableUtil.unreverseUrl(key);
    if (RegexUtil.findMatch(pageUrl)) {
      LOG.info("Skipping the url: " + pageUrl
          + " from indexing; matched the indexing exclude url patterns.");
      return;
    }
    // ...
  }
{code}
* Add the following property to *??nutch-site.xml??*:
{code}
<property>
  <name>indexing.exclude.url.patterns.file</name>
  <value>crawl-donot-index-patterns.txt</value>
</property>
{code}


> Plugin to exclude URLs matching regex list from indexing - to enable crawl 
> but do not index
> ---
>
> Key: NUTCH-1614
> URL: https://issues.apache.org/jira/browse/NUTCH-1614
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 2.2.1
>Reporter: Brian
>Priority: Minor
>  Labels: plugin
> Attachments: NUTCH-1614.patch
>
>
> Some pages we need to crawl (such as some main pages and different views of a
> main page) to get all the other pages, but we don't want to index those pages
> themselves. Therefore we cannot use the url filter approach.
> This plugin uses a file containing regex strings (see included sample file).
> If one of the regex strings matches an entire URL, that URL will be
> excluded from indexing.
> The file to use is specified by the following property in nutch-site.xml:
> <property>
>   <name>indexer.url.filter.exclude.regex.file</name>
>   <value>regex-indexer-exclude-urls.txt</value>
>   <description>Holds the file name containing the regex strings. Any URL
>   matching one of these strings will be excluded from indexing.
>   "#" indicates a comment line and will be ignored.</description>
> </property>
> 





[jira] [Updated] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-24 Thread Riyaz Shaik (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riyaz Shaik updated NUTCH-1457:
---

Attachment: NUTCH-1457(Nutch-2.2.1)-src.zip
NUTCH-1457(Nutch-2.1)-src.zip
NUTCH-1457(Nutch-2.2.1).patch
NUTCH-1457(Nutch-2.1).patch

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java, NUTCH-1457(Nutch-2.1).patch, 
> NUTCH-1457(Nutch-2.1)-src.zip, NUTCH-1457(Nutch-2.2.1).patch, 
> NUTCH-1457(Nutch-2.2.1)-src.zip
>
>




[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-24 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718439#comment-13718439
 ] 

Riyaz Shaik commented on NUTCH-1457:


Hi Ferdy/Lewis,

It seems trunk has the Nutch-1.4 version code, as per the SVN check-in logs and mail archives:

http://www.mail-archive.com/dev@nutch.apache.org/msg04348.html


I have created patches for the *Nutch-2.1* and *Nutch-2.2.1* branches.

The modified source code files are attached as a zip, along with the patches.

(on) The patch contains the following fixes in addition to NUTCH-1457:

(+) org.apache.nutch.crawl.AbstractFetchSchedule
 * Fix for resetting fetchTime to currentTime when *??fetchTime - currTime > maxInterval??*. The *"shouldFetch"* method was returning false even after setting the new fetchTime on the page, so the adjusted fetchTime never reached GeneratorReducer to be persisted in HBase.
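To illustrate the problem being fixed (a hypothetical sketch of the logic, with names taken from the description above, not the attached patch): the adjustment made inside shouldFetch only lives in the mapper's in-memory copy of the page, while the method still answers based on the old value.
{code}
// Illustrative sketch only -- not the attached patch.
public boolean shouldFetch(String url, WebPage page, long curTime) {
  long fetchTime = page.getFetchTime();
  if (fetchTime - curTime > (long) maxInterval * 1000) {
    // fetchTime has overshot maxInterval: reset it on the page object ...
    page.setFetchTime(curTime);
  }
  // ... but the decision is still based on the old fetchTime, so the page is
  // not generated and the adjusted fetchTime is never written back to HBase.
  return fetchTime <= curTime;
}
{code}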

(+) org.apache.nutch.parse.ParseUtil
 * Moved the page signature calculation code (a single line). The existing code calculated the page signature before the parsed plain text (e.g. from the HTML parser) was available, which caused the signature to be computed over the entire page content even with "org.apache.nutch.crawl.TextProfileSignature" enabled.
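The effect can be made concrete with a small, self-contained illustration (plain JDK code, not Nutch and not TextProfileSignature itself): a signature computed from the raw markup changes whenever boilerplate or scripts change, while one computed from the parsed text only changes when the visible text changes.
{code}
import java.nio.charset.Charset;
import java.security.MessageDigest;

/** Illustrative only: why the signature should be computed from parsed text. */
public class SignatureOrder {

  static String md5(String s) throws Exception {
    byte[] digest = MessageDigest.getInstance("MD5")
        .digest(s.getBytes(Charset.forName("UTF-8")));
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String rawHtml = "<html><head><script>var t = 1;</script></head>"
        + "<body>Hello</body></html>";
    String parsedText = "Hello";
    // Signing the raw content ties the signature to markup and scripts;
    // signing the parsed text ties it to the visible content only.
    System.out.println("raw    : " + md5(rawHtml));
    System.out.println("parsed : " + md5(parsedText));
  }
}
{code}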

Could you please validate the changes?

Thanks
Riyaz


> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
 ] 

Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM:
-

Hi Ferdy,

The scenario mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.*

This is because we do not put the *??GENERATE_MARK??* on URLs whose *??fetchTime > currentTime??* in GeneratorReducer, so those URLs will not be processed by the Fetcher/Parser jobs.

One drawback of this solution (the UNSCHEDULED status/mark in GeneratorMapper) could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the changes you suggested (use a SCHEDULED marker instead of the UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It is working fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy :)

  was (Author: riyaz):
Hi Ferdy,

The scenario mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.*

This is because we do not put the *??GENERATE_MARK??* on URLs whose *??fetchTime > currentTime??* in GeneratorReducer, so those URLs will not be processed by the Fetcher/Parser jobs.

One drawback of this solution could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the changes you suggested (use a SCHEDULED marker instead of the UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It is working fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy :)
  
> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-17 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510
 ] 

Riyaz Shaik commented on NUTCH-1457:


Hi Ferdy,

The scenario mentioned below will not occur:

 *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.*

This is because we do not put the *??GENERATE_MARK??* on URLs whose *??fetchTime > currentTime??* in GeneratorReducer, so those URLs will not be processed by the Fetcher/Parser jobs.

One drawback of this solution could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.

We have made the changes you suggested (use a SCHEDULED marker instead of the UNSCHEDULED status/marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It is working fine and also overcomes the drawback of our earlier solution.

Will attach the code changes.

Thanks Ferdy :)

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>




[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-10 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435
 ] 

Riyaz Shaik edited comment on NUTCH-1457 at 7/10/13 11:13 AM:
--

That logic may not work in the following scenario: (fetchTime > currentTime) may be true while the GeneratorJob is running, but it will evaluate to false while the DbUpdaterJob is running if fetching and parsing take too much time. This would again lead to the same issue. Instead, I have made the following fix locally (on 2.1) and am testing it. It seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:
{code}
  public static final byte STATUS_UNSCHEDULED = 0x20;
  NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}
2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new UNSCHEDULED status and still pass the page on to the reducer:
{code}
public void map(String reversedUrl, WebPage page,
    Context context) throws IOException, InterruptedException {
  // ...
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
          + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }

  entry.set(url, score);
  context.write(entry, page);
}
{code}
3. In GeneratorReducer.java, skip all other processing for pages with the UNSCHEDULED status and just persist them back to the web page table:
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values,
    Context context) throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }

      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }

    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}

4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is UNSCHEDULED; only call a regular forceRefetch when needed:
{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values,
    Context context) throws IOException, InterruptedException {
  // ...
  byte status = (byte) page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED:  // not scheduled for generate, due to fetchTime > currentTime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED:      // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP:   // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED:  // successful fetch, not modified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
    }
    // ...
  }
}
{code}



  was (Author: riyaz):
That logic may not work in the following scenario: (fetchTime > currentTime) may be true while the GeneratorJob is running, but it will evaluate to false while the DbUpdaterJob is running if fetching and parsing take too much time. This would again lead to the same issue. Instead, I have made the following fix locally and am testing it. It seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:
{code}
  public static final byte STATUS_UNSCHEDULED = 0x20;
  NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}
2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new UNSCHEDULED status and still pass the page on to the reducer:
{code}
public void map(String reversedUrl, WebPage page,
    Context context)
{code}

[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-10 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704435#comment-13704435
 ] 

Riyaz Shaik commented on NUTCH-1457:


That logic may not work in the following scenario: (fetchTime > currentTime) may be true while the GeneratorJob is running, but it will evaluate to false while the DbUpdaterJob is running if fetching and parsing take too much time. This would again lead to the same issue. Instead, I have made the following fix locally and am testing it. It seems to be working fine. It would be great if someone could validate this fix.

1. Introduced a new crawl status in CrawlStatus.java:
{code}
  public static final byte STATUS_UNSCHEDULED = 0x20;
  NAMES.put(STATUS_UNSCHEDULED, "status_unscheduled");
{code}
2. In GeneratorMapper.java, if shouldFetch returns false, set the page status to the new UNSCHEDULED status and still pass the page on to the reducer:
{code}
public void map(String reversedUrl, WebPage page,
    Context context) throws IOException, InterruptedException {
  // ...
  // check fetch schedule
  boolean shouldFetch = schedule.shouldFetch(url, page, curTime);
  float score = page.getScore();
  if (!shouldFetch) {
    page.setStatus(CrawlStatus.STATUS_UNSCHEDULED);
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("-shouldFetch rejected '" + url + "', fetchTime="
          + page.getFetchTime() + ", curTime=" + curTime);
    }
  } else {
    try {
      score = scoringFilters.generatorSortValue(url, page, score);
    } catch (ScoringFilterException e) {
      // ignore
    }
  }

  entry.set(url, score);
  context.write(entry, page);
}
{code}
3. In GeneratorReducer.java, skip all other processing for pages with the UNSCHEDULED status and just persist them back to the web page table:
{code}
protected void reduce(SelectorEntry key, Iterable<WebPage> values,
    Context context) throws IOException, InterruptedException {
  for (WebPage page : values) {
    if (count >= limit) {
      return;
    }
    if (page.getStatus() == CrawlStatus.STATUS_UNSCHEDULED) {
      writeOutput(context, key.url, page);
      continue;
    }
    if (maxCount > 0) {
      String hostordomain;
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(key.url);
      } else {
        hostordomain = URLUtil.getHost(key.url);
      }

      Integer hostCount = hostCountMap.get(hostordomain);
      if (hostCount == null) {
        hostCountMap.put(hostordomain, 0);
        hostCount = 0;
      }
      if (hostCount >= maxCount) {
        return;
      }
      hostCountMap.put(hostordomain, hostCount + 1);
    }

    Mark.GENERATE_MARK.putMark(page, batchId);
    if (!writeOutput(context, key.url, page)) {
      context.getCounter("Generator", "MALFORMED_URL").increment(1);
      continue;
    }
    context.getCounter("Generator", "GENERATE_MARK").increment(1);
    count++;
  }
}
{code}

4. In DbUpdateReducer.java, do not call setFetchSchedule if the status is UNSCHEDULED; only call a regular forceRefetch when needed:
{code}
protected void reduce(UrlWithScore key, Iterable<NutchWritable> values,
    Context context) throws IOException, InterruptedException {
  // ...
  byte status = (byte) page.getStatus();
  switch (status) {
  case CrawlStatus.STATUS_UNSCHEDULED:  // not scheduled for generate, due to fetchTime > currentTime
    if (maxInterval < page.getFetchInterval())
      schedule.forceRefetch(url, page, false);
    break;
  case CrawlStatus.STATUS_FETCHED:      // successful fetch
  case CrawlStatus.STATUS_REDIR_TEMP:   // successful fetch, redirected
  case CrawlStatus.STATUS_REDIR_PERM:
  case CrawlStatus.STATUS_NOTMODIFIED:  // successful fetch, not modified
    int modified = FetchSchedule.STATUS_UNKNOWN;
    if (status == CrawlStatus.STATUS_NOTMODIFIED) {
      modified = FetchSchedule.STATUS_NOTMODIFIED;
    }
    // ...
  }
}
{code}



> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
>




[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-07-05 Thread Riyaz Shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700965#comment-13700965
 ] 

Riyaz Shaik commented on NUTCH-1457:


Hi,

Is it possible to have simple logic like this: if (fetchTime > currentTime), don't set/modify the FetchSchedule in DbUpdateReducer?
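A rough sketch of what that guard might look like in DbUpdateReducer (hypothetical, variable names assumed, not actual Nutch code):
{code}
// Illustrative only: skip rescheduling pages that are not yet due.
if (page.getFetchTime() > System.currentTimeMillis()) {
  // fetchTime is still in the future -- leave the existing fetch schedule untouched
} else {
  schedule.setFetchSchedule(url, page, prevFetchTime, prevModifiedTime,
      fetchTime, modifiedTime, modified);
}
{code}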

Thanks
Riyaz

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4
>
>

