[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-09 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1679:
-
Attachment: NUTCH-1679_3.patch

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch, NUTCH-1679_3.patch
>
>
> The problem occurs with the HBase store; I am not sure about other stores.
> Suppose that in the first crawl cycle we crawl link A and get an outlink B.
> In the second cycle we crawl link B, which also has a link pointing to A.
> In the second updatedb we load only page B from the store, and A is added as a
> new link because updatedb doesn't know that A already exists in the store; the
> new entry overrides A.
> To avoid this, UpdateDb must be run without batchId, or we must set
> additionsAllowed=false.
> Here is the code for a new page:
>   page = new WebPage();
>   schedule.initializeSchedule(url, page);
>   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
>   try {
>     scoringFilters.initialScore(url, page);
>   } catch (ScoringFilterException e) {
>     page.setScore(0.0f);
>   }
> The new page will override the old page's status, score, fetchTime,
> fetchInterval, retries, and metadata[CASH_KEY].
> - I think we can change something here so that a new page only updates one
>   column, for example 'link'; if it is really a new page, we can initialize
>   all of the above fields in the generator.
> - Or we add a checkAndPut operation to the store, so that when adding a new
>   page we first check whether it already exists.
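
The checkAndPut idea in the last bullet can be illustrated with a minimal
sketch. It assumes an in-memory map in place of the HBase-backed store; the
Row class and the blindPut and putIfNew helpers are hypothetical names used
only for illustration, not Nutch or Gora APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class CheckAndPutSketch {

  /** Hypothetical stand-in for a stored page row; not Nutch's WebPage. */
  public static final class Row {
    public final String status;
    public final float score;

    public Row(String status, float score) {
      this.status = status;
      this.score = score;
    }
  }

  /** Unconditional write: an existing row for the URL is replaced. */
  public static void blindPut(Map<String, Row> store, String url, Row row) {
    store.put(url, row);
  }

  /** Conditional write: stores the row only if the URL has no row yet. */
  public static boolean putIfNew(Map<String, Row> store, String url, Row row) {
    return store.putIfAbsent(url, row) == null;
  }

  public static void main(String[] args) {
    Map<String, Row> store = new HashMap<>();
    store.put("A", new Row("fetched", 1.0f)); // A was crawled in cycle 1

    // Cycle 2: B links back to A, and A is treated as a brand-new page.
    Row fresh = new Row("unfetched", 0.0f);
    boolean written = putIfNew(store, "A", fresh);
    System.out.println("written=" + written + ", status=" + store.get("A").status);
  }
}
```

With blindPut the crawled row for A would be replaced by the fresh
"unfetched" row, which is the override described in the issue; putIfNew
leaves the existing row untouched.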



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-09 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14738216#comment-14738216
 ] 

Alexander Kingson commented on NUTCH-1679:
--

Attaching a patch, which was tested by crawling to depth 2 with HBase as the
datastore.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>





[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-31 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723886#comment-14723886
 ] 

Alexander Kingson commented on NUTCH-1679:
--

I may have a patch in 1-2 weeks.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>





[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-28 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720880#comment-14720880
 ] 

Alexander Kingson commented on NUTCH-1679:
--

I took a look at the 1.x update reducer. I see that it checks whether the URL
is already in the db and whether it was fetched.
If it is in the db, its old metadata is copied and then overlaid with the new
metadata.

if (oldSet) {
  // copy metadata from old, if exists
  if (old.getMetaData().size() > 0) {
    result.putAllMetaData(old);
    // overlay with new, if any
    if (fetch.getMetaData().size() > 0)
      result.putAllMetaData(fetch);
  }
  // set the most recent valid value of modifiedTime
  if (old.getModifiedTime() > 0 && fetch.getModifiedTime() == 0) {
    result.setModifiedTime(old.getModifiedTime());
  }
}

oldSet means it is in the db already.

The same logic can be added to 2.x.
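
As a rough illustration of that overlay order, here is a minimal sketch. It
uses plain maps and longs instead of the Nutch classes; merge and
mergeModifiedTime are hypothetical helper names, and the sketch is not the
1.x reducer itself.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataOverlaySketch {

  /**
   * Copies the stored metadata first and then overlays the freshly fetched
   * metadata, so new values win on conflicts and old-only keys survive.
   */
  public static Map<String, String> merge(Map<String, String> old,
                                          Map<String, String> fetch) {
    Map<String, String> result = new HashMap<>();
    if (old != null) {
      result.putAll(old);
    }
    if (fetch != null) {
      result.putAll(fetch);
    }
    return result;
  }

  /** Keeps the old modified time when the new record carries none (0). */
  public static long mergeModifiedTime(long oldTime, long fetchTime) {
    return (oldTime > 0 && fetchTime == 0) ? oldTime : fetchTime;
  }

  public static void main(String[] args) {
    Map<String, String> old = new HashMap<>();
    old.put("lang", "en");
    old.put("sig", "abc");
    Map<String, String> fetch = new HashMap<>();
    fetch.put("sig", "def");

    System.out.println(merge(old, fetch));
    System.out.println(mergeModifiedTime(1440000000000L, 0L));
  }
}
```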

Thanks.
Alex.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>





[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-25 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711774#comment-14711774
 ] 

Alexander Kingson commented on NUTCH-1679:
--

Hi,

It seems to me that in this case full inlink and outlink data will be lost. 
Have you tested it?

As I noted on a related issue, this works in nutch-1.x. We need to translate 
that logic to nutch-2.x if possible.

Thanks.
Alex.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3.1
>
> Attachments: NUTCH-1679-2.patch, NUTCH-1679.patch
>
>





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2015-04-01 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-961:

Attachment: nutch-2.x-boilerpipe.patch

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
> NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
> NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.





[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2015-04-01 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391558#comment-14391558
 ] 

Alexander Kingson commented on NUTCH-961:
-

Hello,

Since I was not getting satisfactory results after upgrading to boilerpipe
1.2.0 with parse-tika (with boilerpipe support), I have added some code to the
nutch-2.x parser to get the same results as the boilerpipe demo website. I used
some code from .v2.patch.
Attaching the patch.

Thanks.
Alex.

> Expose Tika's boilerpipe support
> 
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: BoilerpipeExtractorRepository.java, 
> NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
> NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
> NUTCH-961-1.5-1.patch, NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, 
> NUTCH-961-2.1-v2.patch, NUTCH-961v2.patch
>
>





[jira] [Commented] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-01-29 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297635#comment-14297635
 ] 

Alexander Kingson commented on NUTCH-1922:
--

Hello,

My comment is the same for this patch. A data store that is not closed can
cause memory leaks, I believe.

Also, this patch does not solve the issues with inlinks and outlinks.
Currently, they are not saved correctly.
I would suggest investigating the nutch-1.x code to see how it handles inlinks
and outlinks, and transferring that logic to nutch-2.x.
I will do it when I get some time. In the meantime, if someone investigates and
lets us know how nutch-1.x works, I would greatly appreciate it.

Thanks.
Alex.

> DbUpdater overwrites fetch status for URLs from previous batches, causes 
> repeated re-fetches
> 
>
> Key: NUTCH-1922
> URL: https://issues.apache.org/jira/browse/NUTCH-1922
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3
>Reporter: Gerhard Gossen
> Fix For: 2.4
>
> Attachments: NUTCH-1922.patch
>
>
> When Nutch 2 finds a link to a URL that was crawled in a previous batch, it 
> resets the fetch status of that URL to {{unfetched}}. This makes this URL 
> available for a re-fetch, even if its crawl interval is not yet over.
> To reproduce, using version 2.3:
> {code}
> # Nutch configuration
> ant runtime
> cd runtime/local
> mkdir seeds
> echo http://www.l3s.de/~gossen/nutch/a.html > seeds/1.txt
> bin/crawl seeds test 2
> {code}
> This uses two files {{a.html}} and {{b.html}} that link to each other.
> In batch 1, Nutch downloads {{a.html}} and discovers the URL of {{b.html}}. 
> In batch 2, Nutch downloads {{b.html}} and discovers the link to {{a.html}}. 
> This should update the score and link fields of {{a.html}}, but not the fetch 
> status. However, when I run {{bin/nutch readdb -crawlId test -url 
> http://www.l3s.de/~gossen/nutch/a.html | grep -a status}}, it returns 
> {{status: 1 (status_unfetched)}}.
> Expected would be {{status: 2 (status_fetched)}}.
> The reason seems to be that DbUpdateReducer assumes that [links to a URL not 
> processed in the same batch always belong to new 
> pages|https://github.com/apache/nutch/blob/release-2.3/src/java/org/apache/nutch/crawl/DbUpdateReducer.java#L97-L109].
>  Before NUTCH-1556, all pages in the crawl DB were processed by the DBUpdate 
> job, but that change skipped all pages with a different batch ID, so I assume 
> that this introduced this behavior.
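
The expected behavior described above (update the score and link fields,
leave the fetch status alone) can be sketched as follows. This is a minimal
illustration under stated assumptions, not the DbUpdateReducer code: Page,
applyInlink, and the plain Map used as the crawl DB are hypothetical; only
the status values 1 (unfetched) and 2 (fetched) are taken from the readdb
output quoted in the issue.

```java
import java.util.HashMap;
import java.util.Map;

public class PreserveStatusSketch {

  public static final int STATUS_UNFETCHED = 1;
  public static final int STATUS_FETCHED = 2;

  /** Hypothetical page record; not Nutch's WebPage. */
  public static final class Page {
    public int status;
    public float score;
    public final Map<String, String> inlinks = new HashMap<>();

    public Page(int status, float score) {
      this.status = status;
      this.score = score;
    }
  }

  /**
   * Records a link from source to target. A page that is already known keeps
   * its status; only a target that is missing from the db starts as unfetched.
   */
  public static void applyInlink(Map<String, Page> db, String target,
                                 String source, String anchor, float scoreDelta) {
    Page page = db.get(target);
    if (page == null) {
      page = new Page(STATUS_UNFETCHED, 0.0f);
      db.put(target, page);
    }
    page.score += scoreDelta;
    page.inlinks.put(source, anchor);
  }

  public static void main(String[] args) {
    Map<String, Page> db = new HashMap<>();
    db.put("a.html", new Page(STATUS_FETCHED, 1.0f)); // fetched in batch 1

    // Batch 2: b.html links back to a.html.
    applyInlink(db, "a.html", "b.html", "back to a", 0.5f);
    System.out.println("status of a.html: " + db.get("a.html").status);
  }
}
```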





[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:46 AM:


Hi,

I was suggesting closing the datastore by adding this function

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    store.close();
  }

to the reducer class.

Also, I found another issue: inlinks data is wiped out in subsequent updatedb
stages. To solve this issue I replaced these lines

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with

if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure that it
solves the issue permanently. Basically, if not clearing the inlinks data on
each call to the reduce function does not cause inlinks data to overlap between
keys, then this code change solves the issue.
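
The effect of the extra condition can be shown with a small sketch. It
assumes that, after clearing, the reducer refills the inlinks from the
incoming score data; plain maps stand in for the page's inlinks and for
inlinkedScoreData, and updateInlinks is a hypothetical helper, not Nutch
code.

```java
import java.util.HashMap;
import java.util.Map;

public class InlinkGuardSketch {

  /**
   * Refreshes stored inlinks from the incoming ones.
   * With guarded == false the stored inlinks are always cleared first.
   * With guarded == true they are cleared only when new inlinks arrived,
   * so an update that carries no inlinks leaves the stored ones intact.
   */
  public static void updateInlinks(Map<String, String> inlinks,
                                   Map<String, String> incoming,
                                   boolean guarded) {
    if (inlinks == null) {
      return;
    }
    if (!guarded || !incoming.isEmpty()) {
      inlinks.clear();
    }
    inlinks.putAll(incoming);
  }

  public static void main(String[] args) {
    Map<String, String> stored = new HashMap<>();
    stored.put("http://example.org/b", "link to a");

    // A later updatedb pass that brings no inlink data for this page.
    updateInlinks(stored, new HashMap<>(), true);
    System.out.println("inlinks kept: " + stored.size());
  }
}
```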

Thanks.
Alex.


> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3
>
> Attachments: NUTCH-1679.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson edited comment on NUTCH-1679 at 7/22/14 12:15 AM:


Hi,

I was suggesting closing the datastore by adding this function

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    store.close();
  }

to the reducer class.

Also, I found another issue: inlink data is wiped out in subsequent updatedb
stages. To solve this issue I replaced these lines

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with

if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure that it
solves the issue permanently.

Thanks.
Alex.


> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3
>
> Attachments: NUTCH-1679.patch
>
>





[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-07-21 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069567#comment-14069567
 ] 

Alexander Kingson commented on NUTCH-1679:
--

Hi,

I was suggesting closing the datastore by adding this function

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    store.close();
  }

to the reducer class.

Also, I found another issue: inlink data is wiped out in subsequent updatedb
stages. To solve this issue I replaced these lines

if (page.getInlinks() != null) {
  page.getInlinks().clear();
}

with

if (page.getInlinks() != null && !inlinkedScoreData.isEmpty()) {
  page.getInlinks().clear();
}

This code change has been tested in only one case, and I am not sure that it
solves the issue permanently.

Thanks.
Alex.

> UpdateDb using batchId, link may override crawled page.
> ---
>
> Key: NUTCH-1679
> URL: https://issues.apache.org/jira/browse/NUTCH-1679
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.2.1
>Reporter: Tien Nguyen Manh
>Priority: Critical
> Fix For: 2.3
>
> Attachments: NUTCH-1679.patch
>
>





[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-04-17 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1714:
-

Attachment: NUTCH-1714_NUTCH-1714_v2_v3.patch

Replacing the patch with one that accounts for changes to the index-metadata
plugin. With these changes Nutch compiles, but the plugin itself has not been
tested.

> Nutch 2.x upgrade to use GORA_94 branch
> ---
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Alparslan Avcı
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
> details in this issue.





[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-04-17 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1714:
-

Attachment: (was: NUTCH-1714_NUTCH-1714_v2_v3.patch)

> Nutch 2.x upgrade to use GORA_94 branch
> ---
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Alparslan Avcı
> Attachments: NUTCH-1714.patch, NUTCH-1714v2.patch
>
>





[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-04-15 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1714:
-

Attachment: NUTCH-1714_NUTCH-1714_v2_v3.patch

Hello, 

Attaching a patch that combines the two previous patches, modified to suit the
current trunk, plus some additional code changes discussed in
http://lucene.472066.n3.nabble.com/nutch-2-x-with-hbase-filter-option-td4121242.html#a4130227

Thanks.
Alex.

> Nutch 2.x upgrade to use GORA_94 branch
> ---
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Alparslan Avcı
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch
>
>





[jira] [Commented] (NUTCH-945) Indexing to multiple SOLR Servers

2013-01-28 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564953#comment-13564953
 ] 

Alexander Kingson commented on NUTCH-945:
-

I see that the issue is unresolved. Is this patch working?

> Indexing to multiple SOLR Servers
> -
>
> Key: NUTCH-945
> URL: https://issues.apache.org/jira/browse/NUTCH-945
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.2
>Reporter: Charan Malemarpuram
> Fix For: 2.2
>
> Attachments: MurmurHashPartitioner.java, 
> NonPartitioningPartitioner.java, patch-NUTCH-945.txt
>
>
> It would be nice to have a default Indexer in Nutch that can submit docs to
> multiple SOLR Servers.
> > Partitioning is always the question when writing to multiple SOLR Servers.
> > Default partitioning can be a simple hashcode-based distribution with
> > additional hooks for customization.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-11-08 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493413#comment-13493413
 ] 

Alexander Kingson commented on NUTCH-1457:
--

Hi,

Could you please give me more details on this? It is my understanding that in
Nutch-1.x updatedb updates only the URLs generated in each step, and that the
way to implement the same processing in Nutch-2.x is to add batchId to the
updatedb command.

Thanks.
Alexander.

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>




[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2012-11-05 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491124#comment-13491124
 ] 

Alexander Kingson commented on NUTCH-1457:
--

Can we use batchId in the update command and update only those entries that
have the given batchId as their generate_mark value?
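
A minimal sketch of that selection, assuming the generate mark of each entry
is available as a plain string: the map from URL to mark and the
selectForUpdate helper are hypothetical, and the real update job would read
the mark from the stored page rather than from a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchIdFilterSketch {

  /**
   * Returns the URLs whose generate mark equals the given batchId.
   * Entries with a different mark, or with no mark at all, are skipped.
   */
  public static List<String> selectForUpdate(Map<String, String> generateMarks,
                                             String batchId) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, String> entry : generateMarks.entrySet()) {
      if (batchId.equals(entry.getValue())) {
        selected.add(entry.getKey());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the output order is stable.
    Map<String, String> marks = new LinkedHashMap<>();
    marks.put("http://example.org/a", "batch-1");
    marks.put("http://example.org/b", "batch-2");
    marks.put("http://example.org/c", null); // never generated

    System.out.println(selectForUpdate(marks, "batch-2"));
  }
}
```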

> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> 
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1456) Updater not setting batchId in markers correctly.

2012-09-05 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1456:
-

Attachment: setUpdtMrkr.patch

> Updater not setting batchId in markers correctly.
> -
>
> Key: NUTCH-1456
> URL: https://issues.apache.org/jira/browse/NUTCH-1456
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: setUpdtMrkr.patch
>
>
> The db updater job is not setting batchId in markers correctly. (Noticed 
> thanks to various reporters on mailing list.)



[jira] [Commented] (NUTCH-1411) nutchgora fetcher.store.content does not work

2012-09-04 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447961#comment-13447961
 ] 

Alexander Kingson commented on NUTCH-1411:
--

Hi,
This patch was tested against the nutch-2.0 release. Could you please make sure
that the patch was applied successfully and that Nutch was rebuilt afterwards?
Also, I suggest cleaning the HBase datastore before testing.

Thanks.
Alex.

> nutchgora fetcher.store.content does not work
> -
>
> Key: NUTCH-1411
> URL: https://issues.apache.org/jira/browse/NUTCH-1411
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
>Priority: Minor
> Attachments: storeContent.patch
>
>
> http://lucene.472066.n3.nabble.com/parse-and-solrindex-in-nutch-2-0-td3991247.html
> The property fetcher.store.content doesn't do anything. Content is always 
> stored. Fix or remove property, what do you think?



[jira] [Comment Edited] (NUTCH-1411) nutchgora fetcher.store.content does not work

2012-07-06 Thread Alexander Kingson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408311#comment-13408311
 ] 

Alexander Kingson edited comment on NUTCH-1411 at 7/6/12 8:54 PM:
--

This patch has been tested with MySQL storage only.

  was (Author: alxksn):
This patch is tested mysql storage, only.
  
> nutchgora fetcher.store.content does not work
> -
>
> Key: NUTCH-1411
> URL: https://issues.apache.org/jira/browse/NUTCH-1411
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
>Priority: Minor
> Attachments: storeContent.patch
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1411) nutchgora fetcher.store.content does not work

2012-07-06 Thread Alexander Kingson (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kingson updated NUTCH-1411:
-

Attachment: storeContent.patch

This patch has been tested with MySQL storage only.

> nutchgora fetcher.store.content does not work
> -
>
> Key: NUTCH-1411
> URL: https://issues.apache.org/jira/browse/NUTCH-1411
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Ferdy Galema
>Priority: Minor
> Attachments: storeContent.patch
>
>
