[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Fix Version/s: 1.1

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch


 When using Nutch on a large scale (e.g. billions of URLs), the operations
 related to the crawlDB (generate - update) tend to take up most of the
 time. One solution is to keep such operations to a minimum by generating
 several fetchlists in one pass over the crawlDB and then updating the crawlDB
 only once for several segments. The existing Generator allows several
 successive runs by generating a copy of the crawlDB and marking the URLs to be
 fetched. In practice this approach does not work well, as we need to read the
 whole crawlDB as many times as we generate a segment.
 The patch attached contains an implementation of a MultiGenerator which can
 generate several fetchlists by reading the crawlDB only once. The
 MultiGenerator differs from the Generator in other aspects:
 * can filter the URLs by score
 * normalisation is optional
 * IP resolution is done ONLY on the entries which have been selected for
 fetching (during the partitioning). Running the IP resolution on the whole
 crawlDB is too slow to be usable on a large scale
 * can cap the number of URLs per host or domain (but not by IP)
 * can choose to partition by host, domain or IP
 Typically the same unit (e.g. domain) would be used for capping the URLs and
 for partitioning; however, as we can't count the number of URLs by IP,
 another unit must be chosen for the cap when partitioning by IP.
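 To illustrate the choice of unit described above, here is a minimal Python
 sketch. It is not the code of the patch: the names partition_key and
 partition_for, the mode values and the naive "last two labels" domain rule
 are assumptions made for the example. The resolver is passed in as a
 parameter so that the IP lookup only runs for URLs that actually reach the
 partitioning step.

```python
import socket
import zlib
from urllib.parse import urlsplit


def partition_key(url, mode="host", resolve=socket.gethostbyname):
    """Return the unit (host, domain or IP) a URL is grouped by.

    Hypothetical helper: 'mode' and 'resolve' are illustrative parameters,
    not options taken from the patch.
    """
    host = urlsplit(url).hostname or ""
    if mode == "host":
        return host
    if mode == "domain":
        # Naive assumption: the domain is the last two labels of the host.
        # A real implementation would need a public-suffix list.
        return ".".join(host.split(".")[-2:])
    if mode == "ip":
        # The lookup happens here, i.e. only for entries being partitioned.
        return resolve(host)
    raise ValueError("unknown mode: %r" % mode)


def partition_for(url, num_partitions, mode="host",
                  resolve=socket.gethostbyname):
    """Map a URL to a partition so that one unit always lands together."""
    key = partition_key(url, mode, resolve)
    # crc32 is used because it is stable across runs, unlike str hash().
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

 With mode="host" or mode="domain" the same key can also serve as the
 counter for a per-unit cap. With mode="ip" the key is only known after the
 lookup, which is why the cap has to be counted on another unit.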
 We found that using a filter on the score can dramatically improve the 
 performance as this reduces the amount of data being sent to the reducers.
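 The overall idea (one read of the crawlDB, an early score filter, several
 fetchlists as output) can be sketched as follows. This is a single-process
 simplification written for illustration, not the MapReduce job from the
 patch; the function name generate_segments and the parameters min_score and
 max_per_host are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def generate_segments(crawldb, top_n, max_num_segments,
                      min_score=0.0, max_per_host=None):
    """Build up to max_num_segments fetchlists from one pass over crawldb.

    crawldb is an iterable of (url, score) pairs and is read exactly once.
    """
    # Filtering by score first means fewer entries reach the later,
    # more expensive steps (the "reducers" in the real job).
    candidates = [(score, url) for url, score in crawldb
                  if score >= min_score]
    candidates.sort(key=lambda entry: entry[0], reverse=True)

    segments = [[] for _ in range(max_num_segments)]
    per_host = [defaultdict(int) for _ in range(max_num_segments)]

    for _score, url in candidates:
        host = urlsplit(url).hostname or ""
        # Put the URL in the first segment that still has room for it.
        for segment, counts in zip(segments, per_host):
            if len(segment) >= top_n:
                continue
            if max_per_host is not None and counts[host] >= max_per_host:
                continue
            segment.append(url)
            counts[host] += 1
            break

    # Empty segments are dropped, so fewer than max_num_segments may be
    # returned when there are not enough eligible URLs.
    return [segment for segment in segments if segment]
```

 URLs that do not fit in any segment are simply left out of this run, and
 would remain candidates for a later one.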
 The MultiGenerator is called via:
 nutch org.apache.nutch.crawl.MultiGenerator ...
 with the following options:
 MultiGenerator crawldb segments_dir [-force] [-topN N] [-numFetchers
 numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
 where most parameters are similar to those of the default Generator, apart
 from:
 -noNorm : skips normalisation (explicit option)
 -topN : max number of URLs per segment
 -maxNumSegments : the actual number of segments generated could be less than
 the max value selected, e.g. when not enough URLs are available for fetching
 and they fit in fewer segments
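 As a rough worked example of how -topN and -maxNumSegments interact, the
 sketch below estimates the number of segments produced. The helper name
 expected_num_segments is hypothetical, and the formula assumes that segments
 are simply filled up to topN one after the other, ignoring any per-host or
 per-domain cap.

```python
import math


def expected_num_segments(num_eligible_urls, top_n, max_num_segments):
    """Estimate how many segments are generated (simplified assumption)."""
    if top_n <= 0 or max_num_segments <= 0:
        raise ValueError("top_n and max_num_segments must be positive")
    # Enough segments to hold every eligible URL, capped by the maximum.
    return min(max_num_segments, math.ceil(num_eligible_urls / top_n))
```

 Under these assumptions, 2500 eligible URLs with -topN 1000 and
 -maxNumSegments 5 would give 3 segments rather than 5.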
 Please give it a try and let me know what you think of it.
 Julien Nioche
 http://www.digitalpebble.com
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-22 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v3.patch

New patch which reintroduces the 'generator.update.crawldb' functionality.

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.1

 Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch





[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: (was: NUTCH-762-MultiGenerator.patch)

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch





[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2010-03-06 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-v2.patch

Improved version of the patch:

- fixed a few minor bugs
- renamed Generator to OldGenerator
- renamed MultiGenerator to Generator
- fixed test classes to use the new Generator
- documented parameters in nutch-default.xml
- added the names of the segments to the LOG to facilitate integration in scripts
- PartitionUrlByHost is replaced by URLPartitioner, which is more generic

I decided to keep the old version for the time being, but we might as well get
rid of it altogether. The new version is now used in the Crawl class.

It would be nice if people could give it a good try before we put it in 1.1.

Thanks

Julien

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Attachments: NUTCH-762-v2.patch





[jira] Updated: (NUTCH-762) Alternative Generator which can generate several segments in one parse of the crawlDB

2009-11-03 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-762:


Attachment: NUTCH-762-MultiGenerator.patch

Patch for the MultiGenerator

 Alternative Generator which can generate several segments in one parse of the 
 crawlDB
 -

 Key: NUTCH-762
 URL: https://issues.apache.org/jira/browse/NUTCH-762
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Affects Versions: 1.0.0
Reporter: Julien Nioche
 Attachments: NUTCH-762-MultiGenerator.patch

