Re: Merging CrawlDBs
Thanks for the info Sebastian.

> Why do you want to merge the data structures?

To inform my crawl strategy, I am trying to see what is possible, and being able to run concurrent crawls might get around some limitations in the software. I am currently seeding a set of domains to act as a foundation for my crawling, and I am performing more targeted crawls (by domain). As I discover more domains I want to crawl, I want to see whether I can kick off a new crawler while another one is in progress and then merge the two later on. I expect that once I have a solid foundation, I will probably have only a single crawler running on a single DB.
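For illustration, a minimal sketch of the merge step this plan ends with, assuming Nutch 1.x's mergedb tool and two hypothetical crawl directories crawl_a and crawl_b (the merged output DB is named first and must not already exist):

  # Fold two independently built CrawlDbs into a new one. For URLs present
  # in both, mergedb keeps a single entry (the more recently fetched one,
  # as far as I recall). All paths here are hypothetical.
  bin/nutch mergedb crawl_merged/crawldb crawl_a/crawldb crawl_b/crawldb

  # Optionally re-apply URL filters and normalizers while merging:
  bin/nutch mergedb crawl_merged/crawldb crawl_a/crawldb crawl_b/crawldb -filter -normalize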
Re: Merging CrawlDBs
Hi Kamil,

> I was wondering if this script is advisable to use?

I haven't tried the script itself, but I have used some of the underlying commands (mergedb, etc.).

> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)

Of course, some of the commands are obsolete. A long time ago, Nutch used Lucene index shards directly. Now the management of indexes (including merging of shards) is delegated to Solr or Elasticsearch.

> I plan to use it for crawls of non-overlapping urls.

... just a few thoughts about this particular use case:

Why do you want to merge the data structures?

- if they're disjoint, there is no need for it
- all operations (CrawlDb: generate, update, etc.) are much faster on smaller structures

If required: most of the Nutch jobs can read multiple segments or CrawlDbs. However, it might be that the command-line tool expects only a single CrawlDb or segment.

- we could extend the command-line params
- or just copy the sequence files into one single path

~Sebastian
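As a concrete example of "most of the Nutch jobs can read multiple segments or CrawlDbs": the updatedb and invertlinks tools already accept several inputs in one run. A sketch with hypothetical paths, assuming Nutch 1.x usage:

  # Update one CrawlDb from several segments in a single job:
  bin/nutch updatedb crawl/crawldb crawl/segments/20230201000000 crawl/segments/20230202000000

  # Or point the tools at a whole segments directory with -dir:
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments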
Merging CrawlDBs
Hi,

I am testing how merging crawls works and found this script: https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.

I was wondering if this script is advisable to use? I plan to use it for crawls of non-overlapping URLs.

I am wary of using it since it is located under "Archive & Legacy" on the wiki. But after running some tests it seems to function correctly. I only had to remove the merge command ($nutch_dir/nutch merge $index_dir $new_indexes) since that is no longer a command.

I am not necessarily looking for a list of potential issues (if the list is long), just trying to understand why it might be under the archive.

Kamil
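For context, with the obsolete Lucene index merge removed, the core of that wiki script reduces to the three merge tools that still exist. A rough sketch, assuming Nutch 1.x and hypothetical directory names:

  # Merge a freshly finished crawl ($new) into a master crawl ($master).
  # The merge tools write to a new output directory, so use *_tmp and swap.
  master=crawl/master    # hypothetical paths
  new=crawl/new

  bin/nutch mergedb $master/crawldb_tmp $master/crawldb $new/crawldb
  bin/nutch mergelinkdb $master/linkdb_tmp $master/linkdb $new/linkdb
  bin/nutch mergesegs $master/segments_tmp -dir $new/segments
  # ...then move the *_tmp outputs into place of the old directories.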
Re: Merging crawldbs and linkdbs during incremental crawl
Hi,

Just checking if anyone could comment on my post below. :)

Thanks in advance.

Safdar
Merging crawldbs and linkdbs during incremental crawl
Hi,

I'm trying to build an incremental crawler using the various Nutch crawl tools (generate + fetch/parse + updatedb etc.). By incremental I mean that I want crawled pages to show up in the index quickly, instead of waiting until the end of the crawl. So I'd like to index as soon as I have fetched a segment. The requirement to invoke updatedb and invertlinks at the end of each fetch+parse phase (before solrindex and before the next generate) can slow down this crawl. Instead, here is what I'm thinking of doing for each segment (after fetch+parse):

1) Invoke updatedb and invertlinks against local crawldb and linkdb folders (within the segment).
2) Invoke solrindex using these local crawldb and linkdb folders.
3) Repeat steps 1-2 for a few pre-generated segments (I would have pre-generated several mutually-exclusive segments before step 1).
4) *Merge* these local crawldbs and linkdbs into the master crawldb and linkdb.
5) Proceed to generate the next set of segments from the merged master crawldb and linkdb.

Do you see any problem with this approach? More specifically:

a) Is an updatedb (to a local crawldb) followed by a mergedb (to the master crawldb) the same as doing an updatedb directly to the master crawldb?
b) Similarly, is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the master linkdb) the same as doing an invertlinks directly to the master linkdb?

Thanks in advance!

Regards,
Safdar
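To make the plan concrete, here is a hedged sketch of steps 1, 2 and 4 for a single segment, assuming Nutch 1.x command-line usage; the local_* directory names are invented here for illustration, and in older releases solrindex may take the linkdb as a positional argument rather than via -linkdb:

  seg=crawl/segments/20120611080000   # one pre-generated segment (hypothetical)

  # 1) Update segment-local DBs instead of the master ones
  #    (assuming updatedb/invertlinks create the target DB if it is missing):
  bin/nutch updatedb $seg/local_crawldb $seg
  bin/nutch invertlinks $seg/local_linkdb $seg

  # 2) Index this segment right away against the local DBs:
  bin/nutch solrindex http://localhost:8983/solr $seg/local_crawldb -linkdb $seg/local_linkdb $seg

  # 4) Later, fold the local DBs into the master CrawlDb and LinkDb:
  bin/nutch mergedb crawl/crawldb_new crawl/crawldb $seg/local_crawldb
  bin/nutch mergelinkdb crawl/linkdb_new crawl/linkdb $seg/local_linkdb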