Re: Merging CrawlDBs

2023-02-02 Thread Kamil Mroczek
Thanks for the info, Sebastian.

Re: Why do you want to merge the data structures?

To help inform my crawl strategy, I am trying to see what is possible, and
it feels like being able to run concurrent crawls might work around any
limitations in the software. I am currently seeding a set of domains to
act as a foundation for my crawling, and I am performing more targeted
crawls (by domain). As I discover more domains I want to crawl, I want to
see whether I can kick off a new crawler while another one is in progress
and then merge the two later on. I expect that once I have a solid
foundation, I will probably only have a single crawler running on a
single DB.



On Thu, Feb 2, 2023 at 4:09 AM Sebastian Nagel wrote:

> Hi Kamil,
>
>  > I was wondering whether this script is advisable to use.
>
> I haven't tried the script itself, but I have used some of the underlying
> commands - mergedb, etc.
>
>  > merge command ($nutch_dir/nutch merge $index_dir $new_indexes)
>
> Of course, some of the commands are obsolete. A long time ago, Nutch
> used Lucene index shards directly. Now the management of indexes
> (including merging of shards) is delegated to Solr or Elasticsearch.
>
>
>  > I plan to use it for crawls of non-overlapping URLs.
>
> ... just a few thoughts about this particular use case:
>
> Why do you want to merge the data structures?
>
> - if they're disjoint, there is no need for it
> - all operations (CrawlDb: generate, update, etc.)
>are much faster on smaller structures
>
> If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
> However, it might be that the command-line tool expects only a single
> CrawlDb or segment.
> - we could extend the command-line params
> - or just copy the sequence files into one single path
>
> ~Sebastian
>
> On 2/2/23 01:54, Kamil Mroczek wrote:
> > Hi,
> >
> > I am testing how merging crawls works and found this script
> > https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.
> >
> > I was wondering whether this script is advisable to use. I plan to use it
> > for crawls of non-overlapping URLs.
> >
> > I am wary of using it since it is located under "Archive & Legacy" on the
> > wiki. But after running some tests, it seems to function correctly. I only
> > had to remove the merge command ($nutch_dir/nutch merge $index_dir
> > $new_indexes) since that is no longer a command.
> >
> > I am not necessarily looking for a list of potential issues (if the list
> is
> > long), just trying to understand why it might be under the archive.
> >
> > Kamil
> >
>


Re: Merging CrawlDBs

2023-02-02 Thread Sebastian Nagel

Hi Kamil,

> I was wondering whether this script is advisable to use.

I haven't tried the script itself, but I have used some of the underlying
commands - mergedb, etc.

> merge command ($nutch_dir/nutch merge $index_dir $new_indexes)

Of course, some of the commands are obsolete. A long time ago, Nutch
used Lucene index shards directly. Now the management of indexes
(including merging of shards) is delegated to Solr or Elasticsearch.


> I plan to use it for crawls of non-overlapping URLs.

... just a few thoughts about this particular use case:

Why do you want to merge the data structures?

- if they're disjoint, there is no need for it
- all operations (CrawlDb: generate, update, etc.)
  are much faster on smaller structures

If required: most of the Nutch jobs can read multiple segments or CrawlDbs.
However, it might be that the command-line tool expects only a single
CrawlDb or segment.
- we could extend the command-line params
- or just copy the sequence files into one single path
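
For illustration, merging two disjoint CrawlDbs with mergedb could look
roughly like this (untested sketch, all paths are placeholders):

  # merge two CrawlDbs into a new output CrawlDb
  # (optionally add -normalize and/or -filter to apply URL normalizers
  #  and filters while merging)
  bin/nutch mergedb crawl/crawldb_merged crawl_a/crawldb crawl_b/crawldb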

~Sebastian

On 2/2/23 01:54, Kamil Mroczek wrote:

Hi,

I am testing how merging crawls works and found this script
https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.

I was wondering whether this script is advisable to use. I plan to use it for
crawls of non-overlapping URLs.

I am wary of using it since it is located under "Archive & Legacy" on the
wiki. But after running some tests, it seems to function correctly. I only
had to remove the merge command ($nutch_dir/nutch merge $index_dir
$new_indexes) since that is no longer a command.

I am not necessarily looking for a list of potential issues (if the list is
long), just trying to understand why it might be under the archive.

Kamil



Merging CrawlDBs

2023-02-01 Thread Kamil Mroczek
Hi,

I am testing how merging crawls works and found this script
https://cwiki.apache.org/confluence/display/NUTCH/MergeCrawl.

I was wondering whether this script is advisable to use. I plan to use it for
crawls of non-overlapping URLs.

I am wary of using it since it is located under "Archive & Legacy" on the
wiki. But after running some tests, it seems to function correctly. I only
had to remove the merge command ($nutch_dir/nutch merge $index_dir
$new_indexes) since that is no longer a command.
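
If it is relevant: after dropping the index merge, the merging itself seems
to boil down to the standard merge tools, something along these lines (the
directory variables are placeholders in the style of the script, not its
exact contents):

  # merge the CrawlDbs into the destination crawldb
  $nutch_dir/nutch mergedb $crawl_dir/crawldb $crawldbs_to_merge

  # merge the LinkDbs
  $nutch_dir/nutch mergelinkdb $crawl_dir/linkdb $linkdbs_to_merge

  # merge the segments into a single new segment
  $nutch_dir/nutch mergesegs $crawl_dir/MERGEDsegments -dir $segments_dir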

I am not necessarily looking for a list of potential issues (if the list is
long), just trying to understand why it might be under the archive.

Kamil


Re: Merging crawldbs and linkdbs during incremental crawl

2012-06-12 Thread Ali Safdar Kureishy
Hi,

Just checking if anyone could comment on my post below. :)

Thanks in advance.

Safdar


On Mon, Jun 11, 2012 at 8:10 AM, Ali Safdar Kureishy safdar.kurei...@gmail.com wrote:

 Hi,

 I'm trying to build an incremental crawler, using the various Nutch
 crawl tools (generate + fetch/parse + updatedb etc.). By incremental I
 mean I want crawled pages to show up quickly in the index (instead of
 waiting till the end of the crawl). So, I'd like to index as soon as I have
 fetched a segment.

 The requirement to invoke update-db and invert-links at the end of each
 fetch+parse phase (before solrindex and before the next generate) can slow
 down this crawl. Instead, here is what I'm thinking of doing for each
 segment (after fetch+parse):
 1) Invoke update-db and invert-links to local crawldb and linkdb folders
 (within the segment).
 2) Invoke solr-index using these local crawldb and linkdb folders.
 3) Do steps 1-2 for a few pre-generated segments (I would have
 pre-generated several mutually-exclusive segments before step 1)
 4) *Merge* these local crawldbs and linkdbs into the master crawldb and
 linkdb
 5) Proceed to generate the next set of segments from the merged master
 crawldb and linkdb

 Do you see any problem with this approach? More specifically:
 a) is an updatedb (to a local crawldb) followed by a mergedb (to the
 master crawldb) the same as doing an updatedb directly to the master
 crawldb? And similarly,
 b) is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the
 master linkdb) the same as doing an invertlinks directly to the master
 linkdb?

 Thanks in advance!

 Regards,
 Safdar



Merging crawldbs and linkdbs during incremental crawl

2012-06-10 Thread Ali Safdar Kureishy
Hi,

I'm trying to build an incremental crawler, using the various Nutch crawl
tools (generate + fetch/parse + updatedb etc.). By incremental I mean I
want crawled pages to show up quickly in the index (instead of waiting till
the end of the crawl). So, I'd like to index as soon as I have fetched a
segment.

The requirement to invoke update-db and invert-links at the end of each
fetch+parse phase (before solrindex and before the next generate) can slow
down this crawl. Instead, here is what I'm thinking of doing for each
segment (after fetch+parse):
1) Invoke update-db and invert-links to local crawldb and linkdb folders
(within the segment).
2) Invoke solr-index using these local crawldb and linkdb folders.
3) Do steps 1-2 for a few pre-generated segments (I would have
pre-generated several mutually-exclusive segments before step 1)
4) *Merge* these local crawldbs and linkdbs into the master crawldb and
linkdb
5) Proceed to generate the next set of segments from the merged master
crawldb and linkdb (a rough command sketch follows below)
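
In command terms, the per-segment loop would look roughly like this (just a
sketch; the paths, segment names and Solr URL are placeholders, and the
exact solrindex arguments differ between Nutch versions):

  # 1) update per-segment (local) crawldb and linkdb from one
  #    fetched+parsed segment
  bin/nutch updatedb local/crawldb segments/20120611000001
  bin/nutch invertlinks local/linkdb segments/20120611000001

  # 2) index that segment right away using the local dbs
  bin/nutch solrindex http://localhost:8983/solr/ local/crawldb \
    local/linkdb segments/20120611000001

  # 3) repeat 1-2 for the remaining pre-generated segments ...

  # 4) fold the local dbs into the master ones (merge into fresh dirs,
  #    then swap them in place of the old master dirs)
  bin/nutch mergedb master/crawldb_new master/crawldb local/crawldb
  bin/nutch mergelinkdb master/linkdb_new master/linkdb local/linkdb

  # 5) generate the next batch of segments from the merged master crawldb
  bin/nutch generate master/crawldb_new segments -topN 10000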

Do you see any problem with this approach? More specifically:
a) is an updatedb (to a local crawldb) followed by a mergedb (to the master
crawldb) the same as doing an updatedb directly to the master crawldb? And
similarly,
b) is an invertlinks (to a local linkdb) followed by a mergelinkdb (to the
master linkdb) the same as doing an invertlinks directly to the master
linkdb?

Thanks in advance!

Regards,
Safdar