Hello. I am programming in Java, using Nutch to retrieve PDF documents.
I want to apply a URL filter so that only the last iteration of the
generate-fetch loop is restricted to PDF documents, but I cannot get it
to work: if I load the pdf-filter before the first iteration it works
fine and fetches only PDFs, but if I switch to it in the last iteration
it is simply ignored, and all of the reachable pages are generated and
fetched.
Here is the code where I change the filter (it is inside the
generate-fetch loop):
if (i == depth - 1) {
    conf = NutchConfiguration.create();
    conf.addDefaultResource("pdf-filter.xml");
    job = new NutchJob(conf);
    generator = new Generator(conf);
    fetcher = new Fetcher(conf);
    crawlDbTool = new CrawlDb(conf);
}
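For reference, a minimal sketch of an equivalent way to point the regex URL filter at the PDF rules programmatically, assuming the standard Nutch 1.x urlfilter.regex.file property (it reuses the same variables as the snippet above; I have not verified that it behaves any differently):

// Sketch: set the regex filter file directly on the configuration
// instead of pulling it in through an extra resource file.
Configuration pdfConf = NutchConfiguration.create();
pdfConf.set("urlfilter.regex.file", "pdf-urlfilter.txt");
generator = new Generator(pdfConf);
fetcher = new Fetcher(pdfConf);
crawlDbTool = new CrawlDb(pdfConf);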
Here is the pdf-filter.xml:
<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>pdf-urlfilter.txt</value>
  </property>
</configuration>
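One thing I am not sure about is whether addDefaultResource() actually finds this file at runtime. A quick sanity check I could run (just a sketch, plain JDK calls):

// Quick check: can the classpath resolve pdf-filter.xml at all?
// A null result would mean addDefaultResource("pdf-filter.xml") silently loads nothing.
java.net.URL res = Thread.currentThread().getContextClassLoader().getResource("pdf-filter.xml");
System.out.println("pdf-filter.xml resolved to: " + res);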
Here is the pdf-urlfilter.txt:
# PDF url filter
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# accept only pdf files
+\.(PDF|pdf|Pdf)$
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# reject anything else
-.
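To see whether these rules behave as intended once the configuration is loaded, I could run something like the following sketch against the conf that has pdf-filter.xml loaded (assuming the Nutch 1.x org.apache.nutch.net.URLFilters / URLFilterException API):

// Sanity check (sketch): run two sample URLs through the configured filter chain.
// filter(url) should return the URL for an accepted PDF and null for a rejected page.
URLFilters filters = new URLFilters(conf);
try {
    System.out.println(filters.filter("http://example.com/paper.pdf"));   // expected: the URL
    System.out.println(filters.filter("http://example.com/index.html"));  // expected: null
} catch (URLFilterException e) {
    e.printStackTrace();
}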
And finally, the full crawl code, which contains the snippet shown above:
Configuration conf = NutchConfiguration.create();
JobConf job = new NutchJob(conf);
Path dir = new Path("tmp/directPDFSearch/" + new Date().getTime());
FileSystem fs;
try {
    fs = FileSystem.get(job);
    while (fs.exists(dir)) {
        dir = new Path("tmp/directPDFSearch/" + new Date().getTime());
    }
    fs.mkdirs(dir);
    int threads = job.getInt("fetcher.threads.fetch", 10);

    // Creation of the rootUrl file
    Path rootUrlDir = new Path(dir + "/rootUrlDir");
    FSDataOutputStream rootUrlWriter = fs.create(rootUrlDir);
    rootUrlWriter.write(rootURL.getBytes());
    rootUrlWriter.close();

    // Paths to our directories
    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indices = new Path(dir + "/indexes");
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + new Date().getTime());

    // Tools that we'll need
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    Fetcher fetcher = new Fetcher(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
    Indexer indexer = new Indexer(conf);
    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    IndexMerger merger = new IndexMerger(conf);

    // Initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

    int i;
    for (i = 0; i < depth; i++) {
        /*
         * Segment generation
         */
        if (i == depth - 1) {
            conf = NutchConfiguration.create();
            conf.addDefaultResource("pdf-filter.xml");
            job = new NutchJob(conf);
            generator = new Generator(conf);
            fetcher = new Fetcher(conf);
            crawlDbTool = new CrawlDb(conf);
        }
        Path segment = generator.generate(crawlDb, segments, -1, topN,
                System.currentTimeMillis(), false, false);
        if (segment == null) {
            break;
        }
        /*
         * Fetching process
         */
        fetcher.fetch(segment, threads);
        if (!Fetcher.isParsing(job)) {
            parseSegment.parse(segment);
        }
        /*
         * Crawl database updating
         */
        crawlDbTool.update(crawlDb, new Path[] { segment }, true, true);
    }
    if (i > 0) {
        /*
         * Links inversion
         */
        linkDbTool.invert(linkDb, segments, true, true, false);
        /*
         * Index creation
         */
        indexer.index(indices, crawlDb, linkDb, fs.listPaths(segments));
        dedup.dedup(new Path[] { indices });
        Path[] indicesPaths = fs.listPaths(indices);
        merger.merge(indicesPaths, index, tmpDir);
        result = pdfSearch(dir);
        //fs.delete(dir);
    }
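One detail I am unsure about: if I read the Nutch 1.x Generator API correctly, the generate(...) overload I call above takes a boolean "filter" flag as its sixth argument, and I am currently passing false there. Shown explicitly below as a sketch, in case that interacts with the URL filter:

// The call above, with the arguments spelled out (my reading of the 1.x signature):
// generate(crawlDb, segments, numLists, topN, curTime, filter, force)
Path segment = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis(), false /* filter */, false /* force */);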
Thank you for reading!