Hello. I am programming in Java, using Nutch to retrieve PDF documents.
I want to apply a URL filter so that only the last iteration of the
generate-fetch loop is restricted to PDF documents, but I cannot get it
to work: if I load the pdf-filter before the first iteration it works
fine and fetches only PDFs, but if I switch to it in the last iteration
it is simply ignored, and all of the reachable pages are generated and
fetched.
Here is the code where I change the filter (it is inside the
generate-fetch loop):
if (i == depth - 1) {
    conf = NutchConfiguration.create();
    conf.addDefaultResource("pdf-filter.xml");
    job = new NutchJob(conf);
    generator = new Generator(conf);
    fetcher = new Fetcher(conf);
    crawlDbTool = new CrawlDb(conf);
}
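For reference, a minimal sketch of an equivalent way to point the regex URL filter at the PDF rules programmatically, assuming the standard Nutch 1.x urlfilter.regex.file property (it reuses the same variables as the snippet above; I have not verified that it behaves any differently):

// Sketch: set the regex filter file directly on the configuration
// instead of pulling it in through an extra resource file.
Configuration pdfConf = NutchConfiguration.create();
pdfConf.set("urlfilter.regex.file", "pdf-urlfilter.txt");
generator = new Generator(pdfConf);
fetcher = new Fetcher(pdfConf);
crawlDbTool = new CrawlDb(pdfConf);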
Here is the pdf-filter.xml:
<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>urlfilter.regex.file</name>
    <value>pdf-urlfilter.txt</value>
  </property>
</configuration>
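One thing I am not sure about is whether addDefaultResource() actually finds this file at runtime. A quick sanity check I could run (just a sketch, plain JDK calls):

// Quick check: can the classpath resolve pdf-filter.xml at all?
// A null result would mean addDefaultResource("pdf-filter.xml") silently loads nothing.
java.net.URL res = Thread.currentThread().getContextClassLoader().getResource("pdf-filter.xml");
System.out.println("pdf-filter.xml resolved to: " + res);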
Here is the pdf-urlfilter.txt:
# PDF url filter
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# accept only pdf files
+\.(PDF|pdf|Pdf)$
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# reject anything else
-.
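To see whether these rules behave as intended once the configuration is loaded, I could run something like the following sketch against the conf that has pdf-filter.xml loaded (assuming the Nutch 1.x org.apache.nutch.net.URLFilters / URLFilterException API):

// Sanity check (sketch): run two sample URLs through the configured filter chain.
// filter(url) should return the URL for an accepted PDF and null for a rejected page.
URLFilters filters = new URLFilters(conf);
try {
    System.out.println(filters.filter("http://example.com/paper.pdf"));   // expected: the URL
    System.out.println(filters.filter("http://example.com/index.html"));  // expected: null
} catch (URLFilterException e) {
    e.printStackTrace();
}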
And finally, the full crawl code, which contains the snippet shown above:
Configuration conf = NutchConfiguration.create();
JobConf job = new NutchJob(conf);
Path dir = new Path("tmp/directPDFSearch/" + new Date().getTime());
FileSystem fs;
try {
    fs = FileSystem.get(job);
    while (fs.exists(dir)) {
        dir = new Path("tmp/directPDFSearch/" + new Date().getTime());
    }
    fs.mkdirs(dir);
    int threads = job.getInt("fetcher.threads.fetch", 10);

    // Creation of the rootUrl file
    Path rootUrlDir = new Path(dir + "/rootUrlDir");
    FSDataOutputStream rootUrlWriter = fs.create(rootUrlDir);
    rootUrlWriter.write(rootURL.getBytes());
    rootUrlWriter.close();

    // Paths to our directories
    Path crawlDb = new Path(dir + "/crawldb");
    Path linkDb = new Path(dir + "/linkdb");
    Path segments = new Path(dir + "/segments");
    Path indices = new Path(dir + "/indexes");
    Path index = new Path(dir + "/index");
    Path tmpDir = job.getLocalPath("crawl" + Path.SEPARATOR + new Date().getTime());

    // Tools that we'll need
    Injector injector = new Injector(conf);
    Generator generator = new Generator(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    Fetcher fetcher = new Fetcher(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);
    LinkDb linkDbTool = new LinkDb(conf);
    Indexer indexer = new Indexer(conf);
    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    IndexMerger merger = new IndexMerger(conf);

    // Initialize crawlDb
    injector.inject(crawlDb, rootUrlDir);

    int i;
    for (i = 0; i < depth; i++) {
        /*
         * Segment generation
         */
        if (i == depth - 1) {
            conf = NutchConfiguration.create();
            conf.addDefaultResource("pdf-filter.xml");
            job = new NutchJob(conf);
            generator = new Generator(conf);
            fetcher = new Fetcher(conf);
            crawlDbTool = new CrawlDb(conf);
        }
        Path segment = generator.generate(crawlDb, segments, -1, topN,
                System.currentTimeMillis(), false, false);
        if (segment == null) {
            break;
        }
        /*
         * Fetching process
         */
        fetcher.fetch(segment, threads);
        if (!Fetcher.isParsing(job)) {
            parseSegment.parse(segment);
        }
        /*
         * Crawl database updating
         */
        crawlDbTool.update(crawlDb, new Path[] { segment }, true, true);
    }
    if (i > 0) {
        /*
         * Links inversion
         */
        linkDbTool.invert(linkDb, segments, true, true, false);
        /*
         * Index creation
         */
        indexer.index(indices, crawlDb, linkDb, fs.listPaths(segments));
        dedup.dedup(new Path[] { indices });
        Path[] indicesPaths = fs.listPaths(indices);
        merger.merge(indicesPaths, index, tmpDir);
        result = pdfSearch(dir);
        //fs.delete(dir);
    }
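One detail I am unsure about: if I read the Nutch 1.x Generator API correctly, the generate(...) overload I call above takes a boolean "filter" flag as its sixth argument, and I am currently passing false there. Shown explicitly below as a sketch, in case that interacts with the URL filter:

// The call above, with the arguments spelled out (my reading of the 1.x signature):
// generate(crawlDb, segments, numLists, topN, curTime, filter, force)
Path segment = generator.generate(crawlDb, segments, -1, topN,
        System.currentTimeMillis(), false /* filter */, false /* force */);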
Thank you for reading!