Filter rejecting url

Siddhartha Sandhu Tue, 10 Mar 2015 22:30:52 -0700

Hi,

I am running the command:


root@ubuntu:/usr/lib/nutch/nutch/runtime/local/bin# ./nutch inject ../../../urls/
InjectorJob: starting at 2015-03-10 02:24:40
InjectorJob: Injecting urlDir: ../../../urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 1
InjectorJob: total number of urls injected after normalization and filtering: 0
Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08

My "../../../urls/" directory contains a text file with the single line:
http://www.yahoo.com

My regex-urlfilter.txt is:


# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
+\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)
# skip URLs containing certain characters as probable queries, etc.
-.*[*!@].*
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.*
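
A minimal sketch for sanity-checking the rules above outside of Nutch. It assumes (my understanding of the urlfilter-regex plugin, not something confirmed in this thread) that rules are tried top to bottom, that the first rule whose pattern is found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. It uses Python's `re` rather than Java's regex engine, which should behave the same for these particular patterns. The helper names `parse_rules` and `first_match` are hypothetical.

```python
import re

# Rules copied verbatim from the regex-urlfilter.txt quoted above.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
+\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)
-.*[*!@].*
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.*
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def first_match(url, rules):
    """Return (accepted, pattern) for the first rule found in url.

    Assumption: first match wins, and no match at all means rejection.
    """
    for accept, regex in rules:
        if regex.search(url):
            return accept, regex.pattern
    return False, None


rules = parse_rules(RULES_TEXT)
for url in ["http://www.yahoo.com", "ftp://example.com/file", "http://example.com/a.jpg"]:
    accepted, pattern = first_match(url, rules)
    print(url, "ACCEPT" if accepted else "REJECT", "by", pattern)
```

Under those assumptions, http://www.yahoo.com is not caught by any of the "-" rules and falls through to the final "+.*" rule, so the rules as pasted here may not be what rejects it.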



My nutch-site.xml contains:




<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
   <name>http.agent.name</name>
   <value>My Nutch Spider</value>
</property>

<property>
   <name>storage.data.store.class</name>
   <value>org.apache.gora.hbase.store.HBaseStore</value>
   <description>Default class for storing data</description>
</property>

</configuration>

The log entries for the corresponding run in nutch/runtime/local/logs/hadoop.log are:



2015-03-10 02:24:46,429 WARN  snappy.LoadSnappy - Snappy native library not loaded
2015-03-10 02:24:47,884 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2015-03-10 02:24:47,900 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2015-03-10 02:24:48,949 INFO  crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 1
2015-03-10 02:24:48,951 INFO  crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2015-03-10 02:24:48,952 INFO  crawl.InjectorJob - Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08


HBase scan at this point:

scan 'hbase'

ROW COLUMN+CELL
0 row(s) in 0.0090 seconds



Also, I am using Ubuntu, and the Nutch version is 2.3.


I need help identifying what critical information I might be missing in the 
documentation, or a pointer to where things could be going wrong.


Thank You!

Sid.
