Hi, the reason is clearly in the URL filters. The single injected URL does not pass the filter:
> InjectorJob: total number of urls rejected by filters: 1
> InjectorJob: total number of urls injected after normalization and filtering: 0

Please check which URL filters are activated via the property plugin.includes,
and check all configuration files of the active URL filters.

There is also a useful tool:

  bin/nutch org.apache.nutch.net.URLFilterChecker

Cheers,
Sebastian

On 03/10/2015 10:40 AM, Siddhartha Sandhu wrote:
> Hi,
>
> I am running the command:
>
> root@ubuntu:/usr/lib/nutch/nutch/runtime/local/bin# ./nutch inject ../../../urls/
> InjectorJob: starting at 2015-03-10 02:24:40
> InjectorJob: Injecting urlDir: ../../../urls
> InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 1
> InjectorJob: total number of urls injected after normalization and filtering: 0
> Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08
>
> My "../../../urls/" contains a txt file with value:
> http://www.yahoo.com
>
> My regex-urlfilter.txt is:
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$
> +\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)
> # skip URLs containing certain characters as probable queries, etc.
> -.*[*!@].*
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.*
>
> My nutch-site.xml contains:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>My Nutch Spider</value>
>   </property>
>
>   <property>
>     <name>storage.data.store.class</name>
>     <value>org.apache.gora.hbase.store.HBaseStore</value>
>     <description>Default class for storing data</description>
>   </property>
>
> </configuration>
>
> Log entry for corresponding run in nutch/runtime/local/logs/hadoop.log is:
>
> 2015-03-10 02:24:46,429 WARN snappy.LoadSnappy - Snappy native library not loaded
> 2015-03-10 02:24:47,884 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2015-03-10 02:24:47,900 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2015-03-10 02:24:48,949 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 1
> 2015-03-10 02:24:48,951 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
> 2015-03-10 02:24:48,952 INFO crawl.InjectorJob - Injector: finished at 2015-03-10 02:24:48, elapsed: 00:00:08
>
> Hbase scan at this point:
>
>> scan 'hbase'
>
> ROW                COLUMN+CELL
>
> 0 row(s) in 0.0090 seconds
>
> Also, I am using ubuntu and version of Nutch is 2.3.
>
> I need help identifying the part where I could be missing something critical information in the
> documentation or pointer to where things could be going wrong.
>
> Thank You!
>
> Sid.
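
PS: As a rough cross-check of the regex-urlfilter.txt rules quoted in the original message, here is a minimal Python sketch of first-match-wins rule evaluation. It is not Nutch code. It assumes the regex filter applies rules in file order, that the first rule whose pattern is found anywhere in the URL decides (leading "+" accepts, "-" rejects), and that a URL matching no rule is rejected. It also uses Python's re module in place of Java regex, which should behave the same for these particular patterns but is still an assumption. The function name first_matching_rule is made up for illustration.

```python
import re

# Rules copied from the quoted regex-urlfilter.txt, in file order.
# The first character is the sign ("+" accept, "-" reject); the rest is the regex.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG"
    r"|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$",
    r"+\.(JPG|jpg|PNG|png|jpeg|JPEG|BMP|bmp)",
    r"-.*[*!@].*",
    r"-.*(/[^/]+)/[^/]+\1/[^/]+\1/",
    r"+.*",
]


def first_matching_rule(url, rules=RULES):
    """Return (accepted, rule) for the first rule whose pattern is found in url.

    Hypothetical helper: mimics first-match-wins evaluation with a partial
    (search-style) match. Returns (False, None) if no rule matches.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+", rule
    return False, None


if __name__ == "__main__":
    for url in ["http://www.yahoo.com", "ftp://example.org/", "http://example.org/a.js"]:
        accepted, rule = first_matching_rule(url)
        print(("+" if accepted else "-") + url, "decided by:", rule)
```

Under these assumptions the seed http://www.yahoo.com falls through to the final "+.*" rule and is accepted. If that holds for the real filter as well, the rejection would have to come from another active URL filter, or from a different copy of the rules file than the one quoted, which is why checking plugin.includes and every active filter's configuration is the next step.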

