Winton, I added the override property to nutch-site.xml (I saw the one in nutch-default.xml after your email), but still no URLs are being added to the crawldb. Can you verify this by trying to inject file URLs into a test crawldb? Any other ideas?
-Ryan

On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> Hey Ryan,
>
> There's something else that needs to be set as well - sorry I forgot about it.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> Hope this helps!
>
> W
>
>> Hello,
>> I tried what Winton said. I generated a file with all the file:///x/y/z
>> URLs, but Nutch won't inject any into the crawldb.
>> I even set crawl-urlfilter.txt to allow everything:
>> +.
>> It seems like ./bin/nutch crawl is reading the file, but it's finding 0
>> URLs to fetch. I tested this on http:// links and they do get injected.
>> Is there a plugin or something I can modify to allow file URLs to be
>> injected into the crawldb?
>> Thank you.
>> -Ryan
>>
>> On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Ryan,
>>>
>>> You can generate a file of FILE URLs, e.g.:
>>>
>>> file:///x/y/z/file1.html
>>> file:///x/y/z/file2.html
>>>
>>> Use find and AWK accordingly to generate this. Put it in the url
>>> directory and just set depth to 1, and change crawl-urlfilter.txt to
>>> admit file:///x/y/z/. (Note: if you don't anchor it at the start, it
>>> will apparently try to index directories above the base one, using ../
>>> notation. I only read this; I haven't tried it.)
>>>
>>> Then just do the intranet crawl example.
>>>
>>> NOTE: this will NOT (as far as I can see, no matter how much tweaking)
>>> use ANCHOR TEXT or PageRank (OPIC version) for any links in these files.
>>> The ONLY way to do that is to use a webserver, as far as I can tell. I
>>> don't understand the logic, but there you are. Note: if you use a
>>> webserver, be aware you will have to disable the IGNORE.INTERNAL setting
>>> in nutch-site.xml (you'll be messing around a lot in there).
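Winton's find-and-AWK suggestion can be sketched roughly like this. This is a hedged sketch, not a tested Nutch recipe: the /tmp/seedtest tree below is a hypothetical stand-in for the /x/y/z placeholder used in the thread.

```shell
# Build a seed list of file:// URLs from a directory tree.
# /tmp/seedtest/x/y/z is a made-up base directory for illustration.
mkdir -p /tmp/seedtest/x/y/z
touch /tmp/seedtest/x/y/z/file1.html /tmp/seedtest/x/y/z/file2.html

# find emits absolute paths like /tmp/seedtest/x/y/z/file1.html;
# prefixing "file://" yields the three-slash form file:///tmp/...
find /tmp/seedtest/x/y/z -name '*.html' \
  | awk '{ print "file://" $0 }' > /tmp/seedtest/urls.txt
```

The resulting urls.txt would then go into the url directory that the intranet crawl example reads its seeds from.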
>>>
>>> Cheers,
>>> Winton
>>>
>>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>
>>>> Is there a simple way to have Nutch index a folder full of other
>>>> folders and HTML files?
>>>>
>>>> I was hoping to avoid having to run Apache to serve the HTML files,
>>>> and then have Nutch crawl the site on Apache.
>>>>
>>>> Thank you,
>>>> -Ryan
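Pulling the thread's advice together, a start-anchored crawl-urlfilter.txt along the lines Winton describes might look like this (the /x/y/z base directory is the thread's placeholder, and this fragment is an assumption about a working setup, not a confirmed one):

```
# Hypothetical crawl-urlfilter.txt fragment: admit only file: URLs
# under the base directory. Anchoring the pattern with ^ is what
# prevents the ../ escapes into parent directories Winton mentions.
+^file:///x/y/z/

# reject everything else
-.
```

With protocol-file added to plugin.includes and the seed list in the url directory, the intranet crawl example would then be invoked with something like `bin/nutch crawl urls -dir crawl -depth 1`, per Winton's suggestion to set depth to 1.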
