Hi Winton,

I found my problem: I was only editing crawl-urlfilter.txt and not
regexp-urlfilter.txt.
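In case anyone else hits this, the rule I added to both filter files is
something like the following (the path is just my example; adjust it to
wherever your files live):

  +^file:///x/y/z/

As far as I can tell, the crawl tool's config points at
crawl-urlfilter.txt while the urlfilter-regex plugin reads
regexp-urlfilter.txt by default, so to be safe the pattern needs to be
in both.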
Thanks for the help. I have two questions:

1. After I crawl my files, they will be indexed with file:///x/y/z/...
links. Is there any chance I can easily change the link prefix to
http://somesite.com/ ?

2. I noticed from the tutorial that I only get one path for Nutch to
serve searches from:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm

  d. Set Your Searcher Directory
  Next, navigate to your nutch webapp folder, then WEB-INF/classes.
  Edit the nutch-site.xml file and add the following to it (make sure
  you don't have two sets of <configuration></configuration> tags!):

  <configuration>
    <property>
      <name>searcher.dir</name>
      <value>your_crawl_folder_here</value>
    </property>
  </configuration>

Can I have Nutch search multiple crawl folders?
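(On question 2: the searcher.dir description in nutch-default.xml also
mentions a search-servers.txt file for distributed search, so I'm
guessing one search server per crawl folder might work - something like
the below, with searcher.dir pointed at the directory holding
search-servers.txt - but I haven't tried it, and the ports and paths
here are made up:

  search-servers.txt:
    localhost 9991
    localhost 9992

  bin/nutch server 9991 /path/to/crawl1
  bin/nutch server 9992 /path/to/crawl2
)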
Thanks again,
-Ryan

On Sat, Jul 5, 2008 at 7:17 PM, Winton Davies <[EMAIL PROTECTED]> wrote:

> Hi Ryan,
>
> I just used the regular intranet crawl, didn't try to do the inject.
>
> W
>
> At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
>
>> Winton,
>> I added the override property to nutch-site.xml (I saw the one in
>> nutch-default.xml after your email), but still no URLs are being
>> added to the crawldb.
>> Can you verify this by trying to inject file URLs into a test
>> crawldb? Any other ideas?
>>
>> -Ryan
>>
>> On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <[EMAIL PROTECTED]>
>> wrote:
>>
>>> Hey Ryan,
>>>
>>> There's something else that needs to be set as well - sorry I
>>> forgot about it:
>>>
>>> <property>
>>>   <name>plugin.includes</name>
>>>   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> Hope this helps!
>>>
>>> W
>>>
>>>> Hello,
>>>> I tried what Winton said. I generated a file with all the
>>>> file:///x/y/z URLs, but Nutch won't inject any into the crawldb.
>>>> I even set crawl-urlfilter.txt to allow everything: +.
>>>> It seems like ./bin/nutch crawl is reading the file, but it's
>>>> finding 0 URLs to fetch. I tested this with http:// links and they
>>>> get injected.
>>>> Is there a plugin or something I can modify to allow file URLs to
>>>> be injected into the crawldb?
>>>> Thank you.
>>>> -Ryan
>>>>
>>>> On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Ryan,
>>>>>
>>>>> You can generate a file of FILE urls, e.g.
>>>>>
>>>>> file:///x/y/z/file1.html
>>>>> file:///x/y/z/file2.html
>>>>>
>>>>> Use find and awk accordingly to generate this (see the sketch
>>>>> below). Put it in the url directory and just set depth to 1, and
>>>>> change crawl-urlfilter.txt to admit file:///x/y/z/. (Note: if you
>>>>> don't head-qualify it, it will apparently try to index
>>>>> directories above the base one, using ../ notation. I only read
>>>>> this, haven't tried it.)
>>>>>
>>>>> Then just do the intranet crawl example.
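>>>>> (Sketch for the find/awk step - untested, and assuming your pages
>>>>> all live under /x/y/z and end in .html; the seed-file name is up
>>>>> to you:
>>>>>
>>>>>   find /x/y/z -name '*.html' | awk '{print "file://" $0}' > urls/seeds.txt
>>>>>
>>>>> find prints absolute paths, so prefixing file:// yields the
>>>>> file:///x/y/z/... form above.)
>>>>>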
>>>>> NOTE: this will NOT (as far as I can see, no matter how much
>>>>> tweaking) use ANCHOR TEXT or PageRank (the OPIC version) for any
>>>>> links in these files. The ONLY way to do that is to use a
>>>>> webserver, as far as I can tell. I don't understand the logic,
>>>>> but there you are. Also note, if you use a webserver, be aware
>>>>> you will have to disable the db.ignore.internal.links setting in
>>>>> nutch-site.xml (you'll be messing around a lot in here).
>>>>>
>>>>> Cheers,
>>>>> Winton
>>>>>
>>>>> At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
>>>>>
>>>>>> Is there a simple way to have Nutch index a folder full of other
>>>>>> folders and HTML files?
>>>>>>
>>>>>> I was hoping to avoid having to run Apache to serve the HTML
>>>>>> files, and then have Nutch crawl the site on Apache.
>>>>>>
>>>>>> Thank you,
>>>>>> -Ryan