Not without modifying the code. I don't think it respects <BASE>, for
example, if you crawl it as file:///.
Frankly, if you can, just serve it through the docroot; it will be less
painful in the end!
- Serving URL: you can change it if you know how to set up Tomcat.
Winton
Hi Winton,
I found my problem. I was only editing crawl-urlfilter.txt and not
regex-urlfilter.txt.
Thanks for the help.
I have 2 questions:
After I crawl my files, they will be indexed as file:///x/y/z/.......
Is there any chance I can easily change the link prefix to
http://somesite.com/?
Also, I noticed from the tutorial that I only get one path for Nutch to
serve searches from:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
d. Set Your Searcher Directory
Next, navigate to your Nutch webapp folder, then to WEB-INF/classes. Edit
the nutch-site.xml file and add the following to it (make sure you don't
end up with two sets of <configuration></configuration> tags!):
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>your_crawl_folder_here</value>
  </property>
</configuration>
Can I have Nutch search multiple crawl folders?
Thanks again,
-Ryan
On Sat, Jul 5, 2008 at 7:17 PM, Winton Davies <[EMAIL PROTECTED]>
wrote:
Hi Ryan,
I just used the regular intranet crawl; I didn't try to do the inject.
W
At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
Winton,
I added the override property to nutch-site.xml (I saw the one in
nutch-default.xml after your email), but still no URLs are being added to
the crawldb.
Can you verify this by trying to inject file urls into a test crawl db?
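For example, something along these lines (the directory names are just
placeholders):

bin/nutch inject test_crawl/crawldb urls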
Any other ideas?
-Ryan
On Sat, Jul 5, 2008 at 5:47 PM, Winton Davies <[EMAIL PROTECTED]>
wrote:
Hey Ryan,
There's something else that needs to be set as well; sorry, I forgot about
it.
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
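(This goes in conf/nutch-site.xml. The key addition over the stock value in
nutch-default.xml is protocol-file, the plugin that actually fetches file:
URLs; if I remember right, the default list only includes protocol-http.)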
Hope this helps!
W
Hello,
I tried what Winton said. I generated a file with all the file:///x/y/z
urls, but Nutch won't inject any of them into the crawldb.
I even set the crawl-urlfilter.txt to allow everything:
+.
It seems like ./bin/nutch crawl is reading the file, but it's finding 0
urls to fetch. I tested this with http:// links and they get injected.
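For reference, the invocation is along these lines, with placeholder
directory names:

bin/nutch crawl urls -dir crawl -depth 1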
Is there a plugin or something I can modify to allow file: URLs to be
injected into the crawldb?
Thank you.
-Ryan
On Thu, Jul 3, 2008 at 6:03 PM, Winton Davies <[EMAIL PROTECTED]>
wrote:
Ryan,
You can generate a file of file: URLs, e.g.:
file:///x/y/z/file1.html
file:///x/y/z/file2.html
Use find and awk accordingly to generate this. Put it in the url
directory, set depth to 1, and change crawl-urlfilter.txt to admit
file:///x/y/z/. (Note: if you don't anchor the pattern at the start, it
will apparently try to index directories above the base one using ../
notation. I've only read this, haven't tried it.)
Then just do the intranet crawl example.
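To make that concrete, a minimal sketch of the two pieces (the /x/y/z path
and the seed filename are placeholders for your own tree):

find /x/y/z -name '*.html' | awk '{print "file://" $0}' > urls/seed.txt

and the matching line in crawl-urlfilter.txt:

+^file:///x/y/z/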
NOTE: this will NOT (as far as I can see, no matter how much tweaking) use
ANCHOR TEXT or PageRank (the OPIC version) for any links in these files.
The ONLY way to get those is to use a webserver, as far as I can tell. I
don't understand the logic, but there you are. Note that if you use a
webserver, you will have to disable the ignore-internal-links setting in
nutch-site.xml (you'll be messing around a lot in there).
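If I remember right, that setting is the db.ignore.internal.links property
(check nutch-default.xml for the exact name and default); overriding it in
nutch-site.xml would look like:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>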
Cheers,
Winton
At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
Is there a simple way to have Nutch index a folder full of other folders
and HTML files?
I was hoping to avoid having to run Apache to serve the HTML files and
then have Nutch crawl the site on Apache.
Thank you,
-Ryan