Hi All,
For fun, I created a Windows-based installer for Nutch and added an
administrative GUI to it. If interested, you can grab it from FreewareFiles:
http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html
Regards,
John
--
View this message in context:
shortly via http://www.freewarefiles.com, http://www.download.com, and
http://www.brothersoft.com.
John Whelan wrote:
Hi All,
For fun, I created a Windows-based installer for Nutch and added an
administrative GUI to it. If interested, you
Hello,
The UI is Swing-based, but most of the code for the actions is in the shell
scripts run in Cygwin. In front of the GUI there is a startup BAT file, but
for the most part, everything in this would easily port to Linux or UNIX
environments. When writing this, I kept the idea in mind that
BTW, the new version was just approved by FreewareFiles.com. You can
download it from them
(http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html)
instead of from my site. (I like that they check it out to make sure it is
clean.)
Is there a sizing guide for Nutch? I've looked around a bit, but have not
seen one. The basic questions that I'm looking to answer are sizing memory,
CPU and Disk Space for crawling. I am also interested to know about the
sizes of the largest working implementations. (~reference accounts?)
Hello All,
A while back I wrote a wrapper for Nutch to allow for easy install and
administration on Windows. The other day I submitted a new, 2.0 version of
this application to FreewareFiles. The 2.0 version has the following
improvements:
Added GUI support for HTTP proxies.
Added GUI support
All,
The previous message contained an error... The description is actually at
http://www.whelanlabs.com/content/SearchEngineManager20.htm .
Regards,
John
--
View this message in context:
http://www.nabble.com/Nutch-based-Application-for-Windows-tp23108894p23118490.html
Sent from the Nutch -
Otis,
Thanks for the compliment. I'd be happy to add mention of the wrapper to the
Nutch wiki, but I'm not completely sure that it is appropriate... While it
is a decent wrapper, it isn't an Apache product, and I'd not like to be seen
as inappropriately promoting my piece of software. Any idea as
Happy to do it... I'll add to the wiki when I get back from vacation.
--
View this message in context:
http://www.nabble.com/Nutch-based-Application-for-Windows-tp23108894p23796723.html
Sent from the Nutch - User mailing list archive at Nabble.com.
All,
Version 2.1 has now been released. This version adds the following:
1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build
2009-09-09)
2. Updated Tomcat from 6.0.16 to 6.0.20.
3. Fixed bugs related to running in non-English locales.
4. Fixed bug in uninstaller. (Improved
Hello,
I'm trying to crawl the local filesystem. It appears that the crawl is
successful, but later searches don't display the content. During the crawl,
I see the following:
...
fetching file:///c:/test/test.txt
fetching http://www.cnn.com/
...
I know from this that it is finding the file
By any chance is your WAR file (used for the web server) from a slightly
different version of Nutch than your crawler? I have seen that this results
in the Nutch page showing up, but no results are listed. Another possibility
is that you are not launching your search engine from the correct
If it were me, I'd try the following...
Use
'http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=listsid=396545660'
as a starting point URL, and set up the following filtering rules
(crawl-urlfilter.txt):
Nutch can index MS Word, MS PowerPoint, MS Excel, and PDF files. In order for
these types to be crawled, you need to have the plugins specified in the
plugin.includes value of nutch/conf/nutch-site.xml (values are
'parse-(msexcel|mspowerpoint|msword|pdf)'.)
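As a rough illustration of how such a value selects plugins, here is a minimal Python sketch. It assumes the plugin.includes value is treated as a regular expression that must match a whole plugin ID; the ID list is chosen only for illustration, and a real plugin.includes value would normally name other plugins too.

```python
import re

# The fragment quoted above. This is only the parser part of a
# plugin.includes value, not a complete setting.
PLUGIN_INCLUDES = r"parse-(msexcel|mspowerpoint|msword|pdf)"


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the whole plugin ID matches the includes expression.

    Assumption: the expression must match the full ID, not just a prefix.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Plugin IDs used only to exercise the expression.
for plugin_id in ["parse-pdf", "parse-msword", "parse-html", "parse-pdfx"]:
    print(plugin_id, is_included(plugin_id))
```

If a document type is being skipped, checking its parser ID against the expression this way is a quick first test.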
I was not sure if the new XLSX format
The default tuning parameters are specified in nutch/conf/nutch-default.xml,
and can be overridden in nutch/conf/nutch-site.xml. (Or in the crawl command
line, but I believe that the 'best practice' is to configure settings in
nutch-site.xml.)
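To make the override relationship concrete, here is a small Python sketch that overlays a site file on a defaults file. The XML snippets are inline stand-ins for the two files, and the property values are illustrative assumptions, not the values Nutch ships with. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (values are illustrative only).
DEFAULT_XML = """<configuration>
  <property><name>fetcher.threads.fetch</name><value>10</value></property>
  <property><name>http.agent.name</name><value></value></property>
</configuration>"""

# Stand-in for nutch-site.xml: only the settings you want to change.
SITE_XML = """<configuration>
  <property><name>http.agent.name</name><value>ExampleCrawler</value></property>
</configuration>"""


def read_properties(xml_text):
    """Parse <property><name>/<value> pairs into a dict of strings."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        props[name] = prop.findtext("value", default="")
    return props


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values win on conflicts."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name in sorted(config):
    print(name, "=", repr(config[name]))
```

Keeping only the changed properties in the site file makes it easy to see what differs from the defaults.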
My personal belief is that the two most valuable
Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1
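If there are many such URLs in a seed list, the substitution can be scripted. A minimal Python sketch (the helper name is hypothetical); a plain string replace is used because urllib.parse.quote leaves '~' unescaped in Python 3.7 and later:

```python
from urllib.parse import unquote


def encode_tilde(url):
    """Replace each '~' with its percent-encoded form '%7e'."""
    return url.replace("~", "%7e")


original = "www.cs.umbc.edu/~varish1"
encoded = encode_tilde(original)
print(encoded)  # www.cs.umbc.edu/%7evarish1

# Both spellings decode to the same path, so the server should treat
# them as the same resource.
print(unquote(encoded) == original)  # True
```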
--
View this message in context:
http://old.nabble.com/Nutch-does-not-crawl-pages-starting-with-%7E-tp26312379p26313265.html
Any other rules in your filter that precede that one?
(+^http://([a-z0-9]*\.)*blogspot.com/)
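The reason rule order matters can be sketched in Python. This assumes first-match-wins evaluation: rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL that matches nothing is dropped. The earlier '-' rule and the test URLs are hypothetical, added only to show how a preceding rule can shadow the blogspot one.

```python
import re

# Hypothetical filter file: one earlier rule, then the rule from above.
RULES_TEXT = r"""
# skip URLs that look like queries (hypothetical earlier rule)
-[?*!@=]
+^http://([a-z0-9]*\.)*blogspot.com/
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern occurs in the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: no matching rule means the URL is ignored


rules = parse_rules(RULES_TEXT)
for url in [
    "http://example.blogspot.com/",
    "http://example.blogspot.com/search?q=nutch",
    "http://www.cnn.com/",
]:
    print(url, accepts(rules, url))
```

In this sketch the second URL is rejected by the earlier '-' rule before the blogspot rule is ever consulted.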
--
View this message in context:
http://old.nabble.com/Stopping-at-depth%3D0---no-more-URLs-to-fetch-tp26310955p26313305.html
Well, I found the sources of my problem...
For starters, it appears that directories must be specified as starting
point URLs, not specific files; if files are specified, they seem to be
ignored. Also, when specifying directories, the traversal depth must be set
to account for the directory as a