Nutch-based Application for Windows

2009-03-17 Thread John Whelan
Hi All, For fun, I created a Windows-based installer for Nutch and added an administrative GUI to it. If interested, you can grab it from FreewareFiles (http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html). Regards, John -- View this message in context:

Re: Nutch-based Application for Windows

2009-03-22 Thread John Whelan
shortly via http://www.freewarefiles.com, http://www.download.com, and http://www.brothersoft.com. John Whelan wrote: Hi All, For fun, I created a Windows-based installer for Nutch and added an administrative GUI to it. If interested, you

Re: Nutch-based Application for Windows

2009-03-23 Thread John Whelan
Hello, The UI is Swing-based, but most of the code for the actions is in the shell scripts run in Cygwin. In front of the GUI there is a startup BAT file, but for the most part, everything in this would easily port to Linux or UNIX environments. When writing this, I kept the idea in mind that

Re: Nutch-based Application for Windows

2009-03-23 Thread John Whelan
BTW, the new version was just approved by FreewareFiles.com. You can download it from them (http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html) instead of from my site. (I like that they check it out to make sure it is clean.) -- View this message in context:

Sizing Guide?

2009-04-11 Thread John Whelan
Is there a sizing guide for Nutch? I've looked around a bit, but have not seen one. The basic questions that I'm looking to answer are sizing memory, CPU, and disk space for crawling. I am also interested to know about the sizes of the largest working implementations. (~reference accounts?)

Nutch-based Application for Windows

2009-04-17 Thread John Whelan
Hello All, A while back I wrote a wrapper for Nutch to allow for easy install and administration on Windows. The other day I submitted a new, 2.0 version of this application to FreewareFiles. The 2.0 version has the following improvements: Added GUI support for HTTP proxies. Added GUI support

Re: Nutch-based Application for Windows

2009-04-18 Thread John Whelan
All, The previous message contained an error... The description is actually at http://www.whelanlabs.com/content/SearchEngineManager20.htm . Regards, John -- View this message in context: http://www.nabble.com/Nutch-based-Application-for-Windows-tp23108894p23118490.html Sent from the Nutch -

Re: Nutch-based Application for Windows

2009-05-26 Thread John Whelan
Otis, Thanks for the compliment. I'd be happy to add mention of the wrapper to the Nutch wiki, but I'm not completely sure that it is appropriate... While it is a decent wrapper, it isn't an Apache product, and I'd not like to be seen as inappropriately promoting my piece of software. Any idea as

Re: Nutch-based Application for Windows

2009-05-30 Thread John Whelan
Happy to do it... I'll add to the wiki when I get back from vacation. -- View this message in context: http://www.nabble.com/Nutch-based-Application-for-Windows-tp23108894p23796723.html Sent from the Nutch - User mailing list archive at Nabble.com.

Nutch-based Application for Windows - New Release

2009-10-14 Thread John Whelan
All, Version 2.1 has now been released. This version adds the following: 1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build 2009-09-09). 2. Updated Tomcat from 6.0.16 to 6.0.20. 3. Fixed bugs related to running in non-English locales. 4. Fixed bug in uninstaller. (Improved

no results for local file crawls?

2009-11-07 Thread John Whelan
Hello, I'm trying to crawl the local filesystem. It appears that the crawl is successful, but later searches don't display the content. During the crawl, I see the following: ... fetching file:///c:/test/test.txt fetching http://www.cnn.com/ ... I know from this that it is finding the file

Re: No search results

2009-11-07 Thread John Whelan
By any chance is your WAR file (used for the web server) from a slightly different version of Nutch than your crawler? I have seen that this results in the Nutch page showing up, but no results are listed. Another possibility is that you are not launching your search engine from the correct

Re: How to make nutch crawl within a sub category of an URL?

2009-11-07 Thread John Whelan
If it were me, I'd try the following... Use 'http://answers.yahoo.com/dir/index;_ylt=AqH5s00Y0dXDEjwmdUrxNabpy6IX;_ylv=3?link=listsid=396545660' as a starting point URL, and set up the following filtering rules (crawl-urlfilter.txt):

Re: can Nutch crawl XLS and XLSX file???

2009-11-07 Thread John Whelan
Nutch can index MS Word, MS PowerPoint, MS Excel, and PDF files. In order for these types to be crawled, you need to have the plugins specified in the plugin.includes value of nutch/conf/nutch-site.xml (values are 'parse-(msexcel|mspowerpoint|msword|pdf)'). I was not sure if the new XLSX format
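As a rough illustration of how a plugin.includes value like the one above selects plugins, here is a minimal Python sketch. It assumes the value is treated as a regular expression that must match a plugin's whole id; the helper name `is_included` and the sample ids other than the four parse plugins named in the message are made up for the example, and this is not Nutch's own code.

```python
import re

# Value quoted in the message above (only the parse-* part is shown there).
PLUGIN_INCLUDES = r"parse-(msexcel|mspowerpoint|msword|pdf)"


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True when the whole plugin id matches the includes expression.

    Assumption: the expression is applied to the complete id, not a substring.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Sample ids for illustration only.
for plugin_id in ["parse-msexcel", "parse-pdf", "parse-html", "index-basic"]:
    print(plugin_id, is_included(plugin_id))
```

In a real nutch-site.xml the parse-* alternatives would normally be one part of a longer plugin.includes expression that also lists the protocol, index, and query plugins in use.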

Re: What are the configuration parameters to fine tune Nutch performance

2009-11-07 Thread John Whelan
The default tuning parameters are specified in nutch/conf/nutch-default.xml, and can be overridden in nutch/conf/nutch-site.xml. (Or in the crawl command line, but I believe that the 'best practice' is to configure settings in nutch-site.xml.) My personal belief is that the two most valuable
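The override relationship between the two files can be pictured with a small Python sketch. It only models "site values replace default values with the same name" and ignores everything else the real configuration loader does; the property values shown are placeholders chosen for the example, not the shipped defaults, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Stand-in for nutch/conf/nutch-default.xml (values are placeholders).
DEFAULT_XML = """<configuration>
  <property><name>fetcher.threads.fetch</name><value>10</value></property>
  <property><name>http.timeout</name><value>10000</value></property>
</configuration>"""

# Stand-in for nutch/conf/nutch-site.xml (values are placeholders).
SITE_XML = """<configuration>
  <property><name>fetcher.threads.fetch</name><value>20</value></property>
</configuration>"""


def load_properties(xml_text: str) -> dict:
    """Read <property><name>/<value> pairs from a configuration document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}


def effective_config(default_xml: str, site_xml: str) -> dict:
    """Start from the defaults, then let site settings replace same-named ones."""
    merged = load_properties(default_xml)
    merged.update(load_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
print(config)
```

Properties that nutch-site.xml does not mention keep their default value, which is why only the settings being tuned need to appear there.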

Re: Nutch does not crawl pages starting with ~

2009-11-11 Thread John Whelan
Maybe try using '%7e' instead of '~'? For example: www.cs.umbc.edu/%7evarish1 -- View this message in context: http://old.nabble.com/Nutch-does-not-crawl-pages-starting-with-%7E-tp26312379p26313265.html Sent from the Nutch - User mailing list archive at Nabble.com.
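The substitution suggested above is plain percent-encoding: '%7e' is the escaped form of '~'. A minimal Python sketch follows; the helper name `encode_tilde` is hypothetical, and the standard library's `unquote` is used only to confirm that both spellings decode to the same URL.

```python
from urllib.parse import unquote


def encode_tilde(url: str) -> str:
    """Replace each literal '~' with its percent-encoded form '%7e'."""
    return url.replace("~", "%7e")


original = "www.cs.umbc.edu/~varish1"
encoded = encode_tilde(original)

print(encoded)
# Decoding the escaped form gives back the original spelling.
print(unquote(encoded) == original)
```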

Re: Stopping at depth=0 - no more URLs to fetch

2009-11-11 Thread John Whelan
Any other rules in your filter that precede that one? (+^http://([a-z0-9]*\.)*blogspot.com/) -- View this message in context: http://old.nabble.com/Stopping-at-depth%3D0---no-more-URLs-to-fetch-tp26310955p26313305.html Sent from the Nutch - User mailing list archive at Nabble.com.
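Why earlier rules matter can be shown with a short Python sketch of a first-match-wins filter. It assumes the behaviour described in the comments of Nutch's regex filter files: rules are tried in order, the first pattern found in the URL decides, and a URL matching nothing is ignored. The functions `parse_rules` and `accepts` are hypothetical, the `-[?*!@=]` rule is included as an example of an earlier rule, and Python's `re` stands in for Java regular expressions.

```python
import re


def parse_rules(lines):
    """Turn '+pattern' / '-pattern' lines into (include, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: no matching rule means the URL is ignored


rules = parse_rules([
    "# an earlier rule that rejects URLs containing query-like characters",
    "-[?*!@=]",
    r"+^http://([a-z0-9]*\.)*blogspot.com/",
])

for url in [
    "http://example.blogspot.com/",
    "http://example.blogspot.com/search?q=nutch",
    "http://www.cnn.com/",
]:
    print(url, accepts(url, rules))
```

The second URL matches the blogspot rule, but the earlier exclusion rule sees the '?' first and rejects it, which is the kind of interaction the question above is probing for.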

Re: no results for local file crawls?

2009-11-12 Thread John Whelan
Well, I found the sources of my problem... For starters, it appears that directories must be specified as starting point URLs, not specific files; if files are specified, they seem to be ignored. Also, when specifying directories, the traversal depth must be set to account for the directory as a