Result Summaries not looking too good

2005-03-30 Thread Byron Miller
I've been converting to the latest release and will pull CVS tonight, but I was wondering if there are scoring tweaks recommended for getting better summaries when doing a non-intranet crawl. Of course I can't find my backup copy of nutch-site from when I had better summaries, and I'm not sure if its

Re: Port Redirect

2005-04-04 Thread Byron Miller
I use mod_jk as well as Squid. At one point I had 3 web servers, ran Tomcat standalone (and Resin), and used a Squid caching server to proxy requests on port 80 (as well as load balance). It just depends on how you want to go. On a single node, mod_jk works best. -byron -Original

Re: SSh command

2005-04-04 Thread Byron Miller
If you have your own servers, I love to use screen. Log in, type screen, and you get a virtual pty. Start bin/nutch fetch $s1 (or whatever command you want); Ctrl-a d detaches the screen, and you can then log out. Log back in, run screen -r, and you resume your screened sessions. You can also

Re: How can I limit my fetching process?

2005-04-14 Thread Byron Miller
Did you make sure to include the filters in your plugin settings (conf/nutch-site.xml)? I must admit, I haven't checked whether the plugin is used during the fetch or only when you update the DB (or both?). -Original Message- From: EM [EMAIL PROTECTED] To:

RE: [Nutch-general] RE: Nutch - new public server

2005-04-15 Thread Byron Miller
To add from my experiences: I've preferred Resin (stability, performance). I always go for more RAM rather than more servers. It's cheaper in the long run when it comes to man hours and service, as well as MTBF for your hardware. Use Squid to proxy/load balance your Java servers. This helped alleviate

Re: [Nutch-general] Re: Converted Search.jsp to OpenSearch XSL

2005-04-15 Thread Byron Miller
[<%=summxml%>]]></description> </item> <% } } %> </channel> </rss> --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: Oh yeah, does anyone have any tips on cleaning up the SUMMARIES so any lingering code, control characters or non-XML-valid characters don't come through
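One way to approach the summary cleanup asked about above, sketched in Python rather than JSP. This is only an illustration under assumptions: `clean_summary` is a hypothetical helper, not part of Nutch, and it simply drops every character outside the XML 1.0 `Char` range and then escapes markup characters.

```python
import re
from xml.sax.saxutils import escape

# Matches any character that the XML 1.0 "Char" production does not allow
# (most control characters, surrogates, U+FFFE and U+FFFF).
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_summary(text):
    """Hypothetical helper: remove XML-invalid characters, then escape &, < and >."""
    stripped = _INVALID_XML_CHARS.sub("", text)
    return escape(stripped)


if __name__ == "__main__":
    print(clean_summary("Nutch \x00search <b>results</b> & more\x0b"))
```

If the summary is written inside a CDATA section instead, the escaping step would not be needed, but any literal `]]>` in the text would still have to be handled.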

list archive (Searchable)

2005-04-15 Thread Byron Miller
Is there still a searchable archive of the mailing list? The old lists on SourceForge are gone and the one on Apache's site is just a flat file of recent subjects. I'm interested in looking up the info on MapReduce and what it does, as well as stuff I missed while I was out :)

Status of map reduce?

2005-04-15 Thread Byron Miller
Doug, all, What is the status of MapReduce? I just finished reading your paper and all of the threads, and I'm drooling over the notion of such a system :)

did you mean feature

2005-04-19 Thread Byron Miller
I haven't seen anything on the list, but is there any code available? I jumped over to the Lucene site, but of course the lists aren't searchable right now (I get an error). In the FAQ I did find a reference to a spell checker based on Lucene/n-grams. Would this be the best way to offer the feature?
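To make the n-gram idea concrete, here is a minimal standalone sketch in Python. It does not use the Lucene spell checker mentioned above; the function names, the Jaccard overlap measure, and the 0.4 threshold are all assumptions chosen for illustration.

```python
def ngrams(word, n=2):
    """Return the set of character n-grams of a word, with boundary markers."""
    padded = "^" + word.lower() + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=2):
    """Jaccard overlap of the two words' n-gram sets (an assumed scoring choice)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def suggest(query, vocabulary, threshold=0.4):
    """Return the closest vocabulary word, or None if the query is already
    known or nothing scores above the (hypothetical) threshold."""
    if query in vocabulary:
        return None
    best_word, best_score = None, threshold
    for candidate in vocabulary:
        score = similarity(query, candidate)
        if score > best_score:
            best_word, best_score = candidate, score
    return best_word


if __name__ == "__main__":
    print(suggest("serch", ["search", "crawl", "index"]))
```

In a real deployment the vocabulary would presumably come from the index terms or from logged user queries, which is the alternative raised later in this thread.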

Re: did you mean feature

2005-04-20 Thread Byron Miller
Cutting [EMAIL PROTECTED] To: nutch-user@incubator.apache.org Date: Tue, 19 Apr 2005 12:42:25 -0700 Subject: Re: did you mean feature Byron Miller wrote: I haven't seen anything on the list, but is there any code available? I jumped over to the Lucene site, but of course the lists aren't

Re: did you mean feature

2005-04-20 Thread Byron Miller
of actual queries users have been doing and construct suggestions based on those. -- Sami Siren Byron Miller wrote: Doug, Thanks for the quick response! I'll take a look at the code and see if I can't come up with something that works. At a quick glance, is this using

Re: [Nutch-general] RE: out of memory exception.

2005-04-23 Thread Byron Miller
If you use the default nutch script, I would set a NUTCH_HEAPSIZE of 2000. That generally works for me, and I have over 100 million URLs in the DB and generally 10 million URLs per segment/index. -byron --- smith learner [EMAIL PROTECTED] wrote: Thanks for your reply. But I guess this solution

Re: [Nutch-general] Terribly slow indexing..

2005-04-24 Thread Byron Miller
and create an index on it? (or am I barking up the wrong tree here?) --- Byron Miller [EMAIL PROTECTED] wrote: I'm not sure what it is, but it seems I can only index about 28-32 pages/sec. While not terribly slow on its own, it did take nearly 30+ hours to index a 4 million page segment. I used
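As a quick sanity check on the figures quoted above, the indexing time for a 4 million page segment at 28-32 pages/sec can be worked out directly. The function name here is just an illustrative choice.

```python
def indexing_hours(pages, pages_per_second):
    """Hours needed to index `pages` documents at a steady rate."""
    return pages / pages_per_second / 3600


if __name__ == "__main__":
    for rate in (28, 32):
        hours = indexing_hours(4_000_000, rate)
        print(f"{rate} pages/sec -> {hours:.1f} hours")
```

At these rates the segment takes roughly 35 to 40 hours, which is consistent with the "30+ hours" reported.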

Re: 2 questions

2005-05-02 Thread Byron Miller
Use Jira to look at existing work in progress or to create a todo/feature request that you can attach your diffs to. http://issues.apache.org/jira/browse/Nutch Hint: Create an account and log in, and you will get the create new issue option; from there you can file bug/feature/todo items and use the

Buckets instead of one large DB?

2005-05-03 Thread Byron Miller
Is it possible to build a bucket or container system that has a size of x and scales to the next bucket once that size has been reached? The issue I have is that a DB with 235 million pages takes FOREVER to do anything on, simply because it makes a duplicate of itself for all processes. Would it
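The bucket idea in this question can be sketched as follows. This is purely hypothetical and says nothing about how the Nutch web DB is actually laid out: `BucketedStore` is an invented name, and capacity is counted in entries rather than bytes to keep the example short.

```python
class BucketedStore:
    """Hypothetical container that rolls over to a new bucket when one fills up."""

    def __init__(self, bucket_capacity):
        if bucket_capacity < 1:
            raise ValueError("bucket_capacity must be at least 1")
        self.bucket_capacity = bucket_capacity
        self.buckets = [[]]  # start with a single empty bucket

    def add(self, url):
        # Open a fresh bucket once the current one has reached capacity.
        if len(self.buckets[-1]) >= self.bucket_capacity:
            self.buckets.append([])
        self.buckets[-1].append(url)

    def bucket_sizes(self):
        return [len(bucket) for bucket in self.buckets]


if __name__ == "__main__":
    store = BucketedStore(bucket_capacity=3)
    for i in range(7):
        store.add(f"http://example.com/page{i}")
    print(store.bucket_sizes())
```

The appeal of such a layout is that an update would only need to rewrite the bucket it touches instead of duplicating the whole DB.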

Re: 2 questions

2005-05-03 Thread Byron Miller
it to the project? I didn't know about jira, so I'll take a look there to see if there is something like it being implemented. best []s Leonardo Barbosa. On 5/2/05, Byron Miller [EMAIL PROTECTED] wrote: Use Jira to look at existing work in progress or to create a todo/feature request

Re: Index Fails

2005-05-06 Thread Byron Miller
The FAT file system may not like some of the names of the index files. Make sure you use FAT32 or a native POSIX/Unix file system; FAT may not like the THOUSANDS of files that are created. In my segments of a few million documents, during the index process I have a few thousand files in the index

Re: [Nutch-general] Re: Pre MapReduce Nutch release?

2005-05-18 Thread Byron Miller
Can't wait to try out the mapred stuff. Good luck in getting that branch up and running :) -Original Message- From: [EMAIL PROTECTED] To: nutch-user@incubator.apache.org Date: Wed, 18 May 2005 09:31:03 -0700 (PDT) Subject: Re: [Nutch-general] Re: Pre MapReduce Nutch release? It all

Re: Hardware requirements and some other questions about Nutch

2005-05-21 Thread Byron Miller
like this. thanks, -byron -Original Message- From: Philippe LE NAOUR [EMAIL PROTECTED] To: nutch-user@incubator.apache.org Date: Sat, 21 May 2005 12:47:38 +0200 Subject: Re: Hardware requirements and some other questions about Nutch Thanks for responding. Byron Miller wrote

Re: Hardware requirements and some other questions about Nutch

2005-05-21 Thread Byron Miller
If you don't run the DB analysis... ;-) Analysis can eat up a terabyte for breakfast. Indeed! We stopped doing db analyze and turned on the scoring per Doug's recommendations; that saved tons of time and resources :) That leaves you enough room for your segments, db and the space needed to

Re: Hardware requirements and some other questions about Nutch

2005-05-22 Thread Byron Miller
-user@incubator.apache.org Date: Sun, 22 May 2005 00:37:55 +0200 Subject: Re: Hardware requirements and some other questions about Nutch Byron Miller wrote: Here is what the great Doug said: Are you using link analysis? Perhaps it is doing you a disservice by prioritizing one site

Re: Please help: Tomcat problem, Paginating with optimization (Like goggle)

2005-05-23 Thread Byron Miller
Not that this fixes your Tomcat issues, but I have nothing but good things to say about Resin. It handles the load really well, is easy to manage, and is pretty lightweight for what it does. I have never had much luck with Tomcat, and believe me, I've tried many times to go back. Just my 2 cents.

Re: [Nutch-general] RE: Please help: Tomcat problem, Paginating with optimization (Likegoggle)

2005-05-24 Thread Byron Miller
The famous quote is "Your mileage may vary." There is an open source version of Resin that you can run (caucho.com). Like I said, I've been running Nutch under Resin for a LONG time. Under Tomcat I had issue after issue. -byron -Original Message- From: [EMAIL PROTECTED] [EMAIL