Thanks very much for your help Stefan. Unfortunately, I think I may have
unwittingly misled you. It seems that the distributed search aspect of
my setup was probably irrelevant for the problem I was experiencing.
It turns out that I had been using jakarta-tomcat-5.0.28 since I had
integrated tomcat into Apache as per the instructions on this page:
http://www.meritonlinesystems.com/docs/apache_tomcat_redhat.html which
required using Tomcat 5.0.28. My distributed search setup was therefore
using a mixture of Tomcat 5.0.28 and Tomcat 4.1.31 and my local search
setup was using Tomcat 4.1.31 only. So, the actual servers in the
distributed configuration were using Tomcat 4.1.31 and it turns out that
the error was occurring on the interface machine (the one using Tomcat
5.0.28), where the catalina logs showed that the query was already
corrupted as soon as it was received from the browser. In other words,
it's hardly surprising that the distributed searchers returned garbage,
given the garbled input query.
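To illustrate the kind of corruption I mean, here is a minimal sketch of
one way a query can get garbled on arrival. It assumes (I have not
confirmed that this is what Tomcat 5.0.28 actually does) that the
percent-encoded UTF-8 bytes of the query are decoded with a single-byte
charset such as ISO-8859-1. The class and method names are hypothetical
and are not part of Nutch or Tomcat.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Nutch or Tomcat.
class QueryDecodingSketch {

    // Percent-decode a raw query value, interpreting its bytes in the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Undo a wrong ISO-8859-1 decode: recover the original bytes, then read them as UTF-8.
    static String repair(String garbled) {
        try {
            return new String(garbled.getBytes("ISO-8859-1"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The query value the browser sends for the two-Kanji string for Tokyo.
        String raw = "%E6%9D%B1%E4%BA%AC";

        String good = decode(raw, "UTF-8");      // the intended two characters
        String bad = decode(raw, "ISO-8859-1");  // one character per byte: garbled

        System.out.println("UTF-8 decode length: " + good.length());
        System.out.println("ISO-8859-1 decode length: " + bad.length());
        System.out.println("repaired equals UTF-8 decode: " + repair(bad).equals(good));
    }
}
```

With the wrong charset the two Kanji become six unrelated characters,
which is the sort of input that would make any searcher return garbage.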
Anyway, the Nutch tutorial says we should use Tomcat 4.x and, lo and
behold, doing so everywhere causes no problems. Is there any known
reason why Tomcat 5.x has problems?
Sticking with Tomcat 4.x, does anyone have (or has anyone seen)
instructions on integrating Tomcat 4.x with Apache 2 successfully?
-Ed
Stefan Groschupf wrote:
Ed,
it is definitely not an encoding problem with RPC calls.
The following test passes on my box. It would be interesting to find the
problem, but setting up a distributed system to verify it is too
time-consuming.
Can you try using the latest sources and check whether this still occurs?
I will read some more code and see if I can find anything that looks
like a problem.
It would be great if someone from the community could verify whether
this is really a bug and whether it is reproducible.
That search results differ when using distributed search is a known
problem (see JIRA).
Can you provide a second Tomcat running on another port, or maybe just
another Tomcat context running a Nutch UI pointing to a local index?
Stefan
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.ipc;

import java.lang.reflect.Method;
import java.net.InetSocketAddress;

import junit.framework.TestCase;

import org.apache.nutch.io.UTF8;

/** Checks that a non-ASCII string survives a round trip through an RPC call. */
public class TestEncoding extends TestCase {

  private int PORT = 50232;

  private String TEXT = "座頭市"; // no idea what this means :)

  public void testEncoding() throws Exception {
    // Start a local RPC server that echoes its argument back.
    Server server = RPC.getServer(new HelloWorld(), PORT);
    server.start();

    Method method = HelloWorld.class.getMethod("helloWorld",
        new Class[] { UTF8.class });

    // One parameter list (for the single server), holding one UTF8 argument.
    Object[][] parameter = new Object[1][1];
    parameter[0][0] = new UTF8(TEXT);

    UTF8[] values = (UTF8[]) RPC.call(method, parameter,
        new InetSocketAddress[] { new InetSocketAddress("127.0.0.1", PORT) });

    // The text returned over RPC must equal the text that was sent.
    assertEquals(TEXT, values[0].toString());
  }

  class HelloWorld {
    public UTF8 helloWorld(UTF8 utf8) {
      return utf8;
    }
  }
}
On 27.12.2005 at 05:38, Ed Whittaker wrote:
Hi,
I'm running nutch-0.7.1 on a couple of RedHat-9 linux machines. When I
execute "catalina.sh start" in the crawl directory (i.e. not using
distributed search) and query with a 2 Kanji Japanese string everything
works fine, i.e. the pages seem relevant and the output is in the
correct encoding.
However, when I run a distributed search using one search server
specified in search-servers.txt and the same index as used above, the
*returned pages are not the same* and the *output is corrupted*. To see
an example of this go to:
http://asked.ru/search.jsp?query=%E6%9D%B1%E4%BA%AC
This queries nutch with the string for Tokyo in Japanese. Unfortunately,
I can't provide access to an example of the working (non-distributed)
setup but trust me it looks good.
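For reference, the query value in that URL is just the UTF-8
percent-encoding of the two Kanji for Tokyo. A minimal sketch of how it
can be produced follows; the class and method names are made up for
illustration, and only the standard java.net.URLEncoder is used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Nutch.
class QueryEncodingSketch {

    // Percent-encode a query value using its UTF-8 bytes.
    static String encodeQuery(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "\u6771\u4eac" is the two-Kanji string for Tokyo, written with
        // Unicode escapes so the source file encoding does not matter.
        String tokyo = "\u6771\u4eac";
        System.out.println("search.jsp?query=" + encodeQuery(tokyo));
    }
}
```

Each Kanji takes three bytes in UTF-8, so the two characters turn into
six percent-encoded bytes in the URL.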
Note, this is not a problem concerning the Tomcat integration with
Apache since accessing the distributed search setup via
http://localhost:8080 gives identical (corrupted) output to what you'll
get if you click on the above link.
I would guess this is some socket encoding problem since that is
ostensibly the only difference in the 2 configurations, isn't it?
Does anyone have a distributed search setup which doesn't have these
encoding problems? I.e., is something wrong with my setup somewhere?
Or is this a known bug?
-Ed