Hi Aaron,

When running this command I get the following error:

blur> create repl-table hdfs://namenode.domain.com:8020/blur/repl-table 1
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4

Would this be related to the fact that I am running CDH4?

Paul.

*From:* Aaron McCurry <[email protected]>
*Sent:* 20 February 2013 19:23
*To:* [email protected]
*Subject:* Re: New to Search and Blur

Welcome Paul! I will try to answer your questions below:

On Wed, Feb 20, 2013 at 1:41 PM, Paul O'Donoghue <[email protected]> wrote:

> Hi,
>
> First up I would like to say I’m really excited by the Blur project, it
> seems to fit the need of a potential project perfectly. I’m hoping that I
> can someday contribute back to this project in some way as it seems that it
> will be of enormous help to me.
>
> Now, on to the meat of the issue. I’m a complete search newbie. I am coming
> from a Spring/Application development background but have to get involved
> in the Search/Big data field for a current client. Since the new year I
> have been looking at Hadoop and have set up a small cluster using Cloudera’s
> excellent tools. I’ve been downloading datasets, running MR jobs, etc. and
> think I have gleaned a very basic level of knowledge which is enough for me
> to learn more when I need it. This week I have started looking at Blur, and
> at present I have cloned the src to the hadoop namenode where I have built
> and started the blur servers. But now I am stuck, and don’t know where to
> go. So I will ask the following
>
> 1 - /apache-blur-0.2.0-SNAPSHOT/conf/servers. At present I just have my
> namenode defined in here. Do I need to add my datanodes as well?
>

You don't have to, but the normal configuration is to run Blur alongside the datanodes. That means you will have to copy the SNAPSHOT directory to all the datanodes and add all the datanodes to the servers file. However, if you want to start simple, you could run Blur on a single node; the namenode could work.
Just to be clear, I would not recommend running Blur on the same machine as your namenode in a production environment, but for testing it should be fine. I would, however, put the name of your server in the servers file and remove localhost.

>
> 2 - blur> create repl-table hdfs://localhost:9000/blur/repl-table 1
> java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused.
>
> I’m confused here. Is 9000 the correct port? Is there some sort of user
> auth issue?
>

I would change the command to be:

create repl-table hdfs://<namenode>/blur/repl-table 1

The <namenode> should be the same as the fs.default.name in your core-site.xml in the hadoop/conf directory.

>
> 3 - Assuming I create a table on the hdfs, when I want to import my data
> into it I use a MR job yes? What is the best way to package this job? Do I
> have to include all the Blur jars or do I install Blur on the datanodes and
> set a classpath? Is it possible to link to an example MR job in a maven
> project? Or am I on completely the wrong track.
>

You are on the right track; however, you won't need to package up the jar files across the cluster. We haven't built a nice automated way to run map reduce jobs, but this is what you need to do. Take a look at:

https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=src/blur-mapred/src/main/java/org/apache/blur/example/BlurExampleIndexWriter.java;h=9d6eb546e565303f328556fea29d1345344e8065;hb=0.2-dev

This is a writing example in the new Blur code (0.2.x); there is also a reading example in the same package. This example actually pushes the updates through the thrift API. The bulk importer that writes indexes directly to HDFS has not been rewritten for 0.2 yet.

As for the Blur libraries, you can use the simple approach of putting all the jars in a lib folder and creating a single jar including your classes and the lib folder (jars inside the jar).
Hadoop understands that the lib folder in the jar file is to be added to the classpath of the running tasks. Thus it will automatically distribute the libraries on to the Hadoop MR cluster.

Let us know if you have more questions and how we can help.

Thanks!

Aaron

>
> Thanks for your help,
>
> Paul.
>
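
For anyone following the thread, the "jars inside the jar" layout described in the reply to question 3 can be sketched roughly as below. This is only an illustration of the archive layout, not part of Blur or Hadoop: the function name `build_job_jar` and the example file names are hypothetical, and a real build would more likely use the `jar` tool or a Maven plugin, which also write a manifest.

```python
import zipfile
from pathlib import Path


def build_job_jar(classes_dir, dependency_jars, out_jar):
    """Write a job jar: compiled classes at the root, dependency jars under lib/.

    Hypothetical helper for illustration only; it does not write a manifest.
    """
    classes_dir = Path(classes_dir)
    with zipfile.ZipFile(out_jar, "w", zipfile.ZIP_DEFLATED) as jar:
        # Your own compiled classes keep their package-relative paths.
        for path in sorted(classes_dir.rglob("*")):
            if path.is_file():
                jar.write(path, path.relative_to(classes_dir).as_posix())
        # Each dependency jar is stored as-is (not unpacked) under lib/.
        for dep in dependency_jars:
            jar.write(dep, "lib/" + Path(dep).name)
    return out_jar
```

Listing the result with `zipfile.ZipFile(out_jar).namelist()` should show your class files at the top level and every dependency under `lib/`, which is the folder the reply says Hadoop adds to the task classpath.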
