Re: JNI and calling Hadoop jar files
The exception's reference to org.apache.hadoop.hdfs.DistributedFileSystem strongly implies that a hadoop-default.xml file, or at least a job.xml file, is present. Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the assumption is that the core jar is available. The ClassNotFoundException, on the other hand, implies that the hadoop-0.X.Y-core.jar is not available to JNI. Given those constraints, the two likely possibilities are that the -core jar is unavailable or damaged, or that the classloader being used somehow does not have access to the -core jar. One possible reason for the jar not being available is that the application is running on a different machine, or as a different user, and the jar is not actually present (or not readable) in the expected location.

Which way is your JNI: a Java application calling into a native shared library, or a native application calling into a JVM that it instantiates via libjvm calls? Could you dump the classpath that is in effect before your failing JNI call (System.getProperty("java.class.path"), and for that matter java.library.path, or getenv("CLASSPATH")) and provide an ls -l of the core jar from that classpath, run as the user that owns the process, on the machine the process is running on?

    <!-- from hadoop-default.xml -->
    <property>
      <name>fs.hdfs.impl</name>
      <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
      <description>The FileSystem for hdfs: uris.</description>
    </property>

On Mon, Mar 23, 2009 at 9:47 PM, Jeff Eastman j...@windwardsolutions.com wrote:

This looks somewhat similar to my Subtle Classloader Issue from yesterday. I'll be watching this thread too. Jeff

Saptarshi Guha wrote:

Hello, I'm using some JNI interfaces, via R. My classpath contains all the jar files in $HADOOP_HOME and $HADOOP_HOME/lib. My class is

    public SeqKeyList() throws Exception {
      config = new org.apache.hadoop.conf.Configuration();
      config.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-default.xml"));
      config.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-site.xml"));
      System.out.println("C=" + config);
      filesystem = FileSystem.get(config);
      System.out.println("C=" + config + " F=" + filesystem);
      System.out.println(filesystem.getUri().getScheme());
    }

I am using a distributed filesystem (org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl). When run from the command line and this class is created, everything works fine. When called via JNI I get

    java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem

Is this a JNI issue? How can it work from the command line using the same classpath, yet throw this exception when run via JNI? Saptarshi Guha

-- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
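For what it's worth, a tiny probe like the following (not from the thread), run in the same JVM immediately before the failing call, shows which classpath the JNI-launched JVM actually sees and whether the core jar's classes are reachable from that classloader:

    public class ClasspathProbe {
        public static void main(String[] args) {
            System.out.println("java.class.path   = " + System.getProperty("java.class.path"));
            System.out.println("java.library.path = " + System.getProperty("java.library.path"));
            System.out.println("CLASSPATH env     = " + System.getenv("CLASSPATH"));
            try {
                Class<?> c = Class.forName("org.apache.hadoop.hdfs.DistributedFileSystem");
                System.out.println("Loaded " + c + " via " + c.getClassLoader());
            } catch (ClassNotFoundException e) {
                System.out.println("DistributedFileSystem not visible: " + e);
            }
        }
    }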
Re: RDF store over HDFS/HBase
One of the things I've thought about with using HBase for RDF storage is whether to keep blank nodes or not. When I've spoken about supporting blank nodes I've always talked about requiring a global lock on the system, in order to ensure that a blank node referred to on one node in the cluster is the same blank node on another. I'd be interested in this part of your solution.

2009/3/24 Philip M. White p...@qnan.org:

On Mon, Mar 23, 2009 at 05:33:46PM -0700, stack wrote: Anywhere we can go to learn more about the effort? What can we do in HBase to make the project more likely to succeed?

Right now we don't have anything of value to show you, but we plan to move on this pretty quickly. We're copying the functionality of using HBase as the persistent store from another (proprietary) project. If you (or anyone else) would like to participate in this development, let me know. We can work together on this. -- Philip
Re: RDF store over HDFS/HBase
I would expect HBase to scale well; the semantics of the data being stored shouldn't matter, just the size. I think there are a number of production HBase installations that have billions of rows.

On Mon, Mar 23, 2009 at 4:10 PM, Ding, Hui hui.d...@sap.com wrote:

I remember there was a project proposal back in late last year. They've set up an official webpage. Not sure if they are still alive/making any progress. You can search the email archive.

-----Original Message----- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Monday, March 23, 2009 4:07 PM To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org; core-...@hadoop.apache.org Subject: RDF store over HDFS/HBase

Has anyone explored using HDFS/HBase as the underlying storage for an RDF store? Most solutions (all are single node) that I have found till now scale up only to a couple of billion rows in the triple store. Wondering how Hadoop could be leveraged here... Amandeep

Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
RE: RDF store over HDFS/HBase
I remember there was a project proposal back in late last year. They've set up an official webpage. Not sure if they are still alive/making any progress. You can search the email archive.

-----Original Message----- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Monday, March 23, 2009 4:07 PM To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org; core-...@hadoop.apache.org Subject: RDF store over HDFS/HBase

Has anyone explored using HDFS/HBase as the underlying storage for an RDF store? Most solutions (all are single node) that I have found till now scale up only to a couple of billion rows in the triple store. Wondering how Hadoop could be leveraged here... Amandeep

Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
Re: RDF store over HDFS/HBase
Have you heard of the Heart project? http://rdf-proj.blogspot.com/ I don't know its current status. - Andy

From: Amandeep Khurana Subject: RDF store over HDFS/HBase

Has anyone explored using HDFS/HBase as the underlying storage for an RDF store?
Re: Reduce doesn't start until map finishes
Just to let you know, we installed v0.21.0-dev and the issue is gone now.

2009/3/6 Rasit OZDAS rasitoz...@gmail.com:

So, is there currently no solution to my problem? Should I live with it? Or do we need a JIRA for this? What do you think?

2009/3/4 Nick Cen cenyo...@gmail.com:

Thanks. About the secondary sort, can you provide an example? What do the intermediate keys stand for? Assume I have two mappers, m1 and m2. The output of m1 is (k1,v1),(k2,v2) and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belong to the same partition and k1 < k2, so I think the order inside the reducer may be: (k1,v1) (k1,v3) (k2,v2) (k2,v4). Can the secondary sort change this order?

2009/3/4 Chris Douglas chri...@yahoo-inc.com:

The output of each map is sorted by partition and by key within that partition. The reduce merges the sorted map output assigned to its partition. The following may be helpful: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html If your job requires total order, consider o.a.h.mapred.lib.TotalOrderPartitioner. -C

On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:

Can you provide more info about sorting? Does the sort happen on the whole data set, or just within the specified partition?

2009/3/4 Mikhail Yakshin greycat.na@gmail.com:

On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote: This is normal behavior. The Reducer is guaranteed to receive all the results for its partition in sorted order. No reduce can start until all the maps are completed, since any running map could emit a result that would violate the order for the results it currently has. -C

_Reducers_ usually start almost immediately and start downloading data emitted by mappers as they go. This is their first phase. Their second phase can start only after completion of all mappers. In their second phase they sort the received data, and in their third phase they do the real reduction. -- WBR, Mikhail Yakshin

-- http://daily.appspot.com/food/ -- http://daily.appspot.com/food/ -- M. Raşit ÖZDAŞ -- M. Raşit ÖZDAŞ
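For reference, a sketch of the usual secondary-sort wiring in the old mapred API (nothing from this thread; the composite key is assumed to be a Text of the form "naturalKey<TAB>secondaryKey", which is an illustrative choice): partition and group on the natural key only, but let the full composite key drive the sort, so values reach the reducer ordered by the secondary part within each natural-key group.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SecondarySortConfig {

        /** Send records with the same natural key to the same reducer. */
        public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
            public void configure(JobConf job) {}
            public int getPartition(Text key, Text value, int numPartitions) {
                String natural = key.toString().split("\t", 2)[0];
                return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }

        /** Group reduce input by the natural key only, ignoring the secondary part. */
        public static class NaturalKeyGroupingComparator extends WritableComparator {
            public NaturalKeyGroupingComparator() { super(Text.class, true); }
            @SuppressWarnings("rawtypes")
            public int compare(WritableComparable a, WritableComparable b) {
                String ka = a.toString().split("\t", 2)[0];
                String kb = b.toString().split("\t", 2)[0];
                return ka.compareTo(kb);
            }
        }

        public static void configure(JobConf job) {
            job.setPartitionerClass(NaturalKeyPartitioner.class);
            // Text's default comparator already sorts the full composite key,
            // so only the grouping comparator needs to be overridden here.
            job.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
        }
    }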
Re: Reduce doesn't start until map finishes
What happened is that we added fast start (HADOOP-3136), which launches more than one task per heartbeat. Previously, if your maps didn't take very long, they finished before the next heartbeat and the task tracker was assigned a new map task. A side effect was that no reduce tasks were launched until the maps were complete, which prevents the shuffle from overlapping with the maps. -- Owen
Re: Join Variation
Have you considered HBase for this particular task? It looks like a simple lookup using the network mask as key would solve your problem. It's also possible to derive the network class (A, B, C) based on the network class of the IP concerned. But I guess your search file will cover ranges in more detail than just at the class level.

On Tue, Mar 24, 2009 at 12:33 PM, Tamir Kamara tamirkam...@gmail.com wrote:

Hi, We need to implement a join with a between operator instead of an equals. What we are trying to do is search a file for a key, where the key falls between two fields in the search file, like this:

main file (ip, a, b):
(80, zz, yy)
(125, vv, bb)

search file (from-ip, to-ip, d, e):
(52, 75, xxx, yyy)
(78, 98, aaa, bbb)
(99, 115, xxx, ddd)
(125, 130, hhh, aaa)
(150, 162, qqq, sss)

The outcome should be in the form (ip, a, b, d, e):
(80, zz, yy, aaa, bbb)
(125, vv, bb, hhh, aaa)

We could convert the IP ranges in the search file to single-record IPs and then do a regular join, but the number of single IPs is huge and this is probably not a good way. What would be a good course for doing this in Hadoop? Thanks, Tamir
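A minimal sketch (not from the thread) of the lookup this range join needs, assuming the search file is small enough to load into each mapper's memory (for example via the distributed cache) and that the IP ranges do not overlap; class and method names are illustrative:

    import java.util.Map;
    import java.util.TreeMap;

    // Loads the (small) range file into a sorted map, then answers point
    // lookups: find the range whose from-ip is the greatest value <= ip and
    // check that ip also falls under that range's to-ip.
    public class RangeJoinLookup {
        private final TreeMap<Long, String[]> ranges = new TreeMap<Long, String[]>();

        // rangeLine: "from-ip, to-ip, d, e"
        public void addRange(String rangeLine) {
            String[] f = rangeLine.split(",");
            ranges.put(Long.parseLong(f[0].trim()), f);
        }

        // Returns "d,e" for the range containing ip, or null if no range matches.
        public String lookup(long ip) {
            Map.Entry<Long, String[]> e = ranges.floorEntry(ip);
            if (e == null) return null;
            String[] f = e.getValue();
            return ip <= Long.parseLong(f[1].trim())
                    ? f[2].trim() + "," + f[3].trim() : null;
        }

        public static void main(String[] args) {
            RangeJoinLookup r = new RangeJoinLookup();
            r.addRange("78, 98, aaa, bbb");
            r.addRange("125, 130, hhh, aaa");
            System.out.println(r.lookup(80));   // aaa,bbb
            System.out.println(r.lookup(125));  // hhh,aaa
        }
    }

Inside a mapper, configure()/setup() would call addRange() for each line of the search file and map() would call lookup() for each main-file record, emitting the joined tuple whenever the lookup is non-null.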
Re: Join Variation
Hello Tamir, I think a better and simpler way of doing this is through Pig: http://wiki.apache.org/pig/PigOverview. Pig provides an SQL-like interface over Hadoop and supports the kind of operation you need to do with your data quite easily. Thanks, --- Peeyush

On Tue, 2009-03-24 at 13:33 +0200, Tamir Kamara wrote:

Hi, We need to implement a join with a between operator instead of an equals. What we are trying to do is search a file for a key, where the key falls between two fields in the search file, like this:

main file (ip, a, b):
(80, zz, yy)
(125, vv, bb)

search file (from-ip, to-ip, d, e):
(52, 75, xxx, yyy)
(78, 98, aaa, bbb)
(99, 115, xxx, ddd)
(125, 130, hhh, aaa)
(150, 162, qqq, sss)

The outcome should be in the form (ip, a, b, d, e):
(80, zz, yy, aaa, bbb)
(125, vv, bb, hhh, aaa)

We could convert the IP ranges in the search file to single-record IPs and then do a regular join, but the number of single IPs is huge and this is probably not a good way. What would be a good course for doing this in Hadoop? Thanks, Tamir
Small Test Data Sets
I want to confirm something with the list about behavior I'm seeing. I needed to verify that my Reader was reading our file format correctly, so I created an MR job that simply passes each K/V pair to the reducer, which then just writes each one to the output file. This lets me check by hand that all K/V data points from our file format are getting pulled out of the file correctly. I have set up our InputFormat, RecordReader, and Reader subclasses for our specific file format. While running some basic tests on a small (1 MB) single file I noticed something odd: I was getting 2 copies of each data point in the output file. Initially I thought my Reader was somehow reading a data point without moving the read head, but I verified through a series of tests that this was not the case. I then reasoned that since my job had 2 mappers by default and only 1 input file, each mapper must be reading the file independently. I then set the -m flag to 1 and got the proper output. Is it safe to assume, when testing on a file smaller than the block size, that I should always use -m 1 in order to get a proper block-to-mapper mapping? Also, should I assume that if you have more mappers than disk blocks involved, you will get duplicate values? I may have set something wrong; I just wanted to check. Thanks Josh Patterson TVA
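One common cause of exactly this symptom (offered as an assumption, since the thread doesn't show the custom classes) is an InputFormat whose RecordReader always reads the whole file from the beginning while the framework still divides the file into multiple splits. Marking the format non-splittable makes each file go to exactly one mapper regardless of -m; the sketch below substitutes the stock LineRecordReader for the custom reader:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {
        // Never split a file: each file becomes exactly one InputSplit,
        // so exactly one mapper reads it no matter how many maps are requested.
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            reporter.setStatus(split.toString());
            return new LineRecordReader(job, (FileSplit) split);
        }
    }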
Re: hadoop need help please suggest
What is the scale you are thinking of (10s, 100s, or more nodes)? The memory for metadata at the NameNode that you mentioned is the main issue with small files. There are multiple alternatives for dealing with that; the issue has been discussed many times here. Also, please use the core-user@ id alone when asking for help; you don't need to send to core-devel@. Raghu.

snehal nagmote wrote:

Hello Sir, I have some doubts, please help me. We have a requirement for a scalable storage system. We have developed an agro-advisory system in which farmers send crop pictures, typically 6-7 photos of 3-4 KB each, uploaded sequentially, to be stored on a storage server; these photos are then read sequentially by scientists to diagnose the problem, and the images are never modified. For storing these images we are using the Hadoop file system. Is it feasible to use HDFS for this purpose? Also, since the images are only 3-4 KB and Hadoop reads data in 64 MB blocks, how can we improve performance? What tricks and tweaks should be done to use Hadoop for this kind of purpose? The next problem is that Hadoop stores all of the metadata in memory. Can we use some mechanism to store the files in blocks of some greater size? Because the files are small, the NameNode would otherwise hold a lot of metadata and overflow main memory. Please suggest what could be done. Regards, Snehal
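One widely used alternative (a sketch under assumptions, not something prescribed in this thread) is to pack the many small image files into a single SequenceFile keyed by file name, so the NameNode tracks one large file instead of thousands of tiny ones; the class name and path handling below are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackImages {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path input = new Path(args[0]);   // directory of small image files
            Path output = new Path(args[1]);  // packed SequenceFile
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, Text.class, BytesWritable.class);
            try {
                for (FileStatus st : fs.listStatus(input)) {
                    byte[] buf = new byte[(int) st.getLen()];
                    java.io.DataInputStream in = fs.open(st.getPath());
                    try {
                        in.readFully(buf);   // images are only a few KB each
                    } finally {
                        in.close();
                    }
                    // key = original file name, value = raw image bytes
                    writer.append(new Text(st.getPath().getName()),
                                  new BytesWritable(buf));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

Readers (or MapReduce jobs) can then iterate the SequenceFile sequentially, which matches the access pattern described in the question.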
Re: Broder or other near-duplicate algorithms?
Hi Mark, we have done something on top of Hadoop/HBase (MapReduce for evaluation, HBase for online serving), following http://www2007.org/papers/paper215.pdf.

Hi, does anybody know of an open-source implementation of the Broder algorithm (http://www.std.org/%7Emsm/common/clustering.html) in Hadoop? Monika Henzinger reports having done so in MapReduce (http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf), and I wonder if somebody has repeated her work in open source. I am going to do this if there is no implementation yet, and then I will ask what I can do with the code. Cheers, Mark

-- Yi-Kai Tsai (cuma) yi...@yahoo-inc.com, Asia Search Engineering.
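For anyone who wants the flavor of what such an implementation computes, here is a tiny, thread-independent illustration of the shingling-plus-minhash signature idea at the core of Broder's algorithm (the hashing scheme is deliberately simplistic):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class MinHashSketch {
        private final int[] seeds;

        public MinHashSketch(int numHashes) {
            seeds = new int[numHashes];
            Random r = new Random(42);
            for (int i = 0; i < numHashes; i++) seeds[i] = r.nextInt();
        }

        // k-word shingles of a whitespace-tokenized document
        static Set<String> shingles(String doc, int k) {
            String[] w = doc.toLowerCase().split("\\s+");
            Set<String> out = new HashSet<String>();
            for (int i = 0; i + k <= w.length; i++)
                out.add(String.join(" ", Arrays.copyOfRange(w, i, i + k)));
            return out;
        }

        // signature[i] = minimum of the i-th hash over all shingles
        int[] signature(Set<String> shingleSet) {
            int[] sig = new int[seeds.length];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (String s : shingleSet)
                for (int i = 0; i < seeds.length; i++) {
                    int h = (s.hashCode() ^ seeds[i]) * 0x9E3779B1; // crude mixing, illustrative only
                    if (h < sig[i]) sig[i] = h;
                }
            return sig;
        }

        // fraction of matching positions estimates Jaccard similarity of the shingle sets
        static double estimatedJaccard(int[] a, int[] b) {
            int same = 0;
            for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
            return (double) same / a.length;
        }
    }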
Re: Broder or other near-duplicate algorithms?
Yi-Kai, that's good to know, and I have read this article, but is your code available? Thank you, Mark

On Tue, Mar 24, 2009 at 9:51 AM, Yi-Kai Tsai yi...@yahoo-inc.com wrote:

Hi Mark, we have done something on top of Hadoop/HBase (MapReduce for evaluation, HBase for online serving), following http://www2007.org/papers/paper215.pdf.

Hi, does anybody know of an open-source implementation of the Broder algorithm (http://www.std.org/%7Emsm/common/clustering.html) in Hadoop? Monika Henzinger reports having done so in MapReduce (http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf), and I wonder if somebody has repeated her work in open source. I am going to do this if there is no implementation yet, and then I will ask what I can do with the code. Cheers, Mark

-- Yi-Kai Tsai (cuma) yi...@yahoo-inc.com, Asia Search Engineering.
Software Development Process Help
Hello Everybody, Over the past couple of months I documented a new software development process for open source projects and communities. The new process attempts to address shortcomings of existing major software development processes and to build upon existing efforts of open source communities. A document with a detailed description of the process, along with a short presentation of it, can be found in the survey described below. This work will be my thesis for a Master's degree in Software Engineering. I need help from open source contributors to validate my work. I created a simple survey with seven basic questions that will help determine whether the process is applicable and viable in the open source space. The survey can be found here: http://spreadsheets.google.com/viewform?hl=en&formkey=cFg5UUVKakwyOTJ2eDhNWDM5WUlfVlE6MA. The survey is completely anonymous. I apologize in advance for any grammatical errors or mistakes; the document is a rough draft (I am editing the document every day). Any help would be greatly appreciated and acknowledged! Please feel free to contact me for any additional details or questions. Thank you, Stefan
Help Indexing network traffic
Hi all, I have a txt file that captured all of my network traffic (IP addresses, ports, etc.). I was wondering if you can help me filter out a particular IP address. Thank you, Nga
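If the goal is just to keep (or drop) the capture lines that mention one address, a map-only Hadoop job is enough. A minimal sketch using the old mapred API, with made-up class and property names ("IpFilter", "filter.ip"):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class IpFilter extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        private String ip;

        @Override
        public void configure(JobConf job) {
            ip = job.get("filter.ip");   // the address to match, set in main()
        }

        @Override
        public void map(LongWritable key, Text line,
                        OutputCollector<Text, NullWritable> out, Reporter reporter)
                throws IOException {
            // keep only the lines mentioning the requested IP address
            if (line.toString().contains(ip)) {
                out.collect(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws IOException {
            JobConf job = new JobConf(IpFilter.class);
            job.setJobName("ip-filter");
            job.set("filter.ip", args[2]);
            job.setMapperClass(IpFilter.class);
            job.setNumReduceTasks(0);             // map-only: no reduce phase
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            JobClient.runJob(job);
        }
    }

Inverting the contains() check would instead strip the address out of the capture. For a single modest-sized file, a plain grep would of course also do.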
virtualization with hadoop
Hi, I have created a Hadoop cluster on a single machine using different VM instances. Will the replication factor still be effective in this setup? I would also like to know about the performance of HDFS in such a configuration.
Need Help hdfs -How to minimize access Time
Hello Sir, I am doing an M.Tech at IIIT Hyderabad and working on a research project whose aim is to develop a scalable storage system for eSagu. eSagu takes crop images from the fields and stores them in a file system; those images are then read by agricultural scientists to diagnose problems. Many fields in A.P. are already using this system and it may grow beyond A.P., so we need a scalable storage system.

1) My problem is that we are using Hadoop for storage, but Hadoop reads and writes in 64 MB chunks, while the stored images are very small, say at most 2 to 3 MB. So the access time for reading images would be large. Can you suggest how this access time can be reduced? Is there anything else we could do to improve performance, such as building our own cache, and to what extent would that be feasible or helpful for this kind of application?

2) Second, would Hadoop be useful for lots of small data like this at all? If not, what tricks could we use to make it usable for this kind of application?

Please help. Thanks in advance. Regards, Snehal Nagmote IIIT Hyderabad