Re: JNI and calling Hadoop jar files

2009-03-24 Thread jason hadoop
The exception's reference to *org.apache.hadoop.hdfs.DistributedFileSystem*
strongly implies that a hadoop-default.xml file, or at least a job.xml file,
is present.
Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the
assumption is that the core jar is available.
Given the ClassNotFoundException, the implication is that the
hadoop-0.X.Y-core.jar is not available to JNI.

Given the above constraints, the two likely possibilities are that the -core
jar is unavailable or damaged, or that somehow the classloader being used
does not have access to the -core  jar.

A possible reason for the jar not being available is that the application is
running on a different machine, or as a different user, and the jar is not
actually present, or not readable, in the expected location.





Which way is your JNI going: a Java application calling into a native shared
library, or a native application calling into a JVM that it instantiates via
libjvm calls?

Could you dump the classpath that is in effect before your failing JNI call
(System.getProperty("java.class.path"), and for that matter
"java.library.path" and getenv("CLASSPATH")),
and provide an ls -l of the core jar from that classpath, run as the user
that owns the process, on the machine that the process is running on?
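
(Something like the following, called right before the failing JNI call, will
print what that JVM actually sees. A minimal sketch; hook it in wherever your
JNI entry point is.)

    // Diagnostic sketch: dump the effective class path and library path from
    // inside the JVM that the JNI call runs in, just before the failing call.
    public class ClasspathDump {
        public static void dump() {
            System.out.println("java.class.path   = "
                + System.getProperty("java.class.path"));
            System.out.println("java.library.path = "
                + System.getProperty("java.library.path"));
            System.out.println("CLASSPATH env     = " + System.getenv("CLASSPATH"));
        }
    }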

<!-- from hadoop-default.xml -->
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>



On Mon, Mar 23, 2009 at 9:47 PM, Jeff Eastman j...@windwardsolutions.com wrote:

 This looks somewhat similar to my Subtle Classloader Issue from yesterday.
 I'll be watching this thread too.

 Jeff


 Saptarshi Guha wrote:

 Hello,
 I'm using some JNI interfaces, via R. My classpath contains all the
 jar files in $HADOOP_HOME and $HADOOP_HOME/lib.
 My class is
    public SeqKeyList() throws Exception {

        config = new org.apache.hadoop.conf.Configuration();
        config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                + "/hadoop-default.xml"));
        config.addResource(new Path(System.getenv("HADOOP_CONF_DIR")
                + "/hadoop-site.xml"));

        System.out.println("C=" + config);
        filesystem = FileSystem.get(config);
        System.out.println("C=" + config + " F=" + filesystem);
        System.out.println(filesystem.getUri().getScheme());

    }

 I am using a distributed filesystem
 (org.apache.hadoop.hdfs.DistributedFileSystem for fs.hdfs.impl).
 When run from the command line and this class is created, everything works
 fine.
 When called using JNI I get
  java.lang.ClassNotFoundException:
 org.apache.hadoop.hdfs.DistributedFileSystem

 Is this a JNI issue? How can it work from the command line using the
 same classpath, yet throw this exception when run via JNI?
 Saptarshi Guha








-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: RDF store over HDFS/HBase

2009-03-24 Thread Andrew Newman
So one of the things that I've thought about with using HBase for RDF
storage was whether to keep blank nodes or not.  When I've spoken about
supporting blank nodes, I've always talked about requiring a global
lock on the system, in order to ensure that a blank node referred to on
one node in the cluster is the same blank node on another.
I'd be interested in this part of your solution.

2009/3/24 Philip M. White p...@qnan.org:
 On Mon, Mar 23, 2009 at 05:33:46PM -0700, stack wrote:
 Anywhere we can go to learn more about the effort?  What can we do in HBase
 to make the project more likely to succeed?

 Right now we don't have anything of value to show you, but we plan to
 move on this pretty quickly.  We're copying the functionality of using
 HBase as the persistent store from another (proprietary) project.

 If you (or anyone else) would like to participate in this development,
 let me know.  We can work together on this.

 --
 Philip



Re: RDF store over HDFS/HBase

2009-03-24 Thread Ryan Rawson
I would expect HBase to scale well - the semantics of the data being
stored shouldn't matter, just the size.

I think there are a number of production HBase installations that have
billions of rows.

On Mon, Mar 23, 2009 at 4:10 PM, Ding, Hui hui.d...@sap.com wrote:

 I remember there was a project proposal back in late last year. They've
 set up an official webpage. Not sure if they are still alive/making any
 progress.
 You can search in the email archive.

 -Original Message-
 From: Amandeep Khurana [mailto:ama...@gmail.com]
 Sent: Monday, March 23, 2009 4:07 PM
 To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org;
 core-...@hadoop.apache.org
 Subject: RDF store over HDFS/HBase

 Has anyone explored using HDFS/HBase as the underlying storage for an
 RDF
 store? Most solutions (all are single node) that I have found till now
 scale
 up only to a couple of billion rows in the Triple store. Wondering how
 Hadoop could be leveraged here...

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz



RE: RDF store over HDFS/HBase

2009-03-24 Thread Ding, Hui
I remember there was a project proposal back in late last year. They've
set up an official webpage. Not sure if they are still alive/making any
progress.
You can search in the email archive.

-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com] 
Sent: Monday, March 23, 2009 4:07 PM
To: hbase-u...@hadoop.apache.org; core-user@hadoop.apache.org;
core-...@hadoop.apache.org
Subject: RDF store over HDFS/HBase

Has anyone explored using HDFS/HBase as the underlying storage for an
RDF
store? Most solutions (all are single node) that I have found till now
scale
up only to a couple of billion rows in the Triple store. Wondering how
Hadoop could be leveraged here...

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: RDF store over HDFS/HBase

2009-03-24 Thread Andrew Purtell

Have you heard of the Heart project?

http://rdf-proj.blogspot.com/

I don't know of its current status. 

   - Andy


 From: Amandeep Khurana 
 Subject: RDF store over HDFS/HBase

 Has anyone explored using HDFS/HBase as the underlying
 storage for an RDF store?



  


Re: Reduce doesn't start until map finishes

2009-03-24 Thread Rasit OZDAS
Just to let you know, we installed v0.21.0-dev and there is no such issue now.

2009/3/6 Rasit OZDAS rasitoz...@gmail.com

 So, is there currently no solution to my problem?
 Should I live with it? Or do we have to have a JIRA for this?
 What do you think?


 2009/3/4 Nick Cen cenyo...@gmail.com

 Thanks. About the secondary sort, can you provide an example? What do
 the intermediate keys stand for?

 Assume I have two mappers, m1 and m2. The output of m1 is (k1,v1),(k2,v2)
 and the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belong to the
 same partition and k1 < k2, so I think the order inside the reducer may be:
 (k1,v1)
 (k1,v3)
 (k2,v2)
 (k2,v4)

 can the secondary sort change this order?



 2009/3/4 Chris Douglas chri...@yahoo-inc.com

  The output of each map is sorted by partition and by key within that
  partition. The reduce merges sorted map output assigned to its partition
  into the reduce. The following may be helpful:
 
  http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
 
  If your job requires total order, consider
  o.a.h.mapred.lib.TotalOrderPartitioner. -C
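
  (A rough sketch of wiring that up with the 0.19-era mapred API, from memory
  and untested; the key type, paths, and sampler parameters below are only
  placeholders:)

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.lib.InputSampler;
      import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

      // Sketch: sample the input to pick partition boundaries, then point the
      // TotalOrderPartitioner at the resulting partition file.
      public class TotalOrderSetup {
        public static void main(String[] args) throws Exception {
          JobConf job = new JobConf(TotalOrderSetup.class);
          // Assumes the input format emits keys of the same type the maps
          // output, since the sampler reads input splits to pick boundaries.
          FileInputFormat.setInputPaths(job, new Path("/input"));
          job.setNumReduceTasks(4);
          job.setMapOutputKeyClass(Text.class);

          Path partitionFile = new Path("/tmp/_partitions.lst");
          TotalOrderPartitioner.setPartitionFile(job, partitionFile);

          InputSampler.Sampler<Text, Text> sampler =
              new InputSampler.RandomSampler<Text, Text>(0.1, 1000, 10);
          InputSampler.writePartitionFile(job, sampler);

          job.setPartitionerClass(TotalOrderPartitioner.class);
          // ... set mapper, reducer, output, and JobClient.runJob(job) as usual.
        }
      }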
 
 
  On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:
 
  can you provide more info about sorting? Does the sort happen on the whole
  data set, or just on the specified partition?
 
  2009/3/4 Mikhail Yakshin greycat.na@gmail.com
 
   On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
 
  This is normal behavior. The Reducer is guaranteed to receive all the
  results for its partition in sorted order. No reduce can start until all
  the maps are completed, since any running map could emit a result that
  would violate the order for the results it currently has. -C
 
 
  _Reducers_ usually start almost immediately and start downloading data
  emitted by mappers as they go. This is their first phase. Their second
  phase can start only after completion of all mappers. In their second
  phase, they're sorting received data, and in their third phase they're
  doing real reduction.
 
  --
  WBR, Mikhail Yakshin
 
 
 
 
  --
  http://daily.appspot.com/food/
 
 
 


 --
 http://daily.appspot.com/food/




 --
 M. Raşit ÖZDAŞ




-- 
M. Raşit ÖZDAŞ


Re: Reduce doesn't start until map finishes

2009-03-24 Thread Owen O'Malley
What happened is that we added fast start (HADOOP-3136), which
launches more than one task per heartbeat. Previously, if your maps
didn't take very long, they finished before the heartbeat and the task
tracker was assigned a new map task. A side effect was that no reduce
tasks were launched until the maps were complete, which prevented the
shuffle from overlapping with the maps.

-- Owen


Re: Join Variation

2009-03-24 Thread Stefan Podkowinski
Have you considered hbase for this particular task?
Looks like a simple lookup using the network mask as key would solve
your problem.

It's also possible to derive the network class (A, B, C) of the concerned
IP, but I guess your search file will cover ranges in more detail than just
the class level.
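
If the search file is small enough to cache in memory on each mapper (for
example shipped via the DistributedCache), the lookup side could look
roughly like the untested sketch below; it keys a TreeMap by the from-ip
and checks the to-ip bound. A mapper would populate it in configure() and
call lookup(ip) for every (ip, a, b) record.

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: in-memory range lookup for (from-ip, to-ip, d, e) records.
    public class IpRangeLookup {

        private static class Range {
            final long toIp;
            final String rest; // the "d, e" fields, kept as one string here
            Range(long toIp, String rest) { this.toIp = toIp; this.rest = rest; }
        }

        // keyed by from-ip; assumes the ranges do not overlap
        private final TreeMap<Long, Range> ranges = new TreeMap<Long, Range>();

        public void add(long fromIp, long toIp, String rest) {
            ranges.put(fromIp, new Range(toIp, rest));
        }

        /** Returns the "d, e" payload of the range containing ip, or null. */
        public String lookup(long ip) {
            Map.Entry<Long, Range> e = ranges.floorEntry(ip);
            if (e == null || ip > e.getValue().toIp) {
                return null;
            }
            return e.getValue().rest;
        }
    }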

On Tue, Mar 24, 2009 at 12:33 PM, Tamir Kamara tamirkam...@gmail.com wrote:
 Hi,

 We need to implement a Join with a between operator instead of an equal.
 What we are trying to do is search a file for a key where the key falls
 between two fields in the search file like this:

 main file (ip, a, b):
 (80, zz, yy)
 (125, vv, bb)

 search file (from-ip, to-ip, d, e):
 (52, 75, xxx, yyy)
 (78, 98, aaa, bbb)
 (99, 115, xxx, ddd)
 (125, 130, hhh, aaa)
 (150, 162, qqq, sss)

 the outcome should be in the form (ip, a, b, d, e):
 (80, zz, yy, aaa, bbb)
 (125, vv, bb, hhh, aaa)

 We could convert the ip ranges in the search file to single record ips and
 then do a regular join, but the number of single ips is huge and this is
 probably not a good way.
 What would be a good course for doing this in hadoop ?


 Thanks,
 Tamir



Re: Join Variation

2009-03-24 Thread Peeyush Bishnoi
Hello Tamir ,

I think a better and simpler way of doing this is through Pig.

http://wiki.apache.org/pig/PigOverview

Pig provides an SQL-like interface over Hadoop and supports the kind
of operation you need to do with your data quite easily.


Thanks ,

---
Peeyush

On Tue, 2009-03-24 at 13:33 +0200, Tamir Kamara wrote:

 Hi,
 
 We need to implement a Join with a between operator instead of an equal.
 What we are trying to do is search a file for a key where the key falls
 between two fields in the search file like this:
 
 main file (ip, a, b):
 (80, zz, yy)
 (125, vv, bb)
 
 search file (from-ip, to-ip, d, e):
 (52, 75, xxx, yyy)
 (78, 98, aaa, bbb)
 (99, 115, xxx, ddd)
 (125, 130, hhh, aaa)
 (150, 162, qqq, sss)
 
 the outcome should be in the form (ip, a, b, d, e):
 (80, zz, yy, aaa, bbb)
 (125, vv, bb, hhh, aaa)
 
 We could convert the ip ranges in the search file to single record ips and
 then do a regular join, but the number of single ips is huge and this is
 probably not a good way.
 What would be a good course for doing this in hadoop ?
 
 
 Thanks,
 Tamir


Small Test Data Sets

2009-03-24 Thread Patterson, Josh
I want to confirm something with the list that I'm seeing;
 
I needed to confirm that my Reader was reading our file format
correctly, so I created a MR job that simply output each K/V pair to the
reducer, which then just wrote out each one to the output file. This
allows me to check by hand that all K/V points of data from our file
format are getting pulled out of the file correctly. I have setup our
InputFormat, RecordReader, and Reader subclasses for our specific file
format.
 
While running some basic tests on a small (1meg) single file I noticed
something odd --- I was getting 2 copies of each data point in the
output file. Initially I thought my Reader was just somehow reading the
data point and not moving the read head, but I verified that was not the
case through a series of tests.
 
I then went on to reason that since I had 2 mappers by default on my
job, and only 1 input file, each mapper must be reading the file
independently. I then set the -m flag to 1, and I got the proper output.
Is it safe to assume, when testing on a file that is smaller than the block
size, that I should always use -m 1 in order to get a proper block-to-mapper
mapping? Also, should I assume that if you have more mappers than disk
blocks involved you will get duplicate values? I may have set
something wrong, I just wanted to check. Thanks
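
For reference, one workaround I am considering instead of always passing -m 1
is to mark our InputFormat as non-splittable. A rough, untested sketch (the
class name and key/value types are placeholders for our own classes):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch: force one split (and so one mapper) per input file, so a reader
    // that ignores split offsets cannot produce duplicate records.
    public class OurInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // never split a file, regardless of the -m hint
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter)
                throws IOException {
            // Plug the custom RecordReader for our file format in here.
            throw new IOException("RecordReader not wired up in this sketch");
        }
    }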
 
Josh Patterson
TVA
 


Re: hadoop need help please suggest

2009-03-24 Thread Raghu Angadi


What is scale you are thinking of? (10s, 100s or more nodes)?

The memory for metadata at the NameNode that you mentioned is the main issue
with small files. There are multiple alternatives for dealing with that;
this issue has been discussed many times here.
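
For example, one common alternative is to pack the small images into a
single SequenceFile (file name as the key, raw bytes as the value) and read
from that, so the NameNode tracks one large file instead of many tiny ones.
A rough sketch, with placeholder paths:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: pack a directory of small image files into one SequenceFile.
    public class PackImages {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path("/user/esagu/images");      // placeholder
        Path out = new Path("/user/esagu/images.seq"); // placeholder

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (FileStatus stat : fs.listStatus(in)) {
            byte[] bytes = new byte[(int) stat.getLen()];
            FSDataInputStream stream = fs.open(stat.getPath());
            try {
              stream.readFully(bytes);
            } finally {
              stream.close();
            }
            writer.append(new Text(stat.getPath().getName()),
                          new BytesWritable(bytes));
          }
        } finally {
          writer.close();
        }
      }
    }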


Also, please use the core-user@ list alone when asking for help; you don't
need to send to core-devel@.


Raghu.

snehal nagmote wrote:

Hello Sir,

I have some doubts, please help me.
We have a requirement for a scalable storage system. We have developed an
agro-advisory system in which farmers send crop pictures, typically in a
sequential manner, some 6-7 photos of 3-4 KB each, which would be stored on a
storage server. These photos would then be read sequentially by scientists to
detect the problem; the images would not be written to again.

So for storing these images we are using the Hadoop file system. Is it
feasible to use the Hadoop file system for this purpose?

Also, since the images are only 3-4 KB and Hadoop reads data in blocks of
64 MB, how can we increase the performance? What tricks and tweaks should be
done to use Hadoop for this kind of purpose?

The next problem is that Hadoop stores all the metadata in memory. Can we use
some mechanism to store the files in blocks of some greater size? Because the
files are small, it will store lots of metadata and overflow the main memory.
Please suggest what could be done.


regards,
Snehal





Re: Broder or other near-duplicate algorithms?

2009-03-24 Thread Yi-Kai Tsai

hi Mark

we had done something on top of hadoop/hbase (mapreduce for evaluation , 
hbase for  online serving )

by reference http://www2007.org/papers/paper215.pdf


Hi,

does anybody know of an open-source implementation of the Broder
algorithm (http://www.std.org/%7Emsm/common/clustering.html) in Hadoop?
Monika Henzinger reports having done so
(http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf)
in MapReduce, and I wonder if somebody has repeated her work in open source?

I am going to do this if there is no implementation yet, and then I will ask
what I can do with the code.

Cheers,
Mark
  



--
Yi-Kai Tsai (cuma) yi...@yahoo-inc.com, Asia Search Engineering.



Re: Broder or other near-duplicate algorithms?

2009-03-24 Thread Mark Kerzner
Yi-Kai,
that's good to know - and I have read this article - but is your code
available?

Thank you,
Mark

On Tue, Mar 24, 2009 at 9:51 AM, Yi-Kai Tsai yi...@yahoo-inc.com wrote:

 hi Mark

 we had done something on top of hadoop/hbase (mapreduce for evaluation ,
 hbase for  online serving )
 by reference http://www2007.org/papers/paper215.pdf

  Hi,

 does anybody know of an open-source implementation of the Broder
 algorithm (http://www.std.org/%7Emsm/common/clustering.html) in Hadoop?
 Monika Henzinger reports having done so
 (http://ltaa.epfl.ch/monika/mpapers/nearduplicates2006.pdf)
 in MapReduce, and I wonder if somebody has repeated her work in open
 source?

 I am going to do this if there is no implementation yet, and then I will
 ask
 what I can do with the code.

 Cheers,
 Mark




 --
 Yi-Kai Tsai (cuma) yi...@yahoo-inc.com, Asia Search Engineering.




Software Development Process Help

2009-03-24 Thread Stefan Negrea

Hello Everybody,

Over the past couple of months I documented a new software development process 
for open source projects and communities. The new process attempts to address 
shortcomings of existing major software development processes and build upon 
existing attempts of open source communities. A document with a detailed 
description of the process along with a short presentation of the process can 
be found in the survey described below. This work will be my thesis for a 
Master’s degree in Software Engineering.

I would need help from open source contributors to validate my work. I created 
a simple survey with seven basic questions that would help understand if the 
process is applicable and viable in the open source space. The survey can be 
found here: 
http://spreadsheets.google.com/viewform?hl=enformkey=cFg5UUVKakwyOTJ2eDhNWDM5WUlfVlE6MA
 . The survey is completely anonymous. I apologize in advance for any 
grammatical errors or mistakes; the document is a rough draft (I am editing the 
document everyday).

Any help would be greatly appreciated and acknowledged!

Please feel free to contact me for any additional details or questions. 

Thank you,
Stefan


Help Indexing network traffic

2009-03-24 Thread nga pham
Hi all,

I have a txt file that captured all of my network traffic (IP addresses,
ports, etc.), and I was wondering if you can help me filter out a particular
IP address.


Thank you,
Nga


virtualization with hadoop

2009-03-24 Thread Vishal Ghawate
Hi,

I have created a Hadoop cluster on a single machine using different VM
instances.

Now, will the replication factor still be effective? I would also like to
know about the performance of HDFS in this setup.




Need Help hdfs -How to minimize access Time

2009-03-24 Thread snehal nagmote
Hello Sir,
I am doing an M.Tech at IIIT Hyderabad, working on a research project whose
aim is to develop a scalable storage system for esagu.
The esagu system is all about taking crop images from the fields and storing
them in the filesystem; those images are then accessed by agricultural
scientists to detect the problem. Currently many fields in A.P. are using
this system, and it may grow beyond A.P., so we require a scalable storage
system.

1) My problem is that we are using Hadoop for the storage, but Hadoop
reads/writes in 64 MB chunks, while the stored images are very small, say
2 to 3 MB at most, so the access time for these images would be large. Can
you suggest how this access time can be reduced? Is there anything else we
could do to improve the performance, like building our own cache? To what
extent would that be feasible or helpful in this kind of application?
2) Second, would Hadoop be useful for small data like this? If not, what
tricks could we use to make it usable for this kind of application?

Please help, Thanks in advance



Regards,
Snehal Nagmote
IIIT Hyderabad