HBase Python recommended interface

2012-03-14 Thread Håvard Wahl Kongsgård
Hi, does anyone have recommendations for a Python interface to HBase?
Thrift is one possibility, but is there a library like
https://github.com/pycassa/pycassa ?

-- 
Håvard Wahl Kongsgård
NTNU

http://havard.security-review.net/


Re: Partition classes, how to pass in background information

2012-03-14 Thread Chris White
If your class implements the Configurable interface, Hadoop will call the
setConf method after creating the instance. Look in the source code for
ReflectionUtils.newInstance for more info.
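
For illustration, here is a minimal sketch of a Partitioner that picks up its
background information this way; the property name "my.partitioner.param" is
made up for this example:

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConfiguredPartitioner extends Partitioner<Text, IntWritable>
    implements Configurable {

  private Configuration conf;
  private String param;

  @Override
  public void setConf(Configuration conf) {
    // Called by ReflectionUtils.newInstance right after instantiation.
    this.conf = conf;
    this.param = conf.get("my.partitioner.param", "default");
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Use the configured value in the partitioning decision.
    return ((key.toString() + param).hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

The driver would set the value with conf.set("my.partitioner.param", ...)
before submitting the job.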
On Mar 14, 2012 2:31 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

 I am using the new org.apache.hadoop.mapreduce.Partitioner class. However,
 I need to pass it some background information. How can I do this?

 In the old org.apache.hadoop.mapred.Partitioner class (now deprecated), the
 class extends JobConfigurable, and it seems the hook to pass in any
 background data is the JobConfigurable.configure(JobConf job) method.

 I thought that if I sub-classed org.apache.hadoop.mapreduce.Partitioner, I
 could pass in the background information; however, the
 org.apache.hadoop.mapreduce.Job class only has a
 setPartitionerClass(Class<? extends Partitioner>) method.

 All my development has been in the new mapreduce package, and I would
 definitely like to stick with the new API/package. Any help is
 appreciated.



Re: questions regarding hadoop version 1.0

2012-03-14 Thread Joey Echeverria
JobTracker and TaskTracker. YARN is only in 0.23 and later releases. 1.0.x is
from the 0.20.x line of releases.

-Joey



On Mar 14, 2012, at 7:00, arindam choudhury arindam732...@gmail.com wrote:

 Hi,
 
 Does Hadoop 1.0.1 use YARN or the TaskTracker/JobTracker model?
 
 Regards,
 Arindam


RE: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Tony Burton
Hi - sorry to bump this, but I'm still having trouble resolving it.

Essentially the question is: If I create my own InputFormat by subclassing 
TextInputFormat, does the subclass have to handle its own streaming of 
compressed data? If so, can anyone point me at an example where this is done?

Thanks!

Tony







-Original Message-
From: Tony Burton [mailto:tbur...@sportingindex.com] 
Sent: 12 March 2012 18:05
To: common-user@hadoop.apache.org
Subject: decompressing bzip2 data with a custom InputFormat

 Hi,

I'm setting up a map-only job that reads large bzip2-compressed data files, 
parses the XML and writes out the same data in plain text format. My XML 
InputFormat extends TextInputFormat and has a RecordReader based upon the one 
you can see at http://xmlandhadoop.blogspot.com/ (my version of it works great 
for uncompressed XML input data). For compressed data, I've added 
io.compression.codecs to my core-site.xml and set it to 
o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.

Have I forgotten something basic when running a Hadoop job to read compressed 
data? Or, given that I've written my own InputFormat, should I be using an 
InputStream that can carry out the decompression itself?

Thanks

Tony
 


Re: decompressing bzip2 data with a custom InputFormat

2012-03-14 Thread Joey Echeverria
Yes, you have to deal with the compression yourself. Usually, you'll load the
compression codec in your RecordReader. You can see an example of how
TextInputFormat's LineRecordReader does it:

https://github.com/apache/hadoop-common/blob/release-1.0.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
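
As a rough sketch (not the LineRecordReader source itself), a custom
RecordReader can detect and open a codec along these lines; the wrapper class
here is illustrative only:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecAwareOpen {
  // Opens the split's file, wrapping it in a decompressing stream when a
  // codec matches the file extension (e.g. .bz2 -> BZip2Codec).
  public static InputStream open(Configuration conf, Path file) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    FSDataInputStream raw = fs.open(file);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(file); // null if no codec matches
    return (codec != null) ? codec.createInputStream(raw) : raw;
  }
}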

-Joey

On Wed, Mar 14, 2012 at 11:08 AM, Tony Burton tbur...@sportingindex.com wrote:
 Hi - sorry to bump this, but I'm still having trouble resolving it.

 Essentially the question is: If I create my own InputFormat by subclassing 
 TextInputFormat, does the subclass have to handle its own streaming of 
 compressed data? If so, can anyone point me at an example where this is done?

 Thanks!

 Tony







 -Original Message-
 From: Tony Burton [mailto:tbur...@sportingindex.com]
 Sent: 12 March 2012 18:05
 To: common-user@hadoop.apache.org
 Subject: decompressing bzip2 data with a custom InputFormat

  Hi,

 I'm setting up a map-only job that reads large bzip2-compressed data files, 
 parses the XML and writes out the same data in plain text format. My XML 
 InputFormat extends TextInputFormat and has a RecordReader based upon the one 
 you can see at http://xmlandhadoop.blogspot.com/ (my version of it works 
 great for uncompressed XML input data). For compressed data, I've added 
 io.compression.codecs to my core-site.xml and set it to 
 o.a.h.io.compress.BZip2Codec. I'm using Hadoop 0.20.2.

 Have I forgotten something basic when running a Hadoop job to read compressed 
 data? Or, given that I've written my own InputFormat, should I be using an 
 InputStream that can carry out the decompression itself?

 Thanks

 Tony




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Capacity Scheduler APIs

2012-03-14 Thread hdev ml
Hi all,

Are there any Capacity Scheduler APIs that I can use?

For example: adding and removing queues, tuning properties on the fly, and so on.

Any help is appreciated.

Thanks

Harshad


Re: Using a combiner

2012-03-14 Thread Prashant Kommireddi
It is a function of the number of spills on the map side, and I believe
the default is 3. So for every 3 times data is spilled, the combiner is
run. This number is configurable.
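
As a sketch, the threshold can be set through the job configuration; I believe
the property in 0.20/1.x is min.num.spills.for.combine (the combiner class
below is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerConfig {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Re-run the combiner during the final merge once at least 3 spill
    // files exist on disk.
    conf.setInt("min.num.spills.for.combine", 3);
    Job job = new Job(conf, "combiner-example");
    // job.setCombinerClass(MyCombiner.class); // MyCombiner: your Reducer subclass
    return job;
  }
}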

Sent from my iPhone

On Mar 14, 2012, at 3:26 PM, Gayatri Rao rgayat...@gmail.com wrote:

 Hi all,

 I have a quick query on using a combiner in an MR job. Is it true that the
 framework decides whether or not the combiner gets called?
 Can anyone please give more information on how this is done?

 Thanks,
 Gayatri


Re: Partition classes, how to pass in background information

2012-03-14 Thread Jane Wayne
Thanks Chris! That worked!

On Wed, Mar 14, 2012 at 6:06 AM, Chris White chriswhite...@gmail.comwrote:

 If your class implements the Configurable interface, Hadoop will call the
 setConf method after creating the instance. Look in the source code for
 ReflectionUtils.newInstance for more info.
 On Mar 14, 2012 2:31 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

  I am using the new org.apache.hadoop.mapreduce.Partitioner class. However,
  I need to pass it some background information. How can I do this?

  In the old org.apache.hadoop.mapred.Partitioner class (now deprecated), the
  class extends JobConfigurable, and it seems the hook to pass in any
  background data is the JobConfigurable.configure(JobConf job) method.

  I thought that if I sub-classed org.apache.hadoop.mapreduce.Partitioner, I
  could pass in the background information; however, the
  org.apache.hadoop.mapreduce.Job class only has a
  setPartitionerClass(Class<? extends Partitioner>) method.

  All my development has been in the new mapreduce package, and I would
  definitely like to stick with the new API/package. Any help is
  appreciated.



Re: does hadoop always respect setNumReduceTasks?

2012-03-14 Thread Jane Wayne
Thanks Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog goks...@gmail.com wrote:

 Instead of String.hashCode() you can use an MD5 hash code generator.
 In the wild, MD5 has never been known to produce a duplicate by accident.
 (It has been deliberately broken, but that's not relevant here.)

 http://snippets.dzone.com/posts/show/3686
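
 As a sketch of that idea (the helper name is made up; MD5 collisions are
 still theoretically possible, just never observed by accident):

 import java.math.BigInteger;
 import java.nio.charset.Charset;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;

 public class WordIds {
   // Folds the 128-bit MD5 digest of a word into a non-negative int, using
   // the same masking trick as the Mahout snippet quoted below.
   public static int md5Index(String word) throws NoSuchAlgorithmException {
     MessageDigest md = MessageDigest.getInstance("MD5");
     byte[] digest = md.digest(word.getBytes(Charset.forName("UTF-8")));
     return 0x7FFFFFFF & new BigInteger(digest).intValue();
   }
 }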

 I think the Partitioner class guarantees that you will have multiple
 reducers.

 On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne jane.wayne2...@gmail.com
 wrote:
  I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?

  As I am emitting items from the mapper, I expect/desire only 1 reducer to
  get these items, because I want to assign each key of the key-value input
  pair a unique integer id. If I had 1 reducer, I could just keep a local
  counter (with respect to the reducer instance) and increment it.

  On my local Hadoop cluster, I noticed that most, if not all, of my jobs
  have only 1 reducer, regardless of whether or not I set
  Job.setNumReduceTasks(int).

  However, as soon as I moved the code onto Amazon's Elastic MapReduce (EMR),
  I noticed that there are multiple reducers. If I set the number of reduce
  tasks to 1, is this always guaranteed? I ask because I don't know if there
  is a gotcha like the combiner (where it may or may not run at all).

  Also, it looks like it might not be a good idea to have just 1 reducer (it
  won't scale). It is most likely better to have more than one reducer, but
  in that case, I lose the ability to assign unique numbers to the key-value
  pairs coming in. Is there a design pattern out there that addresses this
  issue?

  My mapper/reducer key-value pair signatures look something like the
  following:

  mapper(Text, Text, Text, IntWritable)
  reducer(Text, IntWritable, IntWritable, Text)

  The mapper reads a sequence file whose key-value pairs are of type Text and
  Text. I then emit Text (let's say a word) and IntWritable (let's say the
  frequency of the word).

  The reducer gets the word and its frequencies, and then assigns the word an
  integer id. It emits IntWritable (the id) and Text (the word).

  I remember seeing code from Mahout's API where they assign integer ids to
  items. The items were already given an id of type long. The conversion they
  make is as follows:

  public static int idToIndex(long id) {
    return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
  }

  Is there something equivalent for Text or a word? I was thinking about
  simply taking the hash value of the string/word, but of course, different
  strings can map to the same hash value.



 --
 Lance Norskog
 goks...@gmail.com



dynamic mapper?

2012-03-14 Thread robert
Suppose I want to generate a mapper class at run time and use that
class in my MapReduce job.

What is the best way to do this? Would I just have an extra scripted
step to pre-compile it and distribute with -libjars, or if I felt like
compiling it dynamically with for example JavaCompiler is there some
elegant way to distribute the class at run time?
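
For reference, a rough sketch of the compile-and-distribute idea; all paths
and names here are made up, and the jar packaging step is elided:

import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class DynamicMapperDriver {
  public static void compileAndAttach(Configuration conf) throws Exception {
    // Requires a JDK at run time; ToolProvider returns null under a plain JRE.
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    if (compiler == null) {
      throw new IllegalStateException("no system compiler available");
    }
    int status = compiler.run(null, null, null, "/tmp/gen/DynamicMapper.java");
    if (status != 0) {
      throw new IllegalStateException("compilation failed");
    }
    // Package /tmp/gen/*.class into a jar (e.g. with java.util.jar), copy the
    // jar into HDFS, then add it to the task classpath -- the programmatic
    // equivalent of -libjars:
    DistributedCache.addFileToClassPath(new Path("/user/me/dynamic.jar"), conf);
  }
}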



SequenceFile split question

2012-03-14 Thread Mohit Anchlia
I have a client program that creates a SequenceFile, which essentially merges
small files into a big file. I was wondering how the sequence file data is
split across nodes. When I start, the sequence file is empty. Does it get
split when it reaches dfs.block.size? If so, does that mean I am always
writing to just one node at any given point in time?

If I start a new client writing a new sequence file, is there a way to
select a different data node?
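
For context, a minimal sketch of the kind of merge writer the question
describes (the path and key/value choices are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/merged/big.seq"), Text.class, BytesWritable.class);
    // For each small file: key = its name, value = its contents, e.g.
    // writer.append(new Text(name), new BytesWritable(bytes));
    writer.close();
  }
}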