Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Mohit Anchlia
You could always write your own properties file and read it as resource.

On Tue, Sep 25, 2012 at 12:10 AM, Hemanth Yamijala yhema...@gmail.com wrote:

 By java environment variables, do you mean the ones passed as
 -Dkey=value ? That's one way of passing them. I suppose another way is
 to have a client side site configuration (like mapred-site.xml) that
 is in the classpath of the client app.
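 A minimal sketch of that first option (hedged, not from the thread; the driver
 class and property name below are made up): a ToolRunner-based driver lets
 -D key=value arguments flow into the job's Configuration.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
      Configuration conf = getConf();              // already holds any -Dkey=value flags
      System.out.println(conf.get("pi.nsamples")); // made-up property name
      // ... build and submit the job from this conf ...
      return 0;
    }

    public static void main(String[] args) throws Exception {
      // e.g.: hadoop jar myjob.jar MyDriver -D pi.nsamples=10 <other args>
      System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
  }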

 Thanks
 Hemanth

 On Tue, Sep 25, 2012 at 12:20 AM, Varad Meru meru.va...@gmail.com wrote:
  Thanks Hemanth,
 
  But in general, if we want to pass arguments to any job (not only
  PiEstimator from examples-jar) and submit the Job to the Job queue
  scheduler, by the looks of it, we might always need to use the java
  environment variables only.
 
  Is my above assumption correct?
 
  Thanks,
  Varad
 
  On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.com
 wrote:
 
  Varad,
 
  Looking at the code for the PiEstimator class which implements the
  'pi' example, the two arguments are mandatory and are used *before*
  the job is submitted for execution - i.e on the client side. In
  particular, one of them (nSamples) is used not by the MapReduce job,
  but by the client code (i.e. PiEstimator) to generate some input.
 
  Hence, I believe all of this additional work that is being done by the
  PiEstimator class will be bypassed if we directly use the job -submit
  command. In other words, I don't think these two ways of running the
  job:
 
  - using the hadoop jar examples pi
  - using hadoop job -submit
 
  are equivalent.
 
  As a general answer to your question though, if additional parameters
  are used by the Mappers or reducers, then they will generally be set
  as additional job specific configuration items. So, one way of using
  them with the job -submit command will be to find out the specific
  names of the configuration items (from code, or some other
  documentation), and include them in the job.xml used when submitting
  the job.
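  As a hedged illustration of that (not from the thread; the property name
  "my.job.nsamples" is made up), a mapper would pick such a configuration item
  up like this:

   import java.io.IOException;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;

   public class SampleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
     private int nSamples;

     @Override
     protected void setup(Context context) {
       // A value set in job.xml (or via -D on the client) is visible here.
       nSamples = context.getConfiguration().getInt("my.job.nsamples", 10);
     }

     @Override
     protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
       context.write(new Text(value.toString()), new LongWritable(nSamples));
     }
   }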
 
  Thanks
  Hemanth
 
  On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com
 wrote:
   Hi,
  
   I want to run the PiEstimator example using the following command
  
   $hadoop job -submit pieestimatorconf.xml
  
   which contains all the info required by hadoop to run the job. E.g.
 the
   input file location, the output file location and other details.
  
  
 
  <property><name>mapred.jar</name><value>file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
  <property><name>mapred.map.tasks</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>2</value></property>
  ...
  <property><name>mapred.job.name</name><value>PiEstimator</value></property>
  <property><name>mapred.output.dir</name><value>file:Users/varadmeru/Work/out</value></property>
  
   Now, as we know, to run the PiEstimator we can also use the following
   command:

   $hadoop jar hadoop-examples.1.0.3 pi 5 10

   where 5 and 10 are the arguments to the main class of the PiEstimator. How
   can I pass the same arguments (5 and 10) using the job -submit command,
   through the conf file or any other way, without changing the code of the
   examples to reflect the use of environment variables?
  
   Thanks in advance,
   Varad
  
   -
   Varad Meru
   Software Engineer,
   Business Intelligence and Analytics,
   Persistent Systems and Solutions Ltd.,
   Pune, India.
 



Re: Number of Maps running more than expected

2012-08-16 Thread Mohit Anchlia
It would be helpful to see some statistics out of both the jobs like bytes
read, written number of errors etc.

On Thu, Aug 16, 2012 at 8:02 PM, Raj Vishwanathan rajv...@yahoo.com wrote:

 You probably have speculative execution on. Extra map and reduce tasks
 are run in case some of them fail or run slowly.
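 If that turns out to be the cause and the extra attempts are unwanted, a
 hedged sketch of turning speculative execution off for a single job (old
 mapred API; the class name is made up):

  import org.apache.hadoop.mapred.JobConf;

  public class NoSpeculationSketch {
    public static JobConf configure(JobConf conf) {
      // Same effect as setting mapred.map.tasks.speculative.execution and
      // mapred.reduce.tasks.speculative.execution to false for this job only.
      conf.setMapSpeculativeExecution(false);
      conf.setReduceSpeculativeExecution(false);
      return conf;
    }
  }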

 Raj


 Sent from my iPad
 Please excuse the typos.

 On Aug 16, 2012, at 11:36 AM, in.abdul in.ab...@gmail.com wrote:

  Hi Gaurav,
  The number of maps does not depend on the number of blocks; it depends on
  the number of input splits. If you have 100 GB of data and 10 splits, then
  you will see only 10 maps.
 
  Please correct me if I am wrong.
 
  Thanks and regards,
  Syed abdul kather
  On Aug 16, 2012 7:44 PM, Gaurav Dasgupta [via Lucene] 
  ml-node+s472066n4001631...@n3.nabble.com wrote:
 
  Hi users,
 
  I am working on a CDH3 cluster of 12 nodes (Task Trackers running on all
  the 12 nodes and 1 node running the Job Tracker).
  In order to perform a WordCount benchmark test, I did the following:
 
- Executed RandomTextWriter first to create 100 GB data (Note that I
have changed the test.randomtextwrite.total_bytes parameter only,
 rest
all are kept default).
- Next, executed the WordCount program for that 100 GB dataset.
 
  The Block Size in hdfs-site.xml is set as 128 MB. Now, according to
 my
  calculation, total number of Maps to be executed by the wordcount job
  should be 100 GB / 128 MB or 102400 MB / 128 MB = 800.
  But when I am executing the job, it is running a total number of 900
 Maps,
  i.e., 100 extra.
  So, why this extra number of Maps? Although, my job is completing
  successfully without any error.
 
  Again, if I don't execute the RandomTextWriter job to create data for
  my wordcount, but rather put my own 100 GB text file in HDFS and run
  WordCount, I can then see that the number of Maps is equivalent to my
  calculation, i.e., 800.
 
  Can anyone tell me why this odd behaviour of Hadoop regarding the number
  of Maps for WordCount only when the dataset is generated by
  RandomTextWriter? And what is the purpose of these extra number of Maps?
 
  Regards,
  Gaurav Dasgupta
 
 
 
 
 
 
 
 
  -
  THANKS AND REGARDS,
  SYED ABDUL KATHER



Re: Basic Question

2012-08-07 Thread Mohit Anchlia
On Tue, Aug 7, 2012 at 11:33 AM, Harsh J ha...@cloudera.com wrote:

 Each write call registers (writes) a KV pair to the output. The output
 collector does not look for similarities nor does it try to de-dupe
 them, and even if the object is the same, its value is copied so that
 doesn't matter.

 So you will get two KV pairs in your output - since duplication is
 allowed and is normal in several MR cases. Think of wordcount, where a
 map() call may emit lots of ("is", 1) pairs if there are multiple "is"
 tokens in the line it processes, and can use set() calls to its benefit
 to avoid creating too many objects.
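 A hedged sketch of that reuse pattern (new-API mapper, not from the thread):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();              // reused across all map() calls
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);            // same object, new contents
        context.write(word, one);   // the framework copies the bytes on write,
      }                             // so duplicate keys are all preserved
    }
  }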


Thanks!


 On Tue, Aug 7, 2012 at 11:56 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  In Mapper I often use a global Text object and throughout the map processing
  I just call set() on it. My question is, what happens if the collector
  receives the same byte array value? Does the last one overwrite the value in
  the collector? So if I did
 
   Text zip = new Text();
   zip.set("9099");
   collector.write(zip, value);
   zip.set("9099");
   collector.write(zip, value1);
 
  Should I expect to receive both values in the reducer or just one?



 --
 Harsh J



Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
I am trying to write a test on the local file system, but this test keeps
picking up the xml files on the classpath even though I am setting a different
Configuration object. Is there a way for me to override it? I thought the way I
am doing it overwrites the configuration, but it doesn't seem to be working:

 @Test
 public void testOnLocalFS() throws Exception{
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  Path input = new Path("geoinput/geo.dat");
  Path output = new Path("geooutput/");
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true);

  log.info("Here");
  GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
  configRunner.setConf(conf);
  int exitCode = configRunner.run(new String[]{input.toString(),
output.toString()});
  Assert.assertEquals(exitCode, 0);
 }


Re: Setting Configuration for local file:///

2012-08-07 Thread Mohit Anchlia
On Tue, Aug 7, 2012 at 12:50 PM, Harsh J ha...@cloudera.com wrote:

 What is GeoLookupConfigRunner and how do you utilize the setConf(conf)
 object within it?


Thanks for the pointer - I wasn't setting my JobConf object with the conf that
I passed. Just one more related question: if I use JobConf conf = new
JobConf(getConf()) and I don't pass in any configuration, is the data from the
xml files on the classpath used then? I want this to work for all the scenarios.
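A hedged sketch of the pattern under discussion (a Configured/Tool-style runner;
names are made up). Passing getConf() into the JobConf keeps any externally
supplied Configuration, while a bare new JobConf()/new Configuration() falls
back to the *-site.xml files found on the classpath:

 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.util.Tool;

 public class GeoLookupRunnerSketch extends Configured implements Tool {
   @Override
   public int run(String[] args) throws Exception {
     // Uses whatever Configuration the caller handed in via setConf()
     // (e.g. the test's file:/// conf). If getConf() was built as a plain
     // "new Configuration()", its values come from the classpath xml files.
     JobConf job = new JobConf(getConf(), GeoLookupRunnerSketch.class);
     // ... set input/output paths, mapper/reducer, then JobClient.runJob(job) ...
     return 0;
   }
 }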



 On Wed, Aug 8, 2012 at 1:10 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I am trying to write a test on the local file system, but this test keeps
  picking up the xml files on the classpath even though I am setting a
  different Configuration object. Is there a way for me to override it? I
  thought the way I am doing it overwrites the configuration, but it doesn't
  seem to be working:
 
    @Test
    public void testOnLocalFS() throws Exception{
     Configuration conf = new Configuration();
     conf.set("fs.default.name", "file:///");
     conf.set("mapred.job.tracker", "local");
     Path input = new Path("geoinput/geo.dat");
     Path output = new Path("geooutput/");
     FileSystem fs = FileSystem.getLocal(conf);
     fs.delete(output, true);
 
     log.info("Here");
     GeoLookupConfigRunner configRunner = new GeoLookupConfigRunner();
     configRunner.setConf(conf);
     int exitCode = configRunner.run(new String[]{input.toString(),
   output.toString()});
     Assert.assertEquals(exitCode, 0);
    }



 --
 Harsh J



Local jobtracker in test env?

2012-08-07 Thread Mohit Anchlia
I just wrote a test where fs.default.name is file:/// and
mapred.job.tracker is set to local. The test ran fine, and I can see that the
mapper and reducer were invoked, but what I am trying to understand is how this
ran without specifying a job tracker port, and on which port the task tracker
connected to the job tracker. It's not clear from the output below.

Also, what's the difference between this and bringing up a MiniDFS cluster?

INFO  org.apache.hadoop.mapred.FileInputFormat [main]: Total input paths to process : 1
INFO  org.apache.hadoop.mapred.JobClient [main]: Running job: job_local_0001
INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using ResourceCalculatorPlugin : null
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: numReduceTasks: 1
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: io.sort.mb = 100
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: data buffer = 79691776/99614720
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: record buffer = 262144/327680
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 92127
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: zip 1
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Starting flush of map output
INFO  org.apache.hadoop.mapred.MapTask [Thread-11]: Finished spill 0
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geoinput/geo.dat:0+18
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_m_00_0' done.
INFO  org.apache.hadoop.mapred.Task [Thread-11]:  Using ResourceCalculatorPlugin : null
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Merging 1 sorted segments
INFO  org.apache.hadoop.mapred.Merger [Thread-11]: Down to the last merge-pass, with 1 segments left of total size: 26 bytes
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Inside reduce
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [Thread-11]: Outside reduce
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task:attempt_local_0001_r_00_0 is done. And is in the process of commiting
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]:
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task attempt_local_0001_r_00_0 is allowed to commit now
INFO  org.apache.hadoop.mapred.FileOutputCommitter [Thread-11]: Saved output of task 'attempt_local_0001_r_00_0' to file:/c:/upb/dp/manchlia-dp/depot/services/data-platform/trunk/analytics/geooutput
INFO  org.apache.hadoop.mapred.LocalJobRunner [Thread-11]: reduce > reduce
INFO  org.apache.hadoop.mapred.Task [Thread-11]: Task 'attempt_local_0001_r_00_0' done.
INFO  org.apache.hadoop.mapred.JobClient [main]:  map 100% reduce 100%
INFO  org.apache.hadoop.mapred.JobClient [main]: Job complete: job_local_0001
INFO  org.apache.hadoop.mapred.JobClient [main]: Counters: 15
INFO  org.apache.hadoop.mapred.JobClient [main]:   FileSystemCounters
INFO  org.apache.hadoop.mapred.JobClient [main]:     FILE_BYTES_READ=458
INFO  org.apache.hadoop.mapred.JobClient [main]:     FILE_BYTES_WRITTEN=96110
INFO  org.apache.hadoop.mapred.JobClient [main]:   Map-Reduce Framework
INFO  org.apache.hadoop.mapred.JobClient [main]:     Map input records=2
INFO  org.apache.hadoop.mapred.JobClient [main]:     Reduce shuffle bytes=0
INFO  org.apache.hadoop.mapred.JobClient [main]:     Spilled Records=4
INFO  org.apache.hadoop.mapred.JobClient [main]:     Map output bytes=20
INFO  org.apache.hadoop.mapred.JobClient [main]:     Total committed heap usage (bytes)=321527808
INFO  org.apache.hadoop.mapred.JobClient [main]:     Map input bytes=18
INFO  org.apache.hadoop.mapred.JobClient [main]:     SPLIT_RAW_BYTES=142
INFO  org.apache.hadoop.mapred.JobClient [main]:     Combine input records=0
INFO  org.apache.hadoop.mapred.JobClient [main]:     Reduce input records=2
INFO  org.apache.hadoop.mapred.JobClient [main]:     Reduce input groups=1
INFO  org.apache.hadoop.mapred.JobClient [main]:     Combine output records=0
INFO  org.apache.hadoop.mapred.JobClient [main]:     Reduce output records=1
INFO  org.apache.hadoop.mapred.JobClient [main]:     Map output records=2
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Inside reduce
INFO  com.i.cg.services.dp.analytics.hadoop.mapred.GeoLookup [main]: Outside reduce
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.547 sec
Results :
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0


Re: Avro

2012-08-05 Thread Mohit Anchlia
On Sat, Aug 4, 2012 at 11:43 PM, Nitin Kesarwani bumble@gmail.com wrote:

 Mohit,

 You can use this patch to suit your need:
 https://issues.apache.org/jira/browse/PIG-2579

 New fields in Avro schema descriptor file need to have a non-null default
 value. Hence, using the new schema file, you should be able to read older
 data as well. Try it out. It is very straight forward.

 Hope this helps!
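 A minimal, self-contained sketch of that behaviour (not from the thread;
 schema and field names are made up): a record written with an old schema is
 read back with a new reader schema whose extra field carries a default.

  import java.io.ByteArrayOutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryEncoder;
  import org.apache.avro.io.DecoderFactory;
  import org.apache.avro.io.EncoderFactory;

  public class AvroEvolutionSketch {
    public static void main(String[] args) throws Exception {
      Schema oldSchema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"}]}");
      Schema newSchema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"unknown\"}]}");

      // Serialize a record using the old (writer) schema.
      GenericRecord rec = new GenericData.Record(oldSchema);
      rec.put("url", "http://example.com");
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
      new GenericDatumWriter<GenericRecord>(oldSchema).write(rec, enc);
      enc.flush();

      // Deserialize with (writer, reader) schemas; the new field gets its default.
      GenericDatumReader<GenericRecord> reader =
          new GenericDatumReader<GenericRecord>(oldSchema, newSchema);
      GenericRecord decoded =
          reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
      System.out.println(decoded); // prints the url plus referrer="unknown"
    }
  }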


Thanks! I am new to Avro - what's the best place to see some examples of how
Avro deals with schema changes? I am trying to find some examples.


 On Sun, Aug 5, 2012 at 12:01 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 I've heard that Avro provides a good way of dealing with changing schemas.
  I am not sure how it could be done without keeping some kind of structure
  along with the data. Are there any good examples and documentation that I
  can look at?
 

 -N



Compression and Decompression

2012-07-05 Thread Mohit Anchlia
Is the compression done on the client side or on the server side? If I run
hadoop fs -text, is the client decompressing the file for me?


Dealing with changing file format

2012-07-02 Thread Mohit Anchlia
I am wondering what's the right way to go about designing the reading of input
and output where the file format may change over time. For instance, we might
start with field1,field2,field3 but at some point add a new field4 to the
input. What's the best way to deal with such scenarios? Keep a timestamped
catalog of changes?


Re: Sync and Data Replication

2012-06-10 Thread Mohit Anchlia
On Sun, Jun 10, 2012 at 9:39 AM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 On Sat, Jun 9, 2012 at 11:11 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Thanks Harsh for detailed info. It clears things up. Only thing from
 those
  page is concerning is what happens when client crashes. It says you could
  lose upto a block worth of information. Is this still true given that NN
  would auto close the file?

 Where does it say this exactly? It is true that immediate readers will
 not get the last block (as it remains open and uncommitted), but once
 the lease recovery kicks in the file is closed successfully and the
 last block is indeed made available, so there's no 'data loss'.


I saw it in the "Coherency Model - consequences for application design"
paragraph.

Thanks for the information. It at least helps me in that I don't have to
worry about data loss when the file is not cleanly closed.


  Is it a good practice to reduce NN default value so that it auto-closes
  before 1 hr.

 I've not seen people do this/need to do this. Most don't run into such
 a situation and it is vital to properly close() files or sync() on
 file streams before making it available to readers. HBase manages open
 files during WAL-recovery using lightweight recoverLease APIs that
 were added for its benefit, so it doesn't need to wait for an hour for
 WALs to close and recover data.

 --
 Harsh J



Re: Sync and Data Replication

2012-06-09 Thread Mohit Anchlia
Thanks Harsh for the detailed info. It clears things up. The only concerning
thing from that page is what happens when the client crashes. It says you could
lose up to a block worth of information. Is this still true given that the NN
would auto-close the file?

Is it a good practice to reduce the NN default value so that it auto-closes
before 1 hr?

Regarding the OS cache, I think it should be ok since the chance of losing all
replica nodes at the same time is low.
On Sat, Jun 9, 2012 at 5:13 AM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

  In this scenario is data also replicated as defined by the replication
 factor to other nodes as well? I am wondering if at this point if crash
 occurs do I have data in other nodes?

 What kind of crash are you talking about here? A client crash or a
 cluster crash? If a cluster, is the loss you're thinking of one DN or
 all the replicating DNs?

 If client fails to close a file due to a crash, it is auto-closed
 later (default is one hour) by the NameNode and whatever the client
 successfully wrote (i.e. into its last block) is then made available
 to readers at that point. If the client synced, then its last sync
 point is always available to readers and whatever it didn't sync is
 made available when the file is closed later by the NN. For DN
 failures, read on.

 Replication in 1.x/0.20.x is done via pipelines. Its done regardless
 of sync() calls. All write packets are indeed sent to and acknowledged
 by each DN in the constructed pipeline as the write progresses. For a
 good diagram on the sequence here, see Figure 3.3 | Page 66 | Chapter
 3: The Hadoop Distributed Filesystem, in Tom's Hadoop: The Definitive
 Guide (2nd ed. page nos. Gotta get 3rd ed. soon :))

 The sync behavior is further explained under the 'Coherency Model'
 title at Page 68 | Chapter 3: The Hadoop Distributed Filesystem of the
 same book. Think of sync() more as a checkpoint done over the write
 pipeline, such that new readers can read the length of synced bytes
 immediately and that they are guaranteed to be outside of the DN
 application (JVM) buffers (i.e. flushed).

 Some further notes, for general info: In 0.20.x/1.x releases, there's
 no hard-guarantee that the write buffer flushing done via sync ensures
 the data went to the *disk*. It may remain in the OS buffers (a
 feature in OSes, for performance). This is cause we do not do an
 fsync() (i.e. calling force on the FileChannel for the block and
 metadata outputs), but rather just an output stream flush. In the
 future, via 2.0.1-alpha release (soon to come at this point) and
 onwards, the specific call hsync() will ensure that this is not the
 case.

 However, if you are OK with the OS buffers feature/caveat and
 primarily need syncing not for reliability but for readers, you may
 use the call hflush() and save on performance. One place where hsync()
 is to be preferred instead of hflush() is where you use WALs (for data
 reliability), and HBase is one such application. With hsync(), HBase
 can survive potential failures caused by major power failure cases
 (among others).
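 A minimal sketch of those calls (hedged; the path is made up). On 1.x the
 flush call on FSDataOutputStream is sync(); hflush()/hsync() are the 2.x names:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SyncSketch {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FSDataOutputStream out = fs.create(new Path("/tmp/wal.log"));
      out.writeBytes("record-1\n");
      // 1.x: checkpoint the write pipeline so new readers can see the bytes
      // written so far (the data may still sit in datanode OS buffers).
      out.sync();
      out.writeBytes("record-2\n");
      // 2.x equivalents (not available on 1.x, hence commented out here):
      // out.hflush();  // same visibility guarantee as the old sync()
      // out.hsync();   // additionally forces block data to disk on the DNs
      out.close();
    }
  }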

 Let us know if this clears it up for you!

 On Sat, Jun 9, 2012 at 4:58 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I am wondering the role of sync in replication of data to other nodes.
 Say
  client writes a line to a file in Hadoop, at this point file handle is
 open
  and sync has not been called. In this scenario is data also replicated as
  defined by the replication factor to other nodes as well? I am wondering
 if
  at this point if crash occurs do I have data in other nodes?



 --
 Harsh J



Sync and Data Replication

2012-06-08 Thread Mohit Anchlia
I am wondering the role of sync in replication of data to other nodes. Say
client writes a line to a file in Hadoop, at this point file handle is open
and sync has not been called. In this scenario is data also replicated as
defined by the replication factor to other nodes as well? I am wondering if
at this point if crash occurs do I have data in other nodes?


Ideal file size

2012-06-06 Thread Mohit Anchlia
We have continuous flow of data into the sequence file. I am wondering what
would be the ideal file size before file gets rolled over. I know too many
small files are not good but could someone tell me what would be the ideal
size such that it doesn't overload NameNode.


Re: Ideal file size

2012-06-06 Thread Mohit Anchlia
On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote:

 Many factors to consider than just the size of the file.  . How long can
 you wait before you *have to* process the data?  5 minutes? 5 hours? 5
 days?  If you want good timeliness, you need to roll-over faster.  The
 longer you wait:

 1.  the lesser the load on the NN.
 2.  but the poorer the timeliness
 3.  and the larger chance of lost data  (ie, the data is not saved until
 the file is closed and rolled over, unless you want to sync() after every
 write)

 To begin with, I was going to use Flume and specify a rollover file size. I
understand the above parameters; I just want to ensure that too many small
files don't cause problems on the NameNode. For instance, there would be
times when we get GBs of data in an hour and at times only a few 100 MB. From
what Harsh, Edward and you have described, it doesn't cause issues with the
NameNode but rather increases processing time if there are too many
small files. Looks like I need to find that balance.

It would also be interesting to see how others solve this problem when not
using Flume.




 On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  We have continuous flow of data into the sequence file. I am wondering
 what
  would be the ideal file size before file gets rolled over. I know too
 many
  small files are not good but could someone tell me what would be the
 ideal
  size such that it doesn't overload NameNode.
 



Re: Writing click stream data to hadoop

2012-05-30 Thread Mohit Anchlia
On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).
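 A hedged sketch of "owning the output stream" (1.x API; the path and key/value
 types are made up): keep a handle on the FSDataOutputStream underneath the
 SequenceFile.Writer so it can be synced periodically.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class ClickStreamWriterSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream raw = fs.create(new Path("/clicks/current.seq"));
      SequenceFile.Writer writer = SequenceFile.createWriter(
          conf, raw, LongWritable.class, Text.class,
          SequenceFile.CompressionType.NONE, null);
      writer.append(new LongWritable(System.currentTimeMillis()), new Text("click-payload"));
      raw.sync();   // persist what has been written so far through the pipeline
      writer.close();
    }
  }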

 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.


Thanks Harsh. Does Flume also provide an API on top? I am getting this data
as HTTP calls; how would I go about using Flume with HTTP calls?


 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  We get click data through API calls. I now need to send this data to our
  hadoop environment. I am wondering if I could open one sequence file and
  write to it until it's of certain size. Once it's over the specified
 size I
  can close that file and open a new one. Is this a good approach?
 
  Only thing I worry about is what happens if the server crashes before I
 am
  able to cleanly close the file. Would I lose all previous data?



 --
 Harsh J



Re: Bad connect ack with firstBadLink

2012-05-04 Thread Mohit Anchlia
Please see:

http://hbase.apache.org/book.html#dfs.datanode.max.xcievers

On Fri, May 4, 2012 at 5:46 AM, madhu phatak phatak@gmail.com wrote:

 Hi,
 We are running a three node cluster. For the last two days, whenever we copy a
 file to hdfs, it throws "java.io.IOException: Bad connect ack with
 firstBadLink". I searched the net, but was not able to resolve the issue. The
 following is the stack trace from the datanode log:

 2012-05-04 18:08:08,868 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
 blk_-7520371350112346377_50118 received exception java.net.SocketException:
 Connection reset
 2012-05-04 18:08:08,869 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
 172.23.208.17:50010,
 storageID=DS-1340171424-172.23.208.17-50010-1334672673051, infoPort=50075,
 ipcPort=50020):DataXceiver
 java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at

 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262)
at

 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
at

 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
at

 org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
at

 org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
at

 org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:662)


 It will be great if some one can point to the direction how to solve this
 problem.

 --
 https://github.com/zinnia-phatak-dev/Nectar



Compressing map only output

2012-04-30 Thread Mohit Anchlia
Is there a way to compress map-only jobs, i.e. the map output that gets
stored on hdfs as part-m-* files? In Pig I used:

Would these work for plain map reduce jobs as well?


set output.compression.enabled true;

set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;


Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
Thanks! When I tried to search for this property I couldn't find it. Is
there a page that has a complete list of properties and their usage?

On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 Yes. These are hadoop properties - using set is just a way for Pig to set
 those properties in your job conf.
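 For reference, a hedged sketch of the same thing on a plain map-only job (old
 mapred API; the class name is made up):

  import org.apache.hadoop.io.compress.SnappyCodec;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class CompressMapOnlyOutputSketch {
    public static void configure(JobConf conf) {
      conf.setNumReduceTasks(0);                      // map-only job: part-m-* output
      FileOutputFormat.setCompressOutput(conf, true); // mapred.output.compress=true
      FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
                                                      // mapred.output.compression.codec
    }
  }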


 On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Is there a way to compress map only jobs to compress map output that gets
  stored on hdfs as part-m-* files? In pig I used :
 
  Would these work form plain map reduce jobs as well?
 
 
  set output.compression.enabled true;
 
  set output.compression.codec org.apache.hadoop.io.compress.SnappyCodec;
 



Re: Compressing map only output

2012-04-30 Thread Mohit Anchlia
Thanks a lot for the link!

On Mon, Apr 30, 2012 at 8:22 PM, Harsh J ha...@cloudera.com wrote:

 Hey Mohit,

 Most of what you need to know for jobs is available at
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

 A more complete, mostly unseparated list of config params are also
 available at:
 http://hadoop.apache.org/common/docs/current/mapred-default.html
 (core-default.html, hdfs-default.html)

 On Tue, May 1, 2012 at 6:36 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Thanks! When I tried to search for this property I couldn't find it. Is
  there a page that has complete list of properties and it's usage?
 
  On Mon, Apr 30, 2012 at 5:44 PM, Prashant Kommireddi 
 prash1...@gmail.com wrote:
 
  Yes. These are hadoop properties - using set is just a way for Pig to
 set
  those properties in your job conf.
 
 
  On Mon, Apr 30, 2012 at 5:25 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   Is there a way to compress map only jobs to compress map output that
 gets
   stored on hdfs as part-m-* files? In pig I used :
  
   Would these work form plain map reduce jobs as well?
  
  
   set output.compression.enabled true;
  
   set output.compression.codec
 org.apache.hadoop.io.compress.SnappyCodec;
  
 



 --
 Harsh J



Re: DFSClient error

2012-04-29 Thread Mohit Anchlia
Thanks for the quick response, appreciate it. It looks like this might be
the issue. But I am still trying to understand what is causing so many
threads in my situation. Is a thread created per block or per file? Because
if it's per file then it should not be more than 15.
My second question: I read around 5 .gz files in 5 separate processes. This
is constant, and the sizes of those 5 are roughly equivalent. So then why
does it fail only halfway and not right at the beginning? I am reading
around 400 files and it always fails when I reach around the 180th file.

What's the default value of xceivers? Does 4096 consume too much stack
size?
Thanks
On Sun, Apr 29, 2012 at 1:14 PM, Harsh J ha...@cloudera.com wrote:

 It sounds to me like you're running out of DN xceivers. Try the
 solution offered at
 http://hbase.apache.org/book.html#dfs.datanode.max.xcievers

 I.e., add:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

 To your DNs' config/hdfs-site.xml and restart the DNs.

 On Mon, Apr 30, 2012 at 1:35 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I even tried to lower number of parallel jobs even further but I still
 get
  these errors. Any suggestion on how to troubleshoot this issue would be
  very helpful. Should I run hadoop fsck? How do people troubleshoot such
  issues?? Does it sound like a bug?
 
  2012-04-27 14:37:42,921 [main] INFO
 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 1 map-reduce job(s) waiting for submission.
  2012-04-27 14:37:42,931 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Exception in createBlockOutputStream 125.18.62.199:50010
 java.io.EOFException
  2012-04-27 14:37:42,932 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Abandoning block blk_6343044536824463287_24619
  2012-04-27 14:37:42,932 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Excluding datanode 125.18.62.199:50010
  2012-04-27 14:37:42,935 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Exception in createBlockOutputStream 125.18.62.204:50010
 java.io.EOFException
  2012-04-27 14:37:42,935 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Abandoning block blk_2837215798109471362_24620
  2012-04-27 14:37:42,936 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Excluding datanode 125.18.62.204:50010
  2012-04-27 14:37:42,937 [main] INFO
 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  - 1 map-reduce job(s) waiting for submission.
  2012-04-27 14:37:42,939 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Exception in createBlockOutputStream 125.18.62.198:50010
 java.io.EOFException
  2012-04-27 14:37:42,939 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Abandoning block blk_2223489090936415027_24620
  2012-04-27 14:37:42,940 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Excluding datanode 125.18.62.198:50010
  2012-04-27 14:37:42,943 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Exception in createBlockOutputStream 125.18.62.197:50010
 java.io.EOFException
  2012-04-27 14:37:42,943 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Abandoning block blk_1265169201875643059_24620
  2012-04-27 14:37:42,944 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Excluding datanode 125.18.62.197:50010
  2012-04-27 14:37:42,945 [Thread-5] WARN
  org.apache.hadoop.hdfs.DFSClient -
  DataStreamer Exception: java.io.IOException: Unable to create new block.
 at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3446)
 at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2627)
 at
 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2822)
  2012-04-27 14:37:42,945 [Thread-5] WARN
  org.apache.hadoop.hdfs.DFSClient -
  Error Recovery for block blk_1265169201875643059_24620 bad datanode[0]
  nodes == null
  2012-04-27 14:37:42,945 [Thread-5] WARN
  org.apache.hadoop.hdfs.DFSClient -
  Could not get block locations. Source file
 
 /tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411/job.jar
  - Aborting...
  2012-04-27 14:37:42,945 [Thread-4] INFO
  org.apache.hadoop.mapred.JobClient
  - Cleaning up the staging area
 
 hdfs://dsdb1:54310/tmp/hadoop-hadoop/mapred/staging/hadoop/.staging/job_201204261707_0411
  2012-04-27 14:37:42,945 [Thread-4] ERROR
  org.apache.hadoop.security.UserGroupInformation -
  PriviledgedActionException as:hadoop (auth:SIMPLE)
  cause:java.io.EOFException
  2012-04-27 14:37:42,996 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Exception in createBlockOutputStream
  125.18.62.200:50010java.io.IOException: Bad connect ack with
  firstBadLink as
   125.18.62.198:50010
  2012-04-27 14:37:42,996 [Thread-5] INFO
  org.apache.hadoop.hdfs.DFSClient -
  Abandoning block blk_-7583284266913502018_24621
  2012-04-27 14:37:42,997 [Thread-5] INFO

Re: DFSClient error

2012-04-27 Thread Mohit Anchlia
After all the jobs fail I can't run anything. Once I restart the cluster I
am able to run other jobs with no problems, hadoop fs and other io
intensive jobs run just fine.

On Fri, Apr 27, 2012 at 3:12 PM, John George john...@yahoo-inc.com wrote:

 Can you run a regular 'hadoop fs' (put or ls or get) command?
 If yes, how about a wordcount example?
 '<path>/hadoop jar <path>/hadoop-*examples*.jar wordcount input output'


 -Original Message-
 From: Mohit Anchlia mohitanch...@gmail.com
 Reply-To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Date: Fri, 27 Apr 2012 14:36:49 -0700
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Subject: Re: DFSClient error

 I even tried to reduce number of jobs but didn't help. This is what I see:
 
 datanode logs:
 
 Initializing secure datanode resources
 Successfully obtained privileged resources (streaming port =
 ServerSocket[addr=/0.0.0.0,localport=50010] ) (http listener port =
 sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:50075])
 Starting regular datanode initialization
 26/04/2012 17:06:51 9858 jsvc.exec error: Service exit with a return value
 of 143
 
 userlogs:
 
 2012-04-26 19:35:22,801 WARN
 org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is
 available
 2012-04-26 19:35:22,801 INFO
 org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library
 loaded
 2012-04-26 19:35:22,808 INFO
 org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded 
 initialized native-zlib library
 2012-04-26 19:35:22,903 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
 connect to /125.18.62.197:50010, add to deadNodes and continue
 java.io.EOFException
 at java.io.DataInputStream.readShort(DataInputStream.java:298)
 at
 org.apache.hadoop.hdfs.DFSClient$RemoteBlockReader.newBlockReader(DFSClien
 t.java:1664)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockReader(DFSClient.j
 ava:2383)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java
 :2056)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2170)
 at java.io.DataInputStream.read(DataInputStream.java:132)
 at
 org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(Decompr
 essorStream.java:97)
 at
 org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorSt
 ream.java:87)
 at
 org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.j
 ava:75)
 at java.io.InputStream.read(InputStream.java:85)
 at
 org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
 at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
 at
 org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRe
 cordReader.java:114)
 at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:109)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordRead
 er.nextKeyValue(PigRecordReader.java:187)
 at
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapT
 ask.java:456)
 at
 org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.
 java:1157)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)
 2012-04-26 19:35:22,906 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
 connect to /125.18.62.204:50010, add to deadNodes and continue
 java.io.EOFException
 
 namenode logs:
 
 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker: Job
 job_201204261140_0244 added successfully for user 'hadoop' to queue
 'default'
 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobTracker:
 Initializing job_201204261140_0244
 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.AuditLogger:
 USER=hadoop  IP=125.18.62.196OPERATION=SUBMIT_JOB
 TARGET=job_201204261140_0244RESULT=SUCCESS
 2012-04-26 16:12:53,562 INFO org.apache.hadoop.mapred.JobInProgress:
 Initializing job_201204261140_0244
 2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Exception
 in
 createBlockOutputStream 125.18.62.198:50010 java.io.IOException: Bad
 connect ack with firstBadLink as 125.18.62.197:50010
 2012-04-26 16:12:53,581 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
 block blk_2499580289951080275_22499
 2012-04-26 16:12:53,582 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
 datanode 125.18.62.197:50010
 2012-04-26 16:12:53,594 INFO

Re: Design question

2012-04-26 Thread Mohit Anchlia
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I just wanted to check how people design their storage directories for
 data that is sent to the system continuously. For example, for a given
 functionality we get a data feed continuously written to a sequence file,
 which is then converted to a more structured format using map reduce and
 stored in tab-separated files. For such a continuous feed, what's the best
 way to organize directories and their names? Should it be based just on a
 timestamp, or something better that helps in organizing the data?

 Second part of the question: is it better to store output in sequence files
 so that we can take advantage of compression per record? This seems to be
 required since gzip/snappy compression of the entire file would launch only
 one map task.

 And the last question: when compressing a flat file, should it first be
 split into multiple files so that we get multiple mappers if we need to run
 another job on this file? LZO is another alternative but it requires
 additional configuration; is it preferred?

 Any articles or suggestions would be very helpful.



DFSClient error

2012-04-26 Thread Mohit Anchlia
I had 20 mappers in parallel reading 20 gz files and each file around
30-40MB data over 5 hadoop nodes and then writing to the analytics
database. Almost midway it started to get this error:


2012-04-26 16:13:53,723 [Thread-8] INFO org.apache.hadoop.hdfs.DFSClient -
Exception in createBlockOutputStream 17.18.62.192:50010 java.io.IOException:
Bad connect ack with firstBadLink as 17.18.62.191:50010

I am trying to look at the logs but they don't say much. What could be the
reason? We are in a pretty closed, reliable network and all machines are up.


Design question

2012-04-23 Thread Mohit Anchlia
I just wanted to check how people design their storage directories for
data that is sent to the system continuously. For example, for a given
functionality we get a data feed continuously written to a sequence file,
which is then converted to a more structured format using map reduce and
stored in tab-separated files. For such a continuous feed, what's the best
way to organize directories and their names? Should it be based just on a
timestamp, or something better that helps in organizing the data?

Second part of the question: is it better to store output in sequence files
so that we can take advantage of compression per record? This seems to be
required since gzip/snappy compression of the entire file would launch only
one map task.

And the last question: when compressing a flat file, should it first be
split into multiple files so that we get multiple mappers if we need to run
another job on this file? LZO is another alternative but it requires
additional configuration; is it preferred?

Any articles or suggestions would be very helpful.


Re: Get Current Block or Split ID, and using it, the Block Path

2012-04-08 Thread Mohit Anchlia
I think if you called getInputFormat on JobConf and then called getSplits
you would at least get the locations.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/InputSplit.html
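A hedged sketch of that suggestion (old mapred API; the input path comes from
the command line): ask the job's InputFormat for its splits and print the
hosts each split lives on.

 import java.util.Arrays;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.FileInputFormat;
 import org.apache.hadoop.mapred.InputSplit;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.TextInputFormat;

 public class SplitLocationsSketch {
   public static void main(String[] args) throws Exception {
     JobConf conf = new JobConf();
     conf.setInputFormat(TextInputFormat.class);
     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     InputSplit[] splits = conf.getInputFormat().getSplits(conf, 1);
     for (InputSplit split : splits) {
       System.out.println(split + " -> hosts: " + Arrays.toString(split.getLocations()));
     }
   }
 }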

On Sun, Apr 8, 2012 at 9:16 AM, Deepak Nettem deepaknet...@gmail.com wrote:

 Hi,

 Is it possible to get the 'id' of the currently executing split or block
 from within the mapper? Using this block Id / split id, I want to be able
 to query the namenode to get the names of hosts having that block / split,
 and the actual path to the data.

 I need this for some analytics that I'm doing. Is there a client API that
 allows doing this?  If not, what's the best way to do this?

 Best,
 Deepak Nettem



Re: Doubt from the book Definitive Guide

2012-04-05 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 Hi Mohit,

 What would be the advantage? Reducers in most cases read data from all
 the mappers. In the case where mappers were to write to HDFS, a
 reducer would still require to read data from other datanodes across
 the cluster.


Only advantage I was thinking of was that in some cases reducers might be
able to take advantage of data locality and avoid multiple HTTP calls, no?
Data is anyways written, so last merged file could go on HDFS instead of
local disk.
I am new to hadoop so just asking question to understand the rational
behind using local disk for final output.

 Prashant

 On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
 
  Hi Mohit,
 
  On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
  I am going through the chapter How mapreduce works and have some
  confusion:
 
  1) Below description of Mapper says that reducers get the output file
  using
  HTTP call. But the description under The Reduce Side doesn't
  specifically
  say if it's copied using HTTP. So first confusion: is the output copied
  from mapper -> reducer or from reducer -> mapper? And second, is the call
  http:// or hdfs://?
 
  The flow is simple as this:
  1. For M+R job, map completes its task after writing all partitions
  down into the tasktracker's local filesystem (under mapred.local.dir
  directories).
  2. Reducers fetch completion locations from events at JobTracker, and
  query the TaskTracker there to provide it the specific partition it
  needs, which is done over the TaskTracker's HTTP service (50060).
 
  So to clear things up - map doesn't send it to reduce, nor does reduce
  ask the actual map task. It is the task tracker itself that makes the
  bridge here.
 
  Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
  be over Netty connections. This would be much more faster and
  reliable.
 
  2) My understanding was that mapper output gets written to hdfs, since
  I've
  seen part-m-0 files in hdfs. If mapper output is written to HDFS
 then
  shouldn't reducers simply read it from hdfs instead of making http
 calls
  to
  tasktrackers location?
 
  A map-only job usually writes out to HDFS directly (no sorting done,
  cause no reducer is involved). If the job is a map+reduce one, the
  default output is collected to local filesystem for partitioning and
  sorting at map end, and eventually grouping at reduce end. Basically:
  Data you want to send to reducer from mapper goes to local FS for
  multiple actions to be performed on them, other data may directly go
  to HDFS.
 
  Reducers currently are scheduled pretty randomly but yes their
  scheduling can be improved for certain scenarios. However, if you are
  pointing that map partitions ought to be written to HDFS itself (with
  replication or without), I don't see performance improving. Note that
  the partitions aren't merely written but need to be sorted as well (at
  either end). To do that would need ability to spill frequently (cause
  we don't have infinite memory to do it all in RAM) and doing such a
  thing on HDFS would only mean slowdown.
 
  Thanks for clearing my doubts. In this case I was merely suggesting that
  if the mapper output (merged output in the end or the shuffle output) is
  stored in HDFS then reducers can just retrieve it from HDFS instead of
  asking tasktracker for it. Once reducer threads read it they can continue
  to work locally.
 
 
 
  I hope this helps clear some things up for you.
 
  --
  Harsh J
 



Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
I am going through the chapter "How MapReduce Works" and have some
confusion:

1) The description of the Mapper below says that reducers get the output file
using an HTTP call. But the description under "The Reduce Side" doesn't
specifically say if it's copied using HTTP. So first confusion: is the output
copied from mapper -> reducer or from reducer -> mapper? And second, is the
call http:// or hdfs://?

2) My understanding was that mapper output gets written to hdfs, since I've
seen part-m-0 files in hdfs. If mapper output is written to HDFS then
shouldn't reducers simply read it from hdfs instead of making http calls to
tasktrackers location?



- from the book ---
Mapper
The output file’s partitions are made available to the reducers over HTTP.
The number of worker threads used to serve the file partitions is
controlled by the tasktracker.http.threads property;
this setting is per tasktracker, not per map task slot. The default of 40
may need increasing for large clusters running large jobs.

The Reduce Side
Let’s turn now to the reduce part of the process. The map output file is
sitting on the local disk of the tasktracker that ran the map task
(note that although map outputs always get written to the local disk of the
map tasktracker, reduce outputs may not be), but now it is needed by the
tasktracker
that is about to run the reduce task for the partition. Furthermore, the
reduce task needs the map output for its particular partition from several
map tasks across the cluster.
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes. This is known as the copy
phase of the reduce task.
The reduce task has a small number of copier threads so that it can fetch
map outputs in parallel.
The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property.


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I am going through the chapter How mapreduce works and have some
  confusion:
 
  1) Below description of Mapper says that reducers get the output file
 using
  HTTP call. But the description under The Reduce Side doesn't
 specifically
  say if it's copied using HTTP. So first confusion: is the output copied
  from mapper -> reducer or from reducer -> mapper? And second, is the call
  http:// or hdfs://?

 The flow is simple as this:
 1. For M+R job, map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at JobTracker, and
 query the TaskTracker there to provide it the specific partition it
 needs, which is done over the TaskTracker's HTTP service (50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
 be over Netty connections. This would be much more faster and
 reliable.

  2) My understanding was that mapper output gets written to hdfs, since
 I've
  seen part-m-0 files in hdfs. If mapper output is written to HDFS then
  shouldn't reducers simply read it from hdfs instead of making http calls
 to
  tasktrackers location?

 A map-only job usually writes out to HDFS directly (no sorting done,
 cause no reducer is involved). If the job is a map+reduce one, the
 default output is collected to local filesystem for partitioning and
 sorting at map end, and eventually grouping at reduce end. Basically:
 Data you want to send to reducer from mapper goes to local FS for
 multiple actions to be performed on them, other data may directly go
 to HDFS.

 Reducers currently are scheduled pretty randomly but yes their
 scheduling can be improved for certain scenarios. However, if you are
 pointing that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). To do that would need ability to spill frequently (cause
 we don't have infinite memory to do it all in RAM) and doing such a
 thing on HDFS would only mean slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
if the mapper output (merged output in the end or the shuffle output) is
stored in HDFS then reducers can just retrieve it from HDFS instead of
asking tasktracker for it. Once reducer threads read it they can continue
to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 What is the corresponding system property for setNumTasks? Can it be used
 explicitly as system property like mapred.tasks.?


Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Sorry, I meant *setNumMapTasks*. What is mapred.map.tasks for? It's
confusing as to what its purpose is. I tried setting it for my job but I
still see more map tasks running than *mapred.map.tasks*.
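A minimal sketch of the distinction (old mapred API; the class name is made
up): the map count is only a hint, while the reduce count is honored.

 import org.apache.hadoop.mapred.JobConf;

 public class TaskCountSketch {
   public static JobConf configure(JobConf conf) {
     conf.setNumMapTasks(10);    // hint only; the InputFormat's splits decide the real map count
     conf.setNumReduceTasks(4);  // sets mapred.reduce.tasks and is honored exactly
     return conf;
   }
 }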

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J ha...@cloudera.com wrote:

 There isn't such an API as setNumTasks. There is however,
 setNumReduceTasks, which sets mapred.reduce.tasks.

 Does this answer your question?

 On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Could someone please help me answer this question?
 
  On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  What is the corresponding system property for setNumTasks? Can it be
 used
  explicitly as system property like mapred.tasks.?



 --
 Harsh J



Re: SequenceFile split question

2012-03-15 Thread Mohit Anchlia
Thanks! That helps. I am reading small xml files from an external file system
and then writing them to the SequenceFile. I made it a stand-alone client,
thinking that mapreduce may not be the best way to do this type of writing. My
understanding was that map reduce is best suited for processing data within
HDFS. Is map reduce also one of the options I should consider?

On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Mohit
  If you are using a stand-alone client application to do this, there is
 just one instance of it running and you'd be writing the sequence file to
 one hdfs block at a time. Once it reaches the hdfs block size the writing
 continues to the next block; in the meantime the first block is replicated.
 If you are doing the same job distributed as map reduce, you'd be writing to
 n files at a time, where n is the number of tasks in your map reduce job.
 AFAIK the data node where the blocks are placed is determined by hadoop; it
 is not controlled by the end-user application. But if you are triggering the
 stand-alone job on a particular data node and it has space, one replica
 would be stored on that node. The same applies in the case of MR tasks as
 well.

 Regards
 Bejoy.K.S

 On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I have a client program that creates sequencefile, which essentially
 merges
  small files into a big file. I was wondering how is sequence file
 splitting
  the data accross nodes. When I start the sequence file is empty. Does it
  get split when it reaches the dfs.block size? If so then does it mean
 that
  I am always writing to just one node at a given point in time?
 
  If I start a new client writing a new sequence file then is there a way
 to
  select a different data node?
 



Re: EOFException

2012-03-15 Thread Mohit Anchlia
This is actually just a hadoop job over HDFS. I am assuming you also know why
this is erroring out?

On Thu, Mar 15, 2012 at 1:02 PM, Gopal absoft...@gmail.com wrote:

  On 03/15/2012 03:06 PM, Mohit Anchlia wrote:

 When I start a job to read data from HDFS I start getting these errors.
 Does anyone know what this means and how to resolve it?

 2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
 2012-03-15 10:41:31,402 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-6402969611996946639_11837
 2012-03-15 10:41:31,403 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
 2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
 2012-03-15 10:41:31,406 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5442664108986165368_11838
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-3373089616877234160_11838
 2012-03-15 10:41:31,407 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
 2012-03-15 10:41:31,409 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
 2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.204:50010 java.io.EOFException
 2012-03-15 10:41:31,410 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_4481292025401332278_11838
 2012-03-15 10:41:31,411 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.204:50010
 2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.200:50010 java.io.EOFException
 2012-03-15 10:41:31,412 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-5326771177080888701_11838
 2012-03-15 10:41:31,413 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.200:50010
 2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.197:50010 java.io.EOFException
 2012-03-15 10:41:31,414 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-8073750683705518772_11839
 2012-03-15 10:41:31,415 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.197:50010
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.199:50010 java.io.EOFException
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Exception in createBlockOutputStream 164.28.62.198:50010 java.io.EOFException
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_441003866688859169_11838
 2012-03-15 10:41:31,416 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Abandoning block blk_-466858474055876377_11839
 2012-03-15 10:41:31,417 [Thread-5] INFO  org.apache.hadoop.hdfs.DFSClient - Excluding datanode 164.28.62.198:50010
 2012-03-15 10:41:31,417 [Thread-5] WARN  org.apache.hadoop.hdfs.DFSClient -


 Try shutting down and  restarting hbase.



SequenceFile split question

2012-03-14 Thread Mohit Anchlia
I have a client program that creates sequencefile, which essentially merges
small files into a big file. I was wondering how is sequence file splitting
the data accross nodes. When I start the sequence file is empty. Does it
get split when it reaches the dfs.block size? If so then does it mean that
I am always writing to just one node at a given point in time?

If I start a new client writing a new sequence file then is there a way to
select a different data node?


Re: mapred.tasktracker.map.tasks.maximum not working

2012-03-10 Thread Mohit Anchlia
Thanks. Looks like there are some parameters that I can use at client level
and others need cluster wide setting. Is there a place where I can see all
the config parameters with description of level of changes that can be done
at client level vs at cluster level?

On Fri, Mar 9, 2012 at 10:39 PM, bejoy.had...@gmail.com wrote:

 Adding on to Chen's response.

 This is a setting meant at Task Tracker level(environment setting based on
 parameters like your CPU cores, memory etc) and you need to override the
 same at each task tracker's mapred-site.xml and restart the TT daemon for
 changes to be in effect.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.

 -Original Message-
 From: Chen He airb...@gmail.com
 Date: Fri, 9 Mar 2012 20:16:23
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: mapred.tasktracker.map.tasks.maximum not working

 Setting mapred.tasktracker.map.tasks.maximum in your job means nothing,
 because the Hadoop mapreduce platform only checks this parameter when
 it starts. This is a system configuration.

  You need to set it in your conf/mapred-site.xml file and restart your
 hadoop mapreduce.


 On Fri, Mar 9, 2012 at 7:32 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I
 have 5
  nodes. I was expecting this to have only 10 concurrent jobs. But I have
 30
  mappers running. Does hadoop ignores this setting when supplied from the
  job?
 




mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
What's the difference between mapred.tasktracker.reduce.tasks.maximum and
mapred.map.tasks
**
I want my data to be split against only 10 mappers in the entire cluster.
Can I do that using one of the above parameters?


Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
What's the difference between setNumMapTasks and mapred.map.tasks?

On Fri, Mar 9, 2012 at 5:00 PM, Chen He airb...@gmail.com wrote:

 Hi Mohit

  mapred.tasktracker.reduce(map).tasks.maximum  means how many reduce(map)
 slot(s) you can have on each tasktracker.

 mapred.job.reduce(maps) means the default number of reduce (map) tasks your
 job will have.

 To set the number of mappers in your application, you can write something
 like this:

 configuration.setNumMapTasks(the number you want);

 Chen

 Actually, you can just use configuration.set()

 On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  What's the difference between mapred.tasktracker.reduce.tasks.maximum and
  mapred.map.tasks
  **
   I want my data to be split against only 10 mappers in the entire
 cluster.
  Can I do that using one of the above parameters?
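
To make the distinction above concrete, here is a minimal old-API sketch (the
class name is made up; note that setNumMapTasks lives on JobConf, and even
then it is only a hint - the actual map count is driven by the number of
input splits):

import org.apache.hadoop.mapred.JobConf;

public class MapCountExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MapCountExample.class);

    // Per-job hint: how many map tasks this job would like.
    conf.setNumMapTasks(10);              // same knob as mapred.map.tasks
    conf.setInt("mapred.map.tasks", 10);  // equivalent property form

    // Per-tasktracker slot count: setting it from job code has no effect.
    // It belongs in each tasktracker's mapred-site.xml, followed by a
    // tasktracker restart:
    //   mapred.tasktracker.map.tasks.maximum = 2
  }
}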
 



mapred.tasktracker.map.tasks.maximum not working

2012-03-09 Thread Mohit Anchlia
I have mapred.tasktracker.map.tasks.maximum set to 2 in my job and I have 5
nodes. I was expecting this to have only 10 concurrent jobs. But I have 30
mappers running. Does hadoop ignores this setting when supplied from the
job?


Re: mapred.map.tasks vs mapred.tasktracker.map.tasks.maximum

2012-03-09 Thread Mohit Anchlia
Is this system parameter too? Or can I specify as mapred.map.tasks? I am
using pig.

On Fri, Mar 9, 2012 at 6:19 PM, Chen He airb...@gmail.com wrote:

 if you do not specify  setNumMapTasks, by default, system will use the
 number you configured  for mapred.map.tasks in the conf/mapred-site.xml
 file.

 On Fri, Mar 9, 2012 at 7:19 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  What's the difference between setNumMapTasks and mapred.map.tasks?
 
  On Fri, Mar 9, 2012 at 5:00 PM, Chen He airb...@gmail.com wrote:
 
   Hi Mohit
  
mapred.tasktracker.reduce(map).tasks.maximum  means how many
  reduce(map)
   slot(s) you can have on each tasktracker.
  
    mapred.job.reduce(maps) means the default number of reduce (map) tasks
    your job will have.
  
    To set the number of mappers in your application, you can write something
    like this:

    configuration.setNumMapTasks(the number you want);
  
   Chen
  
   Actually, you can just use configuration.set()
  
   On Fri, Mar 9, 2012 at 6:42 PM, Mohit Anchlia mohitanch...@gmail.com
   wrote:
  
What's the difference between mapred.tasktracker.reduce.tasks.maximum
  and
mapred.map.tasks
**
 I want my data to be split against only 10 mappers in the entire
   cluster.
Can I do that using one of the above parameters?
   
  
 



Re: Profiling Hadoop Job

2012-03-08 Thread Mohit Anchlia
Can you check which user you are running this process as and compare it
with the ownership on the directory?

On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina lurb...@mit.edu wrote:

 Does anyone have any idea how to solve this problem? Regardless of whether
 I'm using plain HPROF or profiling through Starfish, I am getting the same
 error:

 Exception in thread main java.io.FileNotFoundException:
 attempt_201203071311_0004_m_
 00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:194)
at java.io.FileOutputStream.init(FileOutputStream.java:84)
at
 org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at
 org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at

 com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at

 com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 But I can't find what permissions to change to fix this issue. Any ideas?
 Thanks in advance,

 Best,
 -Leo


  On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote:

  Thanks,
  -Leo
 
 
  On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote:
 
  Hi Leo,
 
  Thanks for pointing out the outdated README file.  Glad to tell you that
  we
  do support the old API in the latest version. See here:
 
  http://www.cs.duke.edu/starfish/previous.html
 
  Welcome to join our mailing list and your questions will reach more of
 our
  group members.
 
  Jie
 
  On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu
 wrote:
 
   Hi Jie,
  
   According to the Starfish README, the hadoop programs must be written
  using
   the new Hadoop API. This is not my case (I am using MultipleInputs
 among
   other non-new API supported features). Is there any way around this?
   Thanks,
  
   -Leo
  
   On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote:
  
Hi Leonardo,
   
You might want to try Starfish which supports the memory profiling
 as
   well
as cpu/disk/network profiling for the performance tuning.
   
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish
   
   
On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu
  wrote:
   
 Hello everyone,

 I have a Hadoop job that I run on several GBs of data that I am
  trying
   to
 optimize in order to reduce the memory consumption as well as
  improve
   the
 speed. I am following the steps outlined in Tom White's Hadoop:
 The
 Definitive Guide for profiling using HPROF (p161), by setting the
 following properties in the JobConf:

 job.setProfileEnabled(true);
 job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6," +
     "force=n,thread=y,verbose=n,file=%s");
 job.setProfileTaskRange(true, "0-2");
 job.setProfileTaskRange(false, "0-2");

 I am trying to run this locally on a single pseudo-distributed
  install
   of
 hadoop (0.20.2) and it gives the following error:

 Exception in thread main java.io.FileNotFoundException:
 attempt_201203071311_0004_m_00_0.profile (Permission denied)
at java.io.FileOutputStream.open(Native Method)
at
 java.io.FileOutputStream.init(FileOutputStream.java:194)
at
 java.io.FileOutputStream.init(FileOutputStream.java:84)
at

  org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226)
at

   
  
 
 org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302)
at
   org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at


   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89)
at
 org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at


   
  
 
 com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
  Method)
at


   
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at


   
  
 
 

Re: Java Heap space error

2012-03-06 Thread Mohit Anchlia
I am still trying to see how to narrow this down. Is it possible to set the
HeapDumpOnOutOfMemoryError option on these individual tasks?
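
One way that should work - a sketch assuming the property meant here is
mapred.child.java.opts, with an illustrative heap size and dump path - is to
pass the standard HotSpot flags through the per-job child options:

import org.apache.hadoop.mapred.JobConf;

public class OomDebugOpts {
  public static void apply(JobConf conf) {
    // Per-job child JVM options: heap size plus a heap dump on OutOfMemoryError.
    // A relative dump path lands in the task attempt's working directory on the
    // node that ran the task; use an absolute path if that directory gets
    // cleaned up before you can grab the dump.
    conf.set("mapred.child.java.opts",
        "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof");
  }
}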

On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Sorry for multiple emails. I did find:


 2012-03-05 17:26:35,636 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
 Usage threshold init = 715849728(699072K) used = 575921696(562423K)
 committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:35,719 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
 7816154 bytes from 1 objects. init = 715849728(699072K) used =
 575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:36,881 INFO
 org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
 - Collection threshold init = 715849728(699072K) used = 358720384(350312K)
 committed = 715849728(699072K) max = 715849728(699072K)

 2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
 Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

 2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error
 running child : java.lang.OutOfMemoryError: Java heap space

 at java.nio.HeapCharBuffer.init(HeapCharBuffer.java:39)

 at java.nio.CharBuffer.allocate(CharBuffer.java:312)

 at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)

 at org.apache.hadoop.io.Text.decode(Text.java:350)

 at org.apache.hadoop.io.Text.decode(Text.java:327)

 at org.apache.hadoop.io.Text.toString(Text.java:254)

 at
 org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)

 at
 org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)

 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)

 at
 org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)

 at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)

 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)

 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

 at org.apache.hadoop.mapred.Child.main(Child.java:264)


   On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 All I see in the logs is:


 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
 attempt_201203051722_0001_m_30_1 - Killed : Java heap space

 Looks like task tracker is killing the tasks. Not sure why. I increased
 heap from 512 to 1G and still it fails.


 On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I currently have java.opts.mapred set to 512MB and I am getting heap
 space errors. How should I go about debugging heap space issues?






Re: AWS MapReduce

2012-03-05 Thread Mohit Anchlia
On Mon, Mar 5, 2012 at 7:40 AM, John Conwell j...@iamjohn.me wrote:

 AWS MapReduce (EMR) does not use S3 for its HDFS persistance.  If it did
 your S3 billing would be massive :)  EMR reads all input jar files and
 input data from S3, but it copies these files down to its local disk.  It
 then starts the MR process, doing all HDFS reads and writes to the
 local disks.  At the end of the MR job, it copies the MR job output and all
 process logs to S3, and then tears down the VM instances.

 You can see this for yourself if you spin up a small EMR cluster, but turn
 off the configuration flag that kills the VMs at the end of the MR job.
  Then look at the hadoop configuration files to see how hadoop is
 configured.

 I really like EMR.  Amazon  has done a lot of work to optimize the hadoop
 configurations and VM instance AMIs to execute MR jobs fairly efficiently
 on a VM cluster.  I had to do a lot of (expensive) trial and error work to
 figure out an optimal hadoop / VM configuration to run our MR jobs without
 crashing / timing out the jobs.  The only reason we didn't standardize on
 EMR was that it strongly bound your code base / process to using EMR for
 hadoop processing, vs a flexible infrastructure that could use a local
 cluster or cluster on a different cloud provider.

 Thanks for your input. I am assuming HDFS is created on ephemeral disks
and not EBS. Also, is it possible to share some of your findings?


 On Sun, Mar 4, 2012 at 8:51 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  As far as I see in the docs it looks like you could also use hdfs instead
  of s3. But what I am not sure is if these are local disks or EBS.
 
  On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer 
  hannesc...@googlemail.com
   wrote:
 
   Hi,
  
    yes, it's loaded from S3. Imho Amazon AWS Map-Reduce is pretty slow.
   The setup is done pretty fast and there are some configuration
 parameters
   you can bypass - for example blocksizes etc. - but in the end imho
  setting
   up ec2 instances by copying images is the better alternative.
  
   Kind Regards
  
   Hannes
  
   On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com
   wrote:
  
     I think I found the answer to this question. However, it's still not clear
 if
HDFS is on local disk or EBS volumes. Does anyone know?
   
On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia 
 mohitanch...@gmail.com
wrote:
   
 Just want to check  how many are using AWS mapreduce and understand
  the
 pros and cons of Amazon's MapReduce machines? Is it true that these
  map
 reduce machines are really reading and writing from S3 instead of
  local
 disks? Has anyone found issues with Amazon MapReduce and how does
 it
 compare with using MapReduce on local attached disks compared to
  using
S3.
   
  
   ---
   www.informera.de
   Hadoop  Big Data Services
  
 



 --

 Thanks,
 John C



Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
All I see in the logs is:


2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
attempt_201203051722_0001_m_30_1 - Killed : Java heap space

Looks like task tracker is killing the tasks. Not sure why. I increased
heap from 512 to 1G and still it fails.


On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I currently have java.opts.mapred set to 512MB and I am getting heap space
 errors. How should I go about debugging heap space issues?



Re: Java Heap space error

2012-03-05 Thread Mohit Anchlia
Sorry for multiple emails. I did find:


2012-03-05 17:26:35,636 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call-
Usage threshold init = 715849728(699072K) used = 575921696(562423K)
committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:35,719 INFO
org.apache.pig.impl.util.SpillableMemoryManager: Spilled an estimate of
7816154 bytes from 1 objects. init = 715849728(699072K) used =
575921696(562423K) committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:36,881 INFO
org.apache.pig.impl.util.SpillableMemoryManager: first memory handler call
- Collection threshold init = 715849728(699072K) used = 358720384(350312K)
committed = 715849728(699072K) max = 715849728(699072K)

2012-03-05 17:26:36,885 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1

2012-03-05 17:26:36,888 FATAL org.apache.hadoop.mapred.Child: Error running
child : java.lang.OutOfMemoryError: Java heap space

at java.nio.HeapCharBuffer.init(HeapCharBuffer.java:39)

at java.nio.CharBuffer.allocate(CharBuffer.java:312)

at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:760)

at org.apache.hadoop.io.Text.decode(Text.java:350)

at org.apache.hadoop.io.Text.decode(Text.java:327)

at org.apache.hadoop.io.Text.toString(Text.java:254)

at
org.apache.pig.piggybank.storage.SequenceFileLoader.translateWritableToPigDataType(SequenceFileLoader.java:105)

at
org.apache.pig.piggybank.storage.SequenceFileLoader.getNext(SequenceFileLoader.java:139)

at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:187)

at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)

at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)

at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:396)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

at org.apache.hadoop.mapred.Child.main(Child.java:264)


On Mon, Mar 5, 2012 at 5:46 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 All I see in the logs is:


 2012-03-05 17:26:36,889 FATAL org.apache.hadoop.mapred.TaskTracker: Task:
 attempt_201203051722_0001_m_30_1 - Killed : Java heap space

 Looks like task tracker is killing the tasks. Not sure why. I increased
 heap from 512 to 1G and still it fails.


 On Mon, Mar 5, 2012 at 5:03 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I currently have java.opts.mapred set to 512MB and I am getting heap
 space errors. How should I go about debugging heap space issues?





Re: AWS MapReduce

2012-03-04 Thread Mohit Anchlia
As far as I see in the docs it looks like you could also use hdfs instead
of s3. But what I am not sure is if these are local disks or EBS.

On Sun, Mar 4, 2012 at 2:27 AM, Hannes Carl Meyer hannesc...@googlemail.com
 wrote:

 Hi,

 yes, it's loaded from S3. Imho Amazon AWS Map-Reduce is pretty slow.
 The setup is done pretty fast and there are some configuration parameters
 you can bypass - for example blocksizes etc. - but in the end imho setting
 up ec2 instances by copying images is the better alternative.

 Kind Regards

 Hannes

 On Sun, Mar 4, 2012 at 2:31 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

   I think I found the answer to this question. However, it's still not clear if
  HDFS is on local disk or EBS volumes. Does anyone know?
 
  On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   Just want to check  how many are using AWS mapreduce and understand the
   pros and cons of Amazon's MapReduce machines? Is it true that these map
   reduce machines are really reading and writing from S3 instead of local
   disks? Has anyone found issues with Amazon MapReduce and how does it
   compare with using MapReduce on local attached disks compared to using
  S3.
 

 ---
 www.informera.de
 Hadoop  Big Data Services



Re: AWS MapReduce

2012-03-03 Thread Mohit Anchlia
I think I found the answer to this question. However, it's still not clear if
HDFS is on local disk or EBS volumes. Does anyone know?

On Sat, Mar 3, 2012 at 3:54 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Just want to check  how many are using AWS mapreduce and understand the
 pros and cons of Amazon's MapReduce machines? Is it true that these map
 reduce machines are really reading and writing from S3 instead of local
 disks? Has anyone found issues with Amazon MapReduce and how does it
 compare with using MapReduce on local attached disks compared to using S3.


Re: Hadoop pain points?

2012-03-02 Thread Mohit Anchlia
+1

On Fri, Mar 2, 2012 at 4:09 PM, Harsh J ha...@cloudera.com wrote:

 Since you ask about anything in general, when I forayed into using
 Hadoop, my biggest pain was lack of documentation clarity and
 completeness over the MR and DFS user APIs (and other little points).

 It would be nice to have some work done to have one example or
 semi-example for every single Input/OutputFormat, Mapper/Reducer
 implementations, etc. added to the javadocs.

 I believe examples and snippets help out a ton (tons more than
 explaining just behavior) to new devs.

 On Fri, Mar 2, 2012 at 9:45 PM, Kunaal kunalbha...@alumni.cmu.edu wrote:
  I am doing a general poll on what are the most prevalent pain points that
  people run into with Hadoop? These could be performance related (memory
  usage, IO latencies), usage related or anything really.
 
  The goal is to look for what areas this platform could benefit the most
 in
  the near future.
 
  Any feedback is much appreciated.
 
  Thanks,
  Kunal.



 --
 Harsh J



kill -QUIT

2012-03-01 Thread Mohit Anchlia
When I try kill -QUIT for a job it doesn't send the stacktrace to the log
files. Does anyone know why or if I am doing something wrong?

I find the job using ps -ef|grep attempt. I then go to
logs/userLogs/jobid/attemptid/


Adding nodes

2012-03-01 Thread Mohit Anchlia
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:

http://wiki.apache.org/hadoop/FAQ

1. Update conf/slave
2. on the slave nodes start datanode and tasktracker
3. hadoop balancer

Do I also need to run dfsadmin -refreshnodes?


Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com wrote:

 You only have to refresh nodes if you're making use of an allows file.

 Thanks does it mean that when tasktracker/datanode starts up it
communicates with namenode using master file?

Sent from my iPhone

 On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:

  Is this the right procedure to add nodes? I took some from hadoop wiki
 FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?



Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com wrote:

 Not quite. Datanodes get the namenode host from fs.default.name in
 core-site.xml. Task trackers find the job tracker from the
 mapred.job.tracker setting in mapred-site.xml.


I actually meant to ask how does namenode/jobtracker know there is a new
node in the cluster. Is it initiated by namenode when slave file is edited?
Or is it initiated by tasktracker when tasktracker is started?


 Sent from my iPhone

 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com
 wrote:
 
  You only have to refresh nodes if you're making use of an allows file.
 
  Thanks does it mean that when tasktracker/datanode starts up it
  communicates with namenode using master file?
 
  Sent from my iPhone
 
  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:
 
  Is this the right procedure to add nodes? I took some from hadoop wiki
  FAQ:
 
  http://wiki.apache.org/hadoop/FAQ
 
  1. Update conf/slave
  2. on the slave nodes start datanode and tasktracker
  3. hadoop balancer
 
  Do I also need to run dfsadmin -refreshnodes?
 



Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
Thanks all for the answers!!

On Thu, Mar 1, 2012 at 5:52 PM, Arpit Gupta ar...@hortonworks.com wrote:

 It is initiated by the slave.

 If you have defined files to state which slaves can talk to the namenode
 (using config dfs.hosts) and which hosts cannot (using
 property dfs.hosts.exclude) then you would need to edit these files and
 issue the refresh command.


  On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote:

  On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria j...@cloudera.com
 wrote:

  Not quite. Datanodes get the namenode host from fs.default.name in

 core-site.xml. Task trackers find the job tracker from the

 mapred.job.tracker setting in mapred-site.xml.



 I actually meant to ask how does namenode/jobtracker know there is a new
 node in the cluster. Is it initiated by namenode when slave file is edited?
 Or is it initiated by tasktracker when tasktracker is started?


 Sent from my iPhone


 On Mar 1, 2012, at 18:49, Mohit Anchlia mohitanch...@gmail.com wrote:


  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria j...@cloudera.com

 wrote:


  You only have to refresh nodes if you're making use of an allows file.


  Thanks does it mean that when tasktracker/datanode starts up it

  communicates with namenode using master file?


  Sent from my iPhone


  On Mar 1, 2012, at 18:29, Mohit Anchlia mohitanch...@gmail.com wrote:


   Is this the right procedure to add nodes? I took some from hadoop wiki

  FAQ:


   http://wiki.apache.org/hadoop/FAQ


   1. Update conf/slave

   2. on the slave nodes start datanode and tasktracker

   3. hadoop balancer


   Do I also need to run dfsadmin -refreshnodes?





 --
 Arpit
 Hortonworks, Inc.
 email: ar...@hortonworks.com

 http://www.hadoopsummit.org/
  http://www.hadoopsummit.org/
 http://www.hadoopsummit.org/



Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I am going to try a few things today. I have a JAXBContext object that
marshals the xml; it is a static instance, but my guess at this point is
that, since it is in a separate jar from the one where the job runs and I used
DistributedCache.addFileToClassPath, this context is being created on every
call for some reason. I don't know why that would be. I am going to create this
instance as static in the mapper class itself and see if that helps. I will
also add debug logging. Will post the results after I try it out.

On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 It would be great if we can take a look at what you are doing in the UDF vs
 the Mapper.

 100x slow does not make sense for the same job/logic, its either the Mapper
 code or may be the cluster was busy at the time you scheduled MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I am comparing runtime of similar logic. The entire logic is exactly same
  but surprisingly map reduce job that I submit is 100x slow. For pig I use
  udf and for hadoop I use mapper only and the logic same as pig. Even the
  splits on the admin page are same. Not sure why it's so slow. I am
  submitting job like:
 
  java -classpath
 
 
 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
 
 
 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about looking the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of map task list I
  see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728
 but
  for pig the status is blank.
 



Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I think I've found the problem. There was one line of code that caused this
issue :)  that was output.collect(key, value);

I had to add more logging to the code to get to it. For some reason kill
-QUIT didn't send the stacktrace to userLogs/job/attempt/syslog; I
searched all the logs and couldn't find one. Does anyone know where
stacktraces are generally sent?

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I can't seem to find what's causing this slowness. Nothing in the logs.
 It's just painfully slow. However, the pig job that has the same logic
 performs great. Here is the mapper code and the pig code:


 public static class Map extends MapReduceBase
     implements Mapper<Text, Text, Text, Text> {

   public void map(Text key, Text value,
                   OutputCollector<Text, Text> output,
                   Reporter reporter) throws IOException {
     String line = value.toString();
     // log.info("output key:" + key + " value " + value + " value " + line);
     FormMLType f;
     try {
       f = FormMLUtils.convertToRows(line);
       FormMLStack fm = new FormMLStack(f, key.toString());
       fm.parseFormML();
       for (String row : fm.getFormattedRecords(false)) {
         output.collect(key, value);
       }
     } catch (JAXBException e) {
       log.error("Error processing record " + key, e);
     }
   }
 }

 And here is the pig udf:


 public DataBag exec(Tuple input) throws IOException {
   try {
     DataBag output = mBagFactory.newDefaultBag();
     Object o = input.get(1);
     if (!(o instanceof String)) {
       throw new IOException(
           "Expected document input to be chararray, but got "
           + o.getClass().getName());
     }
     Object o1 = input.get(0);
     if (!(o1 instanceof String)) {
       throw new IOException(
           "Expected input to be chararray, but got "
           + o.getClass().getName());
     }
     String document = (String) o;
     String filename = (String) o1;
     FormMLType f = FormMLUtils.convertToRows(document);
     FormMLStack fm = new FormMLStack(f, filename);
     fm.parseFormML();
     for (String row : fm.getFormattedRecords(false)) {
       output.add(mTupleFactory.newTuple(row));
     }
     return output;
   } catch (ExecException ee) {
     log.error("Failed to Process ", ee);
     throw ee;
   } catch (JAXBException e) {
     // TODO Auto-generated catch block
     log.error("Invalid xml", e);
     throw new IllegalArgumentException("invalid xml "
         + e.getCause().getMessage());
   }
 }

   On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia 
 mohitanch...@gmail.comwrote:

 I am going to try a few things today. I have a JAXBContext object that
 marshals the xml; it is a static instance, but my guess at this point is
 that, since it is in a separate jar from the one where the job runs and I used
 DistributedCache.addFileToClassPath, this context is being created on every
 call for some reason. I don't know why that would be. I am going to create this
 instance as static in the mapper class itself and see if that helps. I will
 also add debug logging. Will post the results after I try it out.


 On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com
  wrote:

 It would be great if we can take a look at what you are doing in the UDF
 vs
 the Mapper.

 100x slow does not make sense for the same job/logic, its either the
 Mapper
 code or may be the cluster was busy at the time you scheduled MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I am comparing runtime of similar logic. The entire logic is exactly
 same
  but surprisingly map reduce job that I submit is 100x slow. For pig I
 use
  udf and for hadoop I use mapper only and the logic same as pig. Even
 the
  splits on the admin page are same. Not sure why it's so slow. I am
  submitting job like:
 
  java -classpath
 
 
 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
 
 
 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about looking the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of map task list
 I
  see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728
 but
  for pig the status is blank.
 






Re: Invocation exception

2012-02-29 Thread Mohit Anchlia
Thanks for the example. I did look at the logs and also at the admin page
and all I see is the exception that I posted initially.

I am not sure why adding an extra jar to the classpath in DistributedCache
causes that exception. I tried to look at Configuration code in hadoop.util
package but it doesn't tell much. It looks like it's throwing on this line
in the code below: configureMethod.invoke(theObject, conf);


private static void setJobConf(Object theObject, Configuration conf) {
  //If JobConf and JobConfigurable are in classpath, AND
  //theObject is of type JobConfigurable AND
  //conf is of type JobConf then
  //invoke configure on theObject
  try {
    Class<?> jobConfClass =
        conf.getClassByName("org.apache.hadoop.mapred.JobConf");
    Class<?> jobConfigurableClass =
        conf.getClassByName("org.apache.hadoop.mapred.JobConfigurable");
    if (jobConfClass.isAssignableFrom(conf.getClass()) &&
        jobConfigurableClass.isAssignableFrom(theObject.getClass())) {
      Method configureMethod =
          jobConfigurableClass.getMethod("configure", jobConfClass);
      configureMethod.invoke(theObject, conf);
    }
  } catch (ClassNotFoundException e) {
    //JobConf/JobConfigurable not in classpath. no need to configure
  } catch (Exception e) {
    throw new RuntimeException("Error in configuring object", e);
  }
}

On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 If you visit the failed task attempt on the JT Web UI, you can see the
 complete, informative stack trace on it. It would point the exact line
 the trouble came up in and what the real error during the
 configure-phase of task initialization was.

 A simple attempts page goes like the following (replace job ID and
 task ID of course):


 http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964tipid=task_201202041249_3964_m_00

 Once there, find and open the All logs link to see stdout, stderr,
 and syslog of the specific failed task attempt. You'll have more info
 sifting through this to debug your issue.

 This is also explained in Tom's book under the title Debugging a Job
 (p154, Hadoop: The Definitive Guide, 2nd ed.).

 On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  It looks like adding this line causes invocation exception. I looked in
  hdfs and I see that file in that path
 
   DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
 conf);
 
  I have similar code for another jar
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf); but this works just fine.
 
 
  On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  I commented reducer and combiner both and still I see the same
 exception.
  Could it be because I have 2 jars being added?
 
   On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com
 wrote:
 
  On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   For some reason I am getting invocation exception and I don't see any
  more
   details other than this exception:
  
   My job is configured as:
  
  
   JobConf conf = *new* JobConf(FormMLProcessor.*class*);
  
   conf.addResource(hdfs-site.xml);
  
   conf.addResource(core-site.xml);
  
   conf.addResource(mapred-site.xml);
  
   conf.set(mapred.reduce.tasks, 0);
  
   conf.setJobName(mlprocessor);
  
   DistributedCache.*addFileToClassPath*(*new*
 Path(/jars/analytics.jar),
   conf);
  
   DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
   conf);
  
   conf.setOutputKeyClass(Text.*class*);
  
   conf.setOutputValueClass(Text.*class*);
  
   conf.setMapperClass(Map.*class*);
  
   conf.setCombinerClass(Reduce.*class*);
  
   conf.setReducerClass(IdentityReducer.*class*);
  
 
  Why would you set the Reducer when the number of reducers is set to
 zero.
  Not sure if this is the real cause.
 
 
  
   conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
  
   conf.setOutputFormat(TextOutputFormat.*class*);
  
   FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
  
   FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
  
   JobClient.*runJob*(conf);
  
   -
   *
  
   java.lang.RuntimeException*: Error in configuring object
  
   at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
   ReflectionUtils.java:93*)
  
   at
  
 
 org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
  
   at org.apache.hadoop.util.ReflectionUtils.newInstance(*
   ReflectionUtils.java:117*)
  
   at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
  
   at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
  
   at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
  
   at java.security.AccessController.doPrivileged(*Native Method*)
  
   at javax.security.auth.Subject.doAs(*Subject.java:396*)
  
   at org.apache.hadoop.security.UserGroupInformation.doAs(*
   UserGroupInformation.java:1157*)
  
   at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
  
   Caused

Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
I commented reducer and combiner both and still I see the same exception.
Could it be because I have 2 jars being added?

On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting invocation exception and I don't see any
 more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = *new* JobConf(FormMLProcessor.*class*);
 
  conf.addResource(hdfs-site.xml);
 
  conf.addResource(core-site.xml);
 
  conf.addResource(mapred-site.xml);
 
  conf.set(mapred.reduce.tasks, 0);
 
  conf.setJobName(mlprocessor);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
  conf);
 
  conf.setOutputKeyClass(Text.*class*);
 
  conf.setOutputValueClass(Text.*class*);
 
  conf.setMapperClass(Map.*class*);
 
  conf.setCombinerClass(Reduce.*class*);
 
  conf.setReducerClass(IdentityReducer.*class*);
 

 Why would you set the Reducer when the number of reducers is set to zero.
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
 
  conf.setOutputFormat(TextOutputFormat.*class*);
 
  FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
 
  FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
 
  JobClient.*runJob*(conf);
 
  -
  *
 
  java.lang.RuntimeException*: Error in configuring object
 
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
  ReflectionUtils.java:93*)
 
  at
  org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
 
  at org.apache.hadoop.util.ReflectionUtils.newInstance(*
  ReflectionUtils.java:117*)
 
  at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
 
  at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
 
  at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
 
  at java.security.AccessController.doPrivileged(*Native Method*)
 
  at javax.security.auth.Subject.doAs(*Subject.java:396*)
 
  at org.apache.hadoop.security.UserGroupInformation.doAs(*
  UserGroupInformation.java:1157*)
 
  at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
 
  Caused by: *java.lang.reflect.InvocationTargetException
  *
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(*
  NativeMethodAccessorImpl.java:39*)
 
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 



Re: Invocation exception

2012-02-28 Thread Mohit Anchlia
It looks like adding this line causes invocation exception. I looked in
hdfs and I see that file in that path

DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar), conf);

I have similar code for another jar
DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
conf); but this works just fine.


On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I commented reducer and combiner both and still I see the same exception.
 Could it be because I have 2 jars being added?

  On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.comwrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting invocation exception and I don't see any
 more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = *new* JobConf(FormMLProcessor.*class*);
 
  conf.addResource(hdfs-site.xml);
 
  conf.addResource(core-site.xml);
 
  conf.addResource(mapred-site.xml);
 
  conf.set(mapred.reduce.tasks, 0);
 
  conf.setJobName(mlprocessor);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
  conf);
 
  conf.setOutputKeyClass(Text.*class*);
 
  conf.setOutputValueClass(Text.*class*);
 
  conf.setMapperClass(Map.*class*);
 
  conf.setCombinerClass(Reduce.*class*);
 
  conf.setReducerClass(IdentityReducer.*class*);
 

 Why would you set the Reducer when the number of reducers is set to zero.
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
 
  conf.setOutputFormat(TextOutputFormat.*class*);
 
  FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
 
  FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
 
  JobClient.*runJob*(conf);
 
  -
  *
 
  java.lang.RuntimeException*: Error in configuring object
 
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
  ReflectionUtils.java:93*)
 
  at
 
 org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
 
  at org.apache.hadoop.util.ReflectionUtils.newInstance(*
  ReflectionUtils.java:117*)
 
  at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
 
  at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
 
  at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
 
  at java.security.AccessController.doPrivileged(*Native Method*)
 
  at javax.security.auth.Subject.doAs(*Subject.java:396*)
 
  at org.apache.hadoop.security.UserGroupInformation.doAs(*
  UserGroupInformation.java:1157*)
 
  at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
 
  Caused by: *java.lang.reflect.InvocationTargetException
  *
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(*
  NativeMethodAccessorImpl.java:39*)
 
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 





100x slower mapreduce compared to pig

2012-02-28 Thread Mohit Anchlia
I am comparing the runtime of similar logic. The entire logic is exactly the
same, but surprisingly the map reduce job that I submit is 100x slower. For pig
I use a udf and for hadoop I use a mapper only, with the same logic as the pig
version. Even the splits on the admin page are the same. Not sure why it's so
slow. I am submitting the job like:

java -classpath
.:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
com.services.dp.analytics.hadoop.mapred.FormMLProcessor
/examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
/examples/output1/

How should I go about finding the root cause of why it's so slow? Any
suggestions would be really appreciated.



One of the things I noticed is that on the admin page of map task list I
see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728 but
for pig the status is blank.


Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
Can someone please suggest if parameters like dfs.block.size,
mapred.tasktracker.map.tasks.maximum are only cluster wide settings or can
these be set per client job configuration?

On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 If I want to change the block size then can I use Configuration in
 mapreduce job and set it when writing to the sequence file or does it need
 to be cluster wide setting in .xml files?

 Also, is there a way to check the block of a given file?



Task Killed but no errors

2012-02-27 Thread Mohit Anchlia
I submitted a map reduce job that had 9 tasks killed out of 139. But I
don't see any errors in the admin page. The entire job however has
SUCCEEDED. How can I track down the reason?

Also, how do I determine if this is something to worry about?


Re: dfs.block.size

2012-02-27 Thread Mohit Anchlia
How do I verify the block size of a given file? Is there a command?

On Mon, Feb 27, 2012 at 7:59 AM, Joey Echeverria j...@cloudera.com wrote:

 dfs.block.size can be set per job.

 mapred.tasktracker.map.tasks.maximum is per tasktracker.

 -Joey

 On Mon, Feb 27, 2012 at 10:19 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  Can someone please suggest if parameters like dfs.block.size,
  mapred.tasktracker.map.tasks.maximum are only cluster wide settings or
 can
  these be set per client job configuration?
 
  On Sat, Feb 25, 2012 at 5:43 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  If I want to change the block size then can I use Configuration in
  mapreduce job and set it when writing to the sequence file or does it
 need
  to be cluster wide setting in .xml files?
 
  Also, is there a way to check the block of a given file?
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
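
To make that concrete, a minimal sketch (old JobConf API, with a made-up
128 MB target and an example path) of setting dfs.block.size per job and
reading back the block size an existing file was written with; on the command
line, hadoop fsck <path> -files -blocks shows the same information:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class BlockSizeCheck {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BlockSizeCheck.class);

    // Per-job override: files created by this job's tasks use this block size.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);

    // Read back the block size of an existing file (path is just an example).
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/examples/testfile40.seq"));
    System.out.println("block size = " + status.getBlockSize() + " bytes");
  }
}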



Handling bad records

2012-02-27 Thread Mohit Anchlia
What's the best way to write records to a different file? I am doing xml
processing and during processing I might come across invalid xml.
Currently I have it in a try catch block and write the errors to log4j. But I think
it would be better to just write it to an output file that just contains
errors.


Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
Does it matter if a reducer is set even if the number of reducers is 0? Is
there a way to get a clearer reason?

On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com wrote:

 On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  For some reason I am getting invocation exception and I don't see any
 more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = new JobConf(FormMLProcessor.class);

  conf.addResource("hdfs-site.xml");
  conf.addResource("core-site.xml");
  conf.addResource("mapred-site.xml");
  conf.set("mapred.reduce.tasks", "0");
  conf.setJobName("mlprocessor");

  DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"), conf);
  DistributedCache.addFileToClassPath(new Path("/jars/common.jar"), conf);

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(Text.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(IdentityReducer.class);
 

 Why would you set the Reducer when the number of reducers is set to zero.
 Not sure if this is the real cause.


 
  conf.setInputFormat(SequenceFileAsTextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
 
  -

  java.lang.RuntimeException: Error in configuring object
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
  at org.apache.hadoop.mapred.Child.main(Child.java:264)
  Caused by: java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 



Re: Invocation exception

2012-02-27 Thread Mohit Anchlia
On Mon, Feb 27, 2012 at 8:58 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 Tom White's Definitive Guide book is a great reference. Answers to
 most of your questions could be found there.

 I've been through that book but haven't come across how to debug this
exception. Can you point me to the topic in that book where I'll find this
information?


 Sent from my iPhone

 On Feb 27, 2012, at 8:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  Does it matter if reducer is set even if the no of reducers is 0? Is
 there
  a way to get more clear reason?
 
  On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com
 wrote:
 
  On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
  For some reason I am getting invocation exception and I don't see any
  more
  details other than this exception:
 
  My job is configured as:
 
 
  JobConf conf = *new* JobConf(FormMLProcessor.*class*);
 
  conf.addResource(hdfs-site.xml);
 
  conf.addResource(core-site.xml);
 
  conf.addResource(mapred-site.xml);
 
  conf.set(mapred.reduce.tasks, 0);
 
  conf.setJobName(mlprocessor);
 
  DistributedCache.*addFileToClassPath*(*new*
 Path(/jars/analytics.jar),
  conf);
 
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
  conf);
 
  conf.setOutputKeyClass(Text.*class*);
 
  conf.setOutputValueClass(Text.*class*);
 
  conf.setMapperClass(Map.*class*);
 
  conf.setCombinerClass(Reduce.*class*);
 
  conf.setReducerClass(IdentityReducer.*class*);
 
 
  Why would you set the Reducer when the number of reducers is set to
 zero.
  Not sure if this is the real cause.
 
 
 
  conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
 
  conf.setOutputFormat(TextOutputFormat.*class*);
 
  FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
 
  FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
 
  JobClient.*runJob*(conf);
 
  -
  *
 
  java.lang.RuntimeException*: Error in configuring object
 
  at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
  ReflectionUtils.java:93*)
 
  at
 
 org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
 
  at org.apache.hadoop.util.ReflectionUtils.newInstance(*
  ReflectionUtils.java:117*)
 
  at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
 
  at org.apache.hadoop.mapred.MapTask.run(*MapTask.java:325*)
 
  at org.apache.hadoop.mapred.Child$4.run(*Child.java:270*)
 
  at java.security.AccessController.doPrivileged(*Native Method*)
 
  at javax.security.auth.Subject.doAs(*Subject.java:396*)
 
  at org.apache.hadoop.security.UserGroupInformation.doAs(*
  UserGroupInformation.java:1157*)
 
  at org.apache.hadoop.mapred.Child.main(*Child.java:264*)
 
  Caused by: *java.lang.reflect.InvocationTargetException
  *
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(*Native Method*)
 
  at sun.reflect.NativeMethodAccessorImpl.invoke(*
  NativeMethodAccessorImpl.java:39*)
 
  at
 
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav
 
 



Re: Handling bad records

2012-02-27 Thread Mohit Anchlia
Thanks, that's helpful. In that example, what are A and B referring to? Are
those the output file names?

mos.getCollector("seq", "A", reporter).collect(key, new Text("Bye"));
mos.getCollector("seq", "B", reporter).collect(key, new Text("Chau"));


On Mon, Feb 27, 2012 at 9:53 PM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 Use the MultipleOutputs API:

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
 to have a named output of bad records. There is an example of use
 detailed on the link.

 On Tue, Feb 28, 2012 at 3:48 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  What's the best way to write records to a different file? I am doing xml
  processing and during processing I might come accross invalid xml format.
  Current I have it under try catch block and writing to log4j. But I think
  it would be better to just write it to an output file that just contains
  errors.



 --
 Harsh J
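
For what it's worth, a minimal old-API sketch of the pattern described above,
with a single named output for bad records. The output name badxml and the
Text/Text types are assumptions; in the javadoc snippet quoted earlier, seq is
a multi named output and A / B become part of the generated file names
(roughly seq_A-m-00000 and seq_B-m-00000):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class XmlMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  // In the driver, before submitting the job:
  //   MultipleOutputs.addNamedOutput(conf, "badxml",
  //       TextOutputFormat.class, Text.class, Text.class);

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void map(Text key, Text value, OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    try {
      // ... normal xml processing; good records go to the main output ...
      output.collect(key, value);
    } catch (Exception badXml) {
      // Bad records go to the side output instead of log4j
      mos.getCollector("badxml", reporter).collect(key, value);
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();  // flushes the named outputs
  }
}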



Re: LZO with sequenceFile

2012-02-26 Thread Mohit Anchlia
On Sun, Feb 26, 2012 at 9:09 AM, Harsh J ha...@cloudera.com wrote:

 If you want to just quickly package the hadoop-lzo items instead of
 building/managing-deployment on your own, you can reuse Todd Lipcon's
 script at https://github.com/toddlipcon/hadoop-lzo-packager - Creates
 both RPMs and DEBs.


Thanks! Some questions I have is:
1. Would it work with sequence files? I am using
SequenceFileAsTextInputStream
2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
split the files?
3. I am also using CDH's 20.2 version of hadoop.



 On Sun, Feb 26, 2012 at 9:55 PM, Ioan Eugen Stan stan.ieu...@gmail.com
 wrote:
  2012/2/26 Mohit Anchlia mohitanch...@gmail.com:
  Thanks. Does it mean LZO is not installed by default? How can I install
 LZO?
 
  The LZO library is released under GPL and I believe it can't be
  included in most distributions of Hadoop because of this (can't mix
  GPL with non GPL stuff). It should be easily available though.
 
  On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu sh...@uchicago.edu wrote:
 
  Yes, it is supported by Hadoop sequence file. It is splittable
  by default. If you have installed and specified LZO correctly,
  use these:
 
 
  org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);

  org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);

  org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

  job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);
 
 
  Shi
 
 
 
 
  --
  Ioan Eugen Stan
  http://ieugen.blogspot.com/



 --
 Harsh J



dfs.block.size

2012-02-25 Thread Mohit Anchlia
If I want to change the block size then can I use Configuration in
mapreduce job and set it when writing to the sequence file or does it need
to be cluster wide setting in .xml files?

Also, is there a way to check the block of a given file?
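
In case it helps someone searching the archives: dfs.block.size is read on the client
side when a file is created, so it can be set per job/per file in the Configuration
without touching the cluster-wide xml (assuming the cluster does not override it), and
hadoop fsck <path> -files -blocks -locations prints the blocks of a given file. A small
sketch with a made-up host name, path and size:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side override: files created through this conf should get 128 MB blocks.
    conf.set("dfs.block.size", String.valueOf(128 * 1024 * 1024));

    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:54310/"), conf);
    Path out = new Path("/examples/bigfile.seq");   // hypothetical output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, Text.class, SequenceFile.CompressionType.BLOCK);
    writer.append(new Text("key"), new Text("value"));
    writer.close();
  }
}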


Re: LZO with sequenceFile

2012-02-25 Thread Mohit Anchlia
Thanks. Does it mean LZO is not installed by default? How can I install LZO?

On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu sh...@uchicago.edu wrote:

 Yes, it is supported by Hadoop sequence file. It is splittable
 by default. If you have installed and specified LZO correctly,
 use these:


 org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);

 org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);

 org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

 job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);


 Shi



MapReduce tunning

2012-02-24 Thread Mohit Anchlia
I am looking at some hadoop tuning parameters like io.sort.mb,
mapred.child.javaopts etc.

- My question was where to look at for current setting
- Are these settings configured cluster wide or per job?
- What's the best way to look at reasons of slow performance?
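
For reference: these are ordinary Configuration properties. The values a job actually ran
with show up in that job's job.xml (linked from the JobTracker web UI), the cluster-wide
defaults live in mapred-site.xml on the nodes, and per-job knobs such as io.sort.mb and
mapred.child.java.opts can be overridden when building the JobConf (slot counts like
mapred.tasktracker.map.tasks.maximum are per-tasktracker and cannot be). A hedged sketch
with made-up values:

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  // Hypothetical per-job overrides of two common tuning knobs (old-API JobConf).
  public static JobConf tunedConf() {
    JobConf conf = new JobConf(TuningExample.class);
    conf.set("io.sort.mb", "200");                   // map-side sort buffer, in MB
    conf.set("mapred.child.java.opts", "-Xmx1024m"); // heap for each map/reduce child JVM
    conf.setNumReduceTasks(4);                       // reducer count is always per job
    return conf;
  }
}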


Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
AFAIK there is no default mechanism available for the same in
 hadoop. File is split into blocks just based on the configured block size
 during hdfs copy. While processing the file using Mapreduce the record
 reader takes care of the new lines even if a line spans across multiple
 blocks.

 Could you explain more on the use case that demands such a requirement
 while hdfs copy itself?


 I am using pig's XMLLoader in piggybank to read xml files concatenated in
a text file. But the pig script doesn't work when the file is big enough
that hadoop splits it.

Any suggestions on how I can make it work? Below is my simple script that I
would like to enhance, only if it starts working. Please note this works
for small files.


register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

raw = LOAD '/examples/testfile5.txt' using
org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);

dump raw;


 --Original Message--
 From: Mohit Anchlia
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Splitting files on new line using hadoop fs
 Sent: Feb 23, 2012 01:45

 How can I copy large text files using hadoop fs such that split occurs
 based on blocks + new lines instead of blocks alone? Is there a way to do
 this?



 Regards
 Bejoy K S

 From handheld, Please excuse typos.



Re: Splitting files on new line using hadoop fs

2012-02-22 Thread Mohit Anchlia
Thanks I did post this question to that group. All xml document are
separated by a new line so that shouldn't be the issue, I think.

On Wed, Feb 22, 2012 at 12:44 PM, bejoy.had...@gmail.com wrote:

 **
 Hi Mohit
 I'm not an expert in pig and it'd be better using the pig user group for
 pig specific queries. I'd try to help you with some basic trouble shooting
 of the same

 It sounds strange that pig's XML Loader can't load larger XML files that
 consists of multiple blocks. Or is it like, pig is not able to load the
 concatenated files that you are trying with? If that is the case then it
 could be because of some issues since you are just appending multiple xml
 file contents into a single file.

 Pig users can give you some workarounds how they are dealing with loading
 of small xml files that are stored efficiently.

 Regards
 Bejoy K S

 From handheld, Please excuse typos.
 --
 *From: *Mohit Anchlia mohitanch...@gmail.com
 *Date: *Wed, 22 Feb 2012 12:29:26 -0800
 *To: *common-user@hadoop.apache.org; bejoy.had...@gmail.com
 *Subject: *Re: Splitting files on new line using hadoop fs


 On Wed, Feb 22, 2012 at 12:23 PM, bejoy.had...@gmail.com wrote:

 Hi Mohit
AFAIK there is no default mechanism available for the same in
 hadoop. File is split into blocks just based on the configured block size
 during hdfs copy. While processing the file using Mapreduce the record
 reader takes care of the new lines even if a line spans across multiple
 blocks.

 Could you explain more on the use case that demands such a requirement
 while hdfs copy itself?


  I am using pig's XMLLoader in piggybank to read xml files concatenated
 in a text file. But pig script doesn't work when file is big that causes
 hadoop to split the files.

 Any suggestions on how I can make it work? Below is my simple script that
 I would like to enhance, only if it starts working. Please note this works
 for small files.


 register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

 raw = LOAD '/examples/testfile5.txt using
 org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);

 dump raw;


 --Original Message--
 From: Mohit Anchlia
 To: common-user@hadoop.apache.org
 ReplyTo: common-user@hadoop.apache.org
 Subject: Splitting files on new line using hadoop fs
 Sent: Feb 23, 2012 01:45

 How can I copy large text files using hadoop fs such that split occurs
 based on blocks + new lines instead of blocks alone? Is there a way to do
 this?



 Regards
 Bejoy K S

 From handheld, Please excuse typos.





Streaming job hanging

2012-02-22 Thread Mohit Anchlia
Streaming job just seems to be hanging

12/02/22 17:35:50 INFO streaming.StreamJob: map 0% reduce 0%

-

On the admin page I see that it created 551 input splits. Could someone
suggest a way to find out what might be causing it to hang? I increased
io.sort.mb to 200 MB.

I am using 5 data nodes with 12 CPU, 96G RAM.


Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Mohit
   Rather than just appending the content into a normal text file or
 so, you can create a sequence file with the individual smaller file content
 as values.

  Thanks. I was planning to use pig's 
 org.apache.pig.piggybank.storage.XMLLoader
for processing. Would it work with sequence file?

This text file that I was referring to would be in hdfs itself. Is it still
different than using sequence file?

 Regards
 Bejoy.K.S

 On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  We have small xml files. Currently I am planning to append these small
  files to one file in hdfs so that I can take advantage of splits, larger
  blocks and sequential IO. What I am unsure is if it's ok to append one
 file
  at a time to this hdfs file
 
  Could someone suggest if this is ok? Would like to know how other do it.
 



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
I am trying to look for examples that demonstrates using sequence files
including writing to it and then running mapred on it, but unable to find
one. Could you please point me to some examples of sequence files?

On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Mohit
  AFAIK XMLLoader in pig won't be suited for Sequence Files. Please
 post the same to Pig user group for some workaround over the same.
 SequenceFIle is a preferred option when we want to store small
 files in hdfs and needs to be processed by MapReduce as it stores data in
 key value format.Since SequenceFileInputFormat is available at your
 disposal you don't need any custom input formats for processing the same
 using map reduce. It is a cleaner and better approach compared to just
 appending small xml file contents into a big file.

 On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com
 wrote:
 
   Mohit
 Rather than just appending the content into a normal text file or
   so, you can create a sequence file with the individual smaller file
  content
   as values.
  
Thanks. I was planning to use pig's
  org.apache.pig.piggybank.storage.XMLLoader
  for processing. Would it work with sequence file?
 
  This text file that I was referring to would be in hdfs itself. Is it
 still
  different than using sequence file?
 
   Regards
   Bejoy.K.S
  
   On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
We have small xml files. Currently I am planning to append these
 small
files to one file in hdfs so that I can take advantage of splits,
  larger
blocks and sequential IO. What I am unsure is if it's ok to append
 one
   file
at a time to this hdfs file
   
Could someone suggest if this is ok? Would like to know how other do
  it.
   
  
 



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Thanks. How does mapreduce work on a sequence file? Is there an example I can
look at?

On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee 
arkoprovomukher...@gmail.com wrote:

 Hi,

 Let's say all the smaller files are in the same directory.

 Then u can do:

  // Output path
  BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(output_path, true)));

  // Input directory
  FileStatus[] output_files = fs.listStatus(new Path(input_path));

  for (int i = 0; i < output_files.length; i++)
  {
     BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(output_files[i].getPath())));

     String data = reader.readLine();

     while (data != null)
     {
        output.write(data);
        data = reader.readLine();   // advance to the next line, otherwise this loops forever
     }

     reader.close();
  }

  output.close();


 In case you have the files in multiple directories, call the code for each
 of them with different input paths.

 Hope this helps!

 Cheers

 Arko

 On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I am trying to look for examples that demonstrates using sequence files
  including writing to it and then running mapred on it, but unable to find
  one. Could you please point me to some examples of sequence files?
 
  On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks bejoy.had...@gmail.com
 wrote:
 
   Hi Mohit
AFAIK XMLLoader in pig won't be suited for Sequence Files. Please
   post the same to Pig user group for some workaround over the same.
   SequenceFIle is a preferred option when we want to store small
   files in hdfs and needs to be processed by MapReduce as it stores data
 in
   key value format.Since SequenceFileInputFormat is available at your
   disposal you don't need any custom input formats for processing the
 same
   using map reduce. It is a cleaner and better approach compared to just
   appending small xml file contents into a big file.
  
   On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks bejoy.had...@gmail.com
   wrote:
   
 Mohit
   Rather than just appending the content into a normal text
 file
  or
 so, you can create a sequence file with the individual smaller file
content
 as values.

  Thanks. I was planning to use pig's
org.apache.pig.piggybank.storage.XMLLoader
for processing. Would it work with sequence file?
   
This text file that I was referring to would be in hdfs itself. Is it
   still
different than using sequence file?
   
 Regards
 Bejoy.K.S

 On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia 
   mohitanch...@gmail.com
 wrote:

  We have small xml files. Currently I am planning to append these
   small
  files to one file in hdfs so that I can take advantage of splits,
larger
  blocks and sequential IO. What I am unsure is if it's ok to
 append
   one
 file
  at a time to this hdfs file
 
  Could someone suggest if this is ok? Would like to know how other
  do
it.
 

   
  
 



Re: Writing to SequenceFile fails

2012-02-21 Thread Mohit Anchlia
I am past this error. Looks like I needed to use CDH libraries. I changed
my maven repo. Now I am stuck at

org.apache.hadoop.security.AccessControlException since I am not writing
as the user that owns the file. Looking online for solutions.


On Tue, Feb 21, 2012 at 12:48 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I am trying to write to the sequence file and it seems to be failing. Not
 sure why, Is there something I need to do

 String uri = "hdfs://db1:54310/examples/testfile1.seq";

 FileSystem fs = FileSystem.get(URI.create(uri), conf);  // Fails on this line


 Caused by: java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:375)
 at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)



Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Need some more help. I wrote a sequence file using the below code, but now when I
run a mapreduce job on that file I get java.lang.ClassCastException:
org.apache.hadoop.io.LongWritable cannot be cast to
org.apache.hadoop.io.Text, even though I didn't use LongWritable when I
originally wrote the sequence file.

//Code to write to the sequence file. There is no LongWritable here

org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();

try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
            value.getClass(), SequenceFile.CompressionType.RECORD);

    int i = 1;
    long timestamp = System.currentTimeMillis();

    while ((line = buffer.readLine()) != null) {
        key.set(String.valueOf(timestamp));
        value.set(line);
        writer.append(key, value);
        i++;
    }


On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee 
arkoprovomukher...@gmail.com wrote:

 Hi,

 I think the following link will help:
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

 Cheers
 Arko

 On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Sorry may be it's something obvious but I was wondering when map or
 reduce
  gets called what would be the class used for key and value? If I used
  org.apache.hadoop.io.Text
  value = *new* org.apache.hadoop.io.Text(); would the map be called with
   Text class?
 
  public void map(LongWritable key, Text value, Context context) throws
  IOException, InterruptedException {
 
 
  On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee 
  arkoprovomukher...@gmail.com wrote:
 
   Hi Mohit,
  
   I am not sure that I understand your question.
  
   But you can write into a file using:
   *BufferedWriter output = new BufferedWriter
   (new OutputStreamWriter(fs.create(my_path,true)));*
   *output.write(data);*
   *
   *
   Then you can pass that file as the input to your MapReduce program.
  
   *FileInputFormat.addInputPath(jobconf, new Path (my_path) );*
  
   From inside your Map/Reduce methods, I think you should NOT be
 tinkering
   with the input / output paths of that Map/Reduce job.
   Cheers
   Arko
  
  
   On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia mohitanch...@gmail.com
   wrote:
  
Thanks How does mapreduce work on sequence file? Is there an example
 I
   can
look at?
   
On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee 
arkoprovomukher...@gmail.com wrote:
   
 Hi,

 Let's say all the smaller files are in the same directory.

 Then u can do:

 *BufferedWriter output = new BufferedWriter
 (newOutputStreamWriter(fs.create(output_path,
 true)));  // Output path*

 *FileStatus[] output_files = fs.listStatus(new Path(input_path));
  //
Input
 directory*

 *for ( int i=0; i  output_files.length; i++ )  *

 *{*

 *   BufferedReader reader = new

  
 BufferedReader(newInputStreamReader(fs.open(output_files[i].getPath(;
 *

 *   String data;*

 *   data = reader.readLine();*

 *   while ( data != null ) *

 *  {*

 *output.write(data);*

 *  }*

 *reader.close*

 *}*

 *output.close*


 In case you have the files in multiple directories, call the code
 for
each
 of them with different input paths.

 Hope this helps!

 Cheers

 Arko

 On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia 
  mohitanch...@gmail.com
 wrote:

  I am trying to look for examples that demonstrates using sequence
   files
  including writing to it and then running mapred on it, but unable
  to
find
  one. Could you please point me to some examples of sequence
 files?
 
  On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks 
 bejoy.had...@gmail.com
  
 wrote:
 
   Hi Mohit
AFAIK XMLLoader in pig won't be suited for Sequence Files.
Please
   post the same to Pig user group for some workaround over the
  same.
   SequenceFIle is a preferred option when we want to
 store
small
   files in hdfs and needs to be processed by MapReduce as it
 stores
data
 in
   key value format.Since SequenceFileInputFormat is available at
  your
   disposal you don't need any custom input formats for processing
  the
 same
   using map reduce. It is a cleaner and better approach compared
 to
just
   appending small xml file contents into a big file.
  
   On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks 
   bejoy.had...@gmail.com
   wrote:
   
 Mohit
   Rather than just appending the content into a normal
  text
 file
  or
 so, you can create

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
It looks like in mapper values are coming as binary instead of Text. Is
this expected from sequence file? I initially wrote SequenceFile with Text
values.

On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Need some more help. I wrote sequence file using below code but now when I
 run mapreduce job I get file.*java.lang.ClassCastException*:
 org.apache.hadoop.io.LongWritable cannot be cast to
 org.apache.hadoop.io.Text even though I didn't use LongWritable when I
 originally wrote to the sequence

 //Code to write to the sequence file. There is no LongWritable here

 org.apache.hadoop.io.Text key =
 *new* org.apache.hadoop.io.Text();

 BufferedReader buffer =
 *new* BufferedReader(*new* FileReader(filePath));

 String line =
 *null*;

 org.apache.hadoop.io.Text value =
 *new* org.apache.hadoop.io.Text();

 *try* {

 writer = SequenceFile.*createWriter*(fs, conf, path, key.getClass(),

 value.getClass(), SequenceFile.CompressionType.
 *RECORD*);

 *int* i = 1;

 *long* timestamp=System.*currentTimeMillis*();

 *while* ((line = buffer.readLine()) != *null*) {

 key.set(String.*valueOf*(timestamp));

 value.set(line);

 writer.append(key, value);

 i++;

 }


   On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee 
 arkoprovomukher...@gmail.com wrote:

 Hi,

 I think the following link will help:
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

 Cheers
 Arko

 On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Sorry may be it's something obvious but I was wondering when map or
 reduce
  gets called what would be the class used for key and value? If I used
  org.apache.hadoop.io.Text
  value = *new* org.apache.hadoop.io.Text(); would the map be called with
   Text class?
 
  public void map(LongWritable key, Text value, Context context) throws
  IOException, InterruptedException {
 
 
  On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee 
  arkoprovomukher...@gmail.com wrote:
 
   Hi Mohit,
  
   I am not sure that I understand your question.
  
   But you can write into a file using:
   *BufferedWriter output = new BufferedWriter
   (new OutputStreamWriter(fs.create(my_path,true)));*
   *output.write(data);*
   *
   *
   Then you can pass that file as the input to your MapReduce program.
  
   *FileInputFormat.addInputPath(jobconf, new Path (my_path) );*
  
   From inside your Map/Reduce methods, I think you should NOT be
 tinkering
   with the input / output paths of that Map/Reduce job.
   Cheers
   Arko
  
  
   On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
Thanks How does mapreduce work on sequence file? Is there an
 example I
   can
look at?
   
On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee 
arkoprovomukher...@gmail.com wrote:
   
 Hi,

 Let's say all the smaller files are in the same directory.

 Then u can do:

 *BufferedWriter output = new BufferedWriter
 (newOutputStreamWriter(fs.create(output_path,
 true)));  // Output path*

 *FileStatus[] output_files = fs.listStatus(new Path(input_path));
  //
Input
 directory*

 *for ( int i=0; i  output_files.length; i++ )  *

 *{*

 *   BufferedReader reader = new

  
 BufferedReader(newInputStreamReader(fs.open(output_files[i].getPath(;
 *

 *   String data;*

 *   data = reader.readLine();*

 *   while ( data != null ) *

 *  {*

 *output.write(data);*

 *  }*

 *reader.close*

 *}*

 *output.close*


 In case you have the files in multiple directories, call the code
 for
each
 of them with different input paths.

 Hope this helps!

 Cheers

 Arko

 On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia 
  mohitanch...@gmail.com
 wrote:

  I am trying to look for examples that demonstrates using
 sequence
   files
  including writing to it and then running mapred on it, but
 unable
  to
find
  one. Could you please point me to some examples of sequence
 files?
 
  On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks 
 bejoy.had...@gmail.com
  
 wrote:
 
   Hi Mohit
AFAIK XMLLoader in pig won't be suited for Sequence
 Files.
Please
   post the same to Pig user group for some workaround over the
  same.
   SequenceFIle is a preferred option when we want to
 store
small
   files in hdfs and needs to be processed by MapReduce as it
 stores
data
 in
   key value format.Since SequenceFileInputFormat is available at
  your
   disposal you don't need any custom input formats for
 processing
  the
 same
   using map reduce. It is a cleaner and better approach
 compared to
just
   appending small xml file contents into a big file.
  
   On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia 
 mohitanch

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Mohit Anchlia
Finally figured it out. I needed to use SequenceFileAsTextInputFormat.
There is just a lack of examples, which makes it difficult when you start.
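
To close the loop for anyone reading this later, a minimal sketch of the reading side
that made this work: with SequenceFileAsTextInputFormat both the key and the value reach
the mapper as Text, which is consistent with the earlier LongWritable-to-Text
ClassCastException going away. Class names and paths are made up:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class ReadSeqFileJob {

  public static class SeqMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value, OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(key, value);   // real code would parse the xml held in 'value'
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ReadSeqFileJob.class);
    conf.setInputFormat(SequenceFileAsTextInputFormat.class);  // the key piece
    conf.setMapperClass(SeqMapper.class);
    conf.setNumReduceTasks(1);                                 // default IdentityReducer
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("/examples/testfile1.seq"));
    FileOutputFormat.setOutputPath(conf, new Path("/examples/out"));
    JobClient.runJob(conf);
  }
}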

On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 It looks like in mapper values are coming as binary instead of Text. Is
 this expected from sequence file? I initially wrote SequenceFile with Text
 values.


 On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 Need some more help. I wrote sequence file using below code but now when
 I run mapreduce job I get file.*java.lang.ClassCastException*:
 org.apache.hadoop.io.LongWritable cannot be cast to
 org.apache.hadoop.io.Text even though I didn't use LongWritable when I
 originally wrote to the sequence

 //Code to write to the sequence file. There is no LongWritable here

 org.apache.hadoop.io.Text key =
 *new* org.apache.hadoop.io.Text();

 BufferedReader buffer =
 *new* BufferedReader(*new* FileReader(filePath));

 String line =
 *null*;

 org.apache.hadoop.io.Text value =
 *new* org.apache.hadoop.io.Text();

 *try* {

 writer = SequenceFile.*createWriter*(fs, conf, path, key.getClass(),

 value.getClass(), SequenceFile.CompressionType.
 *RECORD*);

 *int* i = 1;

 *long* timestamp=System.*currentTimeMillis*();

 *while* ((line = buffer.readLine()) != *null*) {

 key.set(String.*valueOf*(timestamp));

 value.set(line);

 writer.append(key, value);

 i++;

 }


   On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee 
 arkoprovomukher...@gmail.com wrote:

 Hi,

 I think the following link will help:
 http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

 Cheers
 Arko

 On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  Sorry may be it's something obvious but I was wondering when map or
 reduce
  gets called what would be the class used for key and value? If I used
  org.apache.hadoop.io.Text
  value = *new* org.apache.hadoop.io.Text(); would the map be called
 with
   Text class?
 
  public void map(LongWritable key, Text value, Context context) throws
  IOException, InterruptedException {
 
 
  On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee 
  arkoprovomukher...@gmail.com wrote:
 
   Hi Mohit,
  
   I am not sure that I understand your question.
  
   But you can write into a file using:
   *BufferedWriter output = new BufferedWriter
   (new OutputStreamWriter(fs.create(my_path,true)));*
   *output.write(data);*
   *
   *
   Then you can pass that file as the input to your MapReduce program.
  
   *FileInputFormat.addInputPath(jobconf, new Path (my_path) );*
  
   From inside your Map/Reduce methods, I think you should NOT be
 tinkering
   with the input / output paths of that Map/Reduce job.
   Cheers
   Arko
  
  
   On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia 
 mohitanch...@gmail.com
   wrote:
  
Thanks How does mapreduce work on sequence file? Is there an
 example I
   can
look at?
   
On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee 
arkoprovomukher...@gmail.com wrote:
   
 Hi,

 Let's say all the smaller files are in the same directory.

 Then u can do:

 *BufferedWriter output = new BufferedWriter
 (newOutputStreamWriter(fs.create(output_path,
 true)));  // Output path*

 *FileStatus[] output_files = fs.listStatus(new
 Path(input_path));  //
Input
 directory*

 *for ( int i=0; i  output_files.length; i++ )  *

 *{*

 *   BufferedReader reader = new

  
 BufferedReader(newInputStreamReader(fs.open(output_files[i].getPath(;
 *

 *   String data;*

 *   data = reader.readLine();*

 *   while ( data != null ) *

 *  {*

 *output.write(data);*

 *  }*

 *reader.close*

 *}*

 *output.close*


 In case you have the files in multiple directories, call the
 code for
each
 of them with different input paths.

 Hope this helps!

 Cheers

 Arko

 On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia 
  mohitanch...@gmail.com
 wrote:

  I am trying to look for examples that demonstrates using
 sequence
   files
  including writing to it and then running mapred on it, but
 unable
  to
find
  one. Could you please point me to some examples of sequence
 files?
 
  On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks 
 bejoy.had...@gmail.com
  
 wrote:
 
   Hi Mohit
AFAIK XMLLoader in pig won't be suited for Sequence
 Files.
Please
   post the same to Pig user group for some workaround over the
  same.
   SequenceFIle is a preferred option when we want to
 store
small
   files in hdfs and needs to be processed by MapReduce as it
 stores
data
 in
   key value format.Since SequenceFileInputFormat is available
 at
  your
   disposal you don't need any custom input formats for
 processing

Re: Processing small xml files

2012-02-18 Thread Mohit Anchlia
On Fri, Feb 17, 2012 at 11:37 PM, Srinivas Surasani vas...@gmail.comwrote:

 Hi Mohit,

 You can use Pig for processing XML files. PiggyBank has build in load
 function to load the XML files.
  Also you can specify pig.maxCombinedSplitSize  and
 pig.splitCombination for efficient processing.


I can't seem to find examples of how to do xml processing in Pig. Can you
please send me some pointers? Basically I need to convert my xml to more
structured format using hadoop to write it to database.


 On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com
 wrote:
 
  I'm not sure what you mean by flat format here.
 
  In my scenario, I have an file input.xml that looks like this.
 
  <myfile>
    <section>
      <value>1</value>
    </section>
    <section>
      <value>2</value>
    </section>
  </myfile>
 
  input.xml is a plain text file. Not a sequence file. If I read it with
 the
  XMLInputFormat my mapper gets called with (key, value) pairs that look
 like
  this:
 
  (<offset>, <section><value>1</value></section>)
  (<offset>, <section><value>2</value></section>)
 
  Where the keys are numerical offsets into the file. I then use this
  information to write a sequence file with these (key, value) pairs. So
 my
  Hadoop job that uses XMLInputFormat takes a text file as input and
 produces
  a sequence file as output.
 
  I don't know a rule of thumb for how many small files is too many. Maybe
  someone else on the list can chime in. I just know that when your
  throughput gets slow that's one possible cause to investigate.
 
 
  I need to install hadoop. Does this xmlinput format comes as part of the
  install? Can you please give me some pointers that would help me install
  hadoop and xmlinputformat if necessary?



 --
 -- Srinivas
 srini...@cloudwick.com



Re: Hadoop install

2012-02-18 Thread Mohit Anchlia
Thanks Do I have to do something special to get Mahout xmlinput format and
Pig with the new release of hadoop?

On Sat, Feb 18, 2012 at 6:42 AM, Tom Deutsch tdeut...@us.ibm.com wrote:

 Mohit - one place to start is here;

 http://hadoop.apache.org/common/releases.html#Download

 The release notes, as always, are well worth reading.

 
 Tom Deutsch
 Program Director
 Information Management
 Big Data Technologies
 IBM
 3565 Harbor Blvd
 Costa Mesa, CA 92626-1420
 tdeut...@us.ibm.com




 Mohit Anchlia mohitanch...@gmail.com
 02/18/2012 06:24 AM
 Please respond to
 common-user@hadoop.apache.org


 To
 common-user@hadoop.apache.org
 cc

 Subject
 Hadoop install






 What's the best way or guide to install latest hadoop. Is the latest
 Hadoop
 still .20 which comes up in google search. Could someone guide me with the
 latest hadoop distribution. I also need pig and mahout xmlinputformat.




Re: Processing small xml files

2012-02-17 Thread Mohit Anchlia
On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

 I'm not sure what you mean by flat format here.

 In my scenario, I have an file input.xml that looks like this.

 <myfile>
   <section>
     <value>1</value>
   </section>
   <section>
     <value>2</value>
   </section>
 </myfile>

 input.xml is a plain text file. Not a sequence file. If I read it with the
 XMLInputFormat my mapper gets called with (key, value) pairs that look like
 this:

 (<offset>, <section><value>1</value></section>)
 (<offset>, <section><value>2</value></section>)

 Where the keys are numerical offsets into the file. I then use this
 information to write a sequence file with these (key, value) pairs. So my
 Hadoop job that uses XMLInputFormat takes a text file as input and produces
 a sequence file as output.

 I don't know a rule of thumb for how many small files is too many. Maybe
 someone else on the list can chime in. I just know that when your
 throughput gets slow that's one possible cause to investigate.


I need to install hadoop. Does this xmlinput format comes as part of the
install? Can you please give me some pointers that would help me install
hadoop and xmlinputformat if necessary?
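
Not an install guide, but a hedged sketch of how the XMLInputFormat step is usually wired
up once the Mahout jar is on the job classpath. The property names xmlinput.start /
xmlinput.end and the package of XmlInputFormat are assumptions based on the Mahout
releases current at the time, and may differ in your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Assumed location -- ships with Mahout, and the package has moved between releases:
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlToSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tags that delimit one record; each <section>...</section> becomes one map value.
    conf.set("xmlinput.start", "<section>");
    conf.set("xmlinput.end", "</section>");

    Job job = new Job(conf, "xml-to-seqfile");
    job.setJarByClass(XmlToSeqFile.class);
    job.setInputFormatClass(XmlInputFormat.class);        // emits (file offset, xml blob)
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // No mapper/reducer set: the default identity classes pass the records through.
    FileInputFormat.addInputPath(job, new Path("input.xml"));
    FileOutputFormat.setOutputPath(job, new Path("xml-records"));
    job.waitForCompletion(true);
  }
}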


Re: Processing small xml files

2012-02-12 Thread Mohit Anchlia
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill bill...@gmail.com wrote:
 I've used the Mahout XMLInputFormat. It is the right tool if you have an
 XML file with one type of section repeated over and over again and want to
 turn that into Sequence file where each repeated section is a value. I've
 found it helpful as a preprocessing step for converting raw XML input into
 something that can be handled by Hadoop jobs.

Thanks for the input.

Do you first convert it into flat format and then run another hadoop
job or do you just read xml sequence file and then perform reduce on
that. Is there an advantage of first converting it into a flat file
format?

 If you're worried about having lots of small files--specifically, about
 overwhelming your namenode because you have too many small
 files--the XMLInputFormat won't help with that. However, it may be possible
 to concatenate the small files into larger files, then have a Hadoop job
 that uses XMLInputFormat transform the large files into sequence files.

How many are too many for namenode? We have around 100M files and 100M
files every year


Developing MapReduce

2011-10-10 Thread Mohit Anchlia
I use eclipse. Is this http://wiki.apache.org/hadoop/EclipsePlugIn
still the best way to develop mapreduce programs in hadoop? Just want
to make sure before I go down this path.

Or should I just add hadoop jars in my classpath of eclipse and create
my own MapReduce programs.

Thanks


Re: incremental loads into hadoop

2011-10-03 Thread Mohit Anchlia
This process of managing looks like more pain long term. Would it be
easier to store in Hbase which has smaller block size?

What's the avg. file size?

On Sun, Oct 2, 2011 at 7:34 PM, Vitthal Suhas Gogate
gog...@hortonworks.com wrote:
 Agree with Bejoy, although to minimize the processing latency you can still
 choose to write more frequently to HDFS resulting into more number of
 smaller size files on HDFS rather than waiting to accumulate large size data
 before writing to HDFS.  As you may have more number of smaller files, it
 may be good to use combine file input format to not have large number of
 very small map tasks (one per file if less than block size).  Now after you
 process the input data, you may not want to leave these large number of
 small files on HDFS and hence  you can use a Hadoop Archive (HAR) tool to
 combine and store them into small number of bigger size files.. You can run
 this tool periodically in the background to archive the input that is
 already processed..  Archive tool itself is implemented as M/R job.

 Also to get some level of atomicity, you may copy the data to HDFS at a
 temporary location before moving it to final source partition (or
 directory).  Existing data loading tools may be doing that already.

 --Suhas Gogate


 On Sun, Oct 2, 2011 at 11:12 AM, bejoy.had...@gmail.com wrote:

 Sam
     Your understanding is right, hadoop  definitely works great with large
 volume of data. But not necessarily every file should be in the range of
 Giga,Tera or Peta bytes. Mostly when said hadoop process tera bytes of data,
 It is the total data processed by a map reduce job(rather jobs, most use
 cases uses more than one map reduce job for processing). It can be 10K files
 that make up the whole data.  Why not large number of small files? The over
 head on the name node in housekeeping all these large amount of meta
 data(file- block information) would be huge and there is definitely limits
 to it. But you can store smaller files together in splittable compressed
 formats. In general It is better to keep your file sizes atleast  same or
 more than your hdfs block size. In default it is 64Mb but larger clusters
 have higher values as multiples of 64. If your hdfs block size or your file
 sizes are lesser than the map reduce input split size then it is better
 using InputFormats like CombinedInput Format or so for MR jobs. Usually the
 MR input split size is equal to your hdfs block size. In short as a better
 practice your single file size should be at least equal to one hdfs block
 size.

 The approach of keeping a file opened for long to write and then reading
 the same parallely with a  map reduce, I fear it would work. AFAIK it won't.
 When a write is going on some blocks or the file itself would be locked, not
 really sure its the full file being locked or not. In short some blocks
 wouldn't be available for the concurrent Map Reduce Program during its
 processing.
       In your case a quick solution that comes to my mind is keep your real
 time data writing into the flume queue/buffer . Set it to a desired size
 once the queue gets full the data would be dumped into hdfs. Then as per
 your requirement you can kick off your jobs. If you are running MR jobs on
 very high frequency then make sure that for every run you have enough data
 to process and choose your max number of mappers and reducers effectively
 and  efficiently
   Then as the last one, I don't think for normal cases you don't need to
 dump your large volume of data into lfs and then do a copyFromLocal into
 hdfs. Tools like flume are build for those purposes I guess. I'm not an
 expert on Flume, you may need to do more reading on the same before
 implementing.

 This what I feel on your use case. But let's leave it open for the experts
 to comment.

 Hope it helps.
 Regards
 Bejoy K S

 -Original Message-
 From: Sam Seigal selek...@yahoo.com
 Sender: saurabh@gmail.com
 Date: Sat, 1 Oct 2011 15:50:46
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: incremental loads into hadoop

 Hi Bejoy,

 Thanks for the response.

 While reading about Hadoop, I have come across threads where people
 claim that Hadoop is not a good fit for a large amount of small files.
 It is good for files that are gigabyes/petabytes in size.

 If I am doing incremental loads, let's say every hour. Do I need to
 wait until maybe at the end of the day when enough data has been
 collected to start off a MapReduce job ? I am wondering if an open
 file that is continuously being written to can at the same time be
 used as an input to an M/R job ...

 Also, let's say I did not want to do a load straight off the DB. The
 service, when committing a transaction to the OLTP system, sends a
 message for that transaction to  a Hadoop Service that then writes the
 transaction into HDFS  (the services are connected to each other via a
 persisted queue, hence are eventually consistent, but that is 

Re: Binary content

2011-09-01 Thread Mohit Anchlia
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck
dieter.plaeti...@intec.ugent.be wrote:
 On Wed, 31 Aug 2011 08:44:42 -0700
 Mohit Anchlia mohitanch...@gmail.com wrote:

 Does map-reduce work well with binary contents in the file? This
 binary content is basically some CAD files and map reduce program need
 to read these files using some proprietry tool extract values and do
 some processing. Wondering if there are others doing similar type of
 processing. Best practices etc.

 yes, it works.  you just need to select the right input format.
 Personally i store all my binary files into a sequencefile (because my binary 
 files are small)

Thanks! Is there a specific tutorial I can focus on to see how it could be done?

 Dieter
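
Not a full tutorial, but a rough sketch of the pattern being described: pack each small
binary (CAD) file into a SequenceFile as (file name, raw bytes), and let the mapper hand
the BytesWritable payload to the proprietary tool. Paths and directory handling are made
up:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackBinaryFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/examples/cadfiles.seq");   // hypothetical target in HDFS

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      // args[0] is a local directory full of small binary files
      for (File f : new File(args[0]).listFiles()) {
        byte[] bytes = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, bytes, 0, bytes.length);   // whole file becomes the value
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}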



Binary content

2011-08-31 Thread Mohit Anchlia
Does map-reduce work well with binary contents in the file? This
binary content is basically some CAD files and map reduce program need
to read these files using some proprietry tool extract values and do
some processing. Wondering if there are others doing similar type of
processing. Best practices etc.


Re: Question about RAID controllers and hadoop

2011-08-11 Thread Mohit Anchlia
On Thu, Aug 11, 2011 at 3:26 PM, Charles Wimmer cwim...@yahoo-inc.com wrote:
 We currently use P410s in 12 disk system.  Each disk is set up as a RAID0 
 volume.  Performance is at least as good as a bare disk.

Can you please share what throughput you see with P410s? Are these SATA or SAS?



 On 8/11/11 3:23 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com 
 wrote:

 If I read that email chain correctly then they were referring to the classic 
 JBOD vs multiple disks striped together conversation. The conversation that 
 was started here is referring to JBOD vs 1 RAID 0 per disk and the effects of 
 the raid controller on those independent raids.

 Matt

 -Original Message-
 From: Kai Voigt [mailto:k...@123.org]
 Sent: Thursday, August 11, 2011 5:17 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Question about RAID controllers and hadoop

 Yahoo did some testing 2 years ago: 
 http://markmail.org/message/xmzc45zi25htr7ry

 But updated benchmark would be interesting to see.

 Kai

 Am 12.08.2011 um 00:13 schrieb GOEKE, MATTHEW (AG/1000):

 My assumption would be that having a set of 4 raid 0 disks would actually be 
 better than having a controller that allowed pure JBOD of 4 disks due to the 
 cache on the controller. If anyone has any personal experience with this I 
 would love to know performance numbers but our infrastructure guy is doing 
 tests on exactly this over the next couple days so I will pass it along once 
 we have it.

 Matt

 -Original Message-
 From: Bharath Mundlapudi [mailto:bharathw...@yahoo.com]
 Sent: Thursday, August 11, 2011 5:00 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Question about RAID controllers and hadoop

 True, you need a P410 controller. You can create RAID0 for each disk to make 
 it as JBOD.


 -Bharath



 
 From: Koert Kuipers ko...@tresata.com
 To: common-user@hadoop.apache.org
 Sent: Thursday, August 11, 2011 2:50 PM
 Subject: Question about RAID controllers and hadoop

 Hello all,
 We are considering using low end HP proliant machines (DL160s and DL180s)
 for cluster nodes. However with these machines if you want to do more than 4
 hard drives then HP puts in a P410 raid controller. We would configure the
 RAID controller to function as JBOD, by simply creating multiple RAID
 volumes with one disk. Does anyone have experience with this setup? Is it a
 good idea, or am i introducing a i/o bottleneck?
 Thanks for your help!
 Best, Koert
 This e-mail message may contain privileged and/or confidential information, 
 and is intended to be received only by persons entitled
 to receive such information. If you have received this e-mail in error, 
 please notify the sender immediately. Please delete it and
 all attachments from any servers, hard drives or any other media. Other use 
 of this e-mail by you is strictly prohibited.

 All e-mails and attachments sent and received are subject to monitoring, 
 reading and archival by Monsanto, including its
 subsidiaries. The recipient of this e-mail is solely responsible for 
 checking for the presence of Viruses or other Malware.
 Monsanto, along with its subsidiaries, accepts no liability for any damage 
 caused by any such code transmitted by or accompanying
 this e-mail or any attachment.


 The information contained in this email may be subject to the export control 
 laws and regulations of the United States, potentially
 including but not limited to the Export Administration Regulations (EAR) and 
 sanctions regulations issued by the U.S. Department of
 Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this 
 information you are obligated to comply with all
 applicable U.S. export laws and regulations.



 --
 Kai Voigt
 k...@123.org








Re: maprd vs mapreduce api

2011-08-05 Thread Mohit Anchlia
On Fri, Aug 5, 2011 at 3:42 PM, Stevens, Keith D. steven...@llnl.gov wrote:
 The Mapper and Reducer class in org.apache.hadoop.mapreduce implement the 
 identity function.  So you should be able to just do

 conf.setMapperClass(org.apache.hadoop.mapreduce.Mapper.class);
 conf.setReducerClass(org.apache.hadoop.mapreduce.Reducer.class);

 without having to implement your own no-op classes.

 I recommend reading the javadoc for differences between the old api and the 
 new api, for example 
 http://hadoop.apache.org/common/docs/r0.20.2/api/index.html indicates the 
 different functionality of Mapper in the new api and it's dual use as the 
 identity mapper.

Sorry for asking on this thread :) Does Definitive Guide 2 cover the new api?

 Cheers,
 --Keith

 On Aug 5, 2011, at 1:15 PM, garpinc wrote:


 I was following this tutorial on version 0.19.1

 http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html

 I however wanted to use the latest version of api 0.20.2

 The original code in tutorial had following lines
 conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
 conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

 both Identity classes are deprecated.. So seemed the solution was to create
 mapper and reducer as follows:
 public static class NOOPMapper
      extends Mapper<Text, IntWritable, Text, IntWritable> {

   public void map(Text key, IntWritable value, Context context
                   ) throws IOException, InterruptedException {

       context.write(key, value);

   }
 }

 public static class NOOPReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
   private IntWritable result = new IntWritable();

   public void reduce(Text key, Iterable<IntWritable> values,
                      Context context
                      ) throws IOException, InterruptedException {
     context.write(key, result);
   }
 }


 And then with code:
               Configuration conf = new Configuration();
               Job job = new Job(conf, "testdriver");

               job.setOutputKeyClass(Text.class);
               job.setOutputValueClass(IntWritable.class);

               job.setInputFormatClass(TextInputFormat.class);
               job.setOutputFormatClass(TextOutputFormat.class);

               FileInputFormat.addInputPath(job, new Path("In"));
               FileOutputFormat.setOutputPath(job, new Path("Out"));

               job.setMapperClass(NOOPMapper.class);
               job.setReducerClass(NOOPReducer.class);

               job.waitForCompletion(true);


 However I get this message
 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
 cast to org.apache.hadoop.io.Text
       at TestDriver$NOOPMapper.map(TestDriver.java:1)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
       at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 11/08/01 16:41:01 INFO mapred.JobClient:  map 0% reduce 0%
 11/08/01 16:41:01 INFO mapred.JobClient: Job complete: job_local_0001
 11/08/01 16:41:01 INFO mapred.JobClient: Counters: 0



 Can anyone tell me what I need for this to work.

 Attached is full code..
 http://old.nabble.com/file/p32174859/TestDriver.java TestDriver.java
 --
 View this message in context: 
 http://old.nabble.com/maprd-vs-mapreduce-api-tp32174859p32174859.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: Hadoop cluster network requirement

2011-08-01 Thread Mohit Anchlia
Assuming everything is up, this solution still will not scale given the latency,
TCP/IP buffers, sliding window, etc. See BDP (bandwidth-delay product).

Sent from my iPad

On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote:

 
 Yeah what he said.
 Its never a good idea.
 Forget about losing a NN or a Rack, but just losing connectivity between data 
 centers. (It happens more than you think.)
 Your entire cluster in both data centers go down. Boom!
 
 Its a bad design. 
 
 You're better off doing two different clusters.
 
 Is anyone really trying to sell this as a design? That's even more scary.
 
 
 Subject: Re: Hadoop cluster network requirement
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 20:28:53 -0700
 To: common-user@hadoop.apache.org; saq...@margallacomm.com
 
 
 On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
 
 Thanks, I'm independently doing some digging into Hadoop networking
 requirements and 
 had a couple of quick follow-ups. Could I have some specific info on why
 different data centers 
 cannot be supported for master node and data node comms?
 Also, what 
 may be the benefits/use cases for such a scenario?
 
Most people who try to put the NN and DNs in different data centers are 
 trying to achieve disaster recovery:  one file system in multiple locations. 
  That isn't the way HDFS is designed and it will end in tears. There are 
 multiple problems:
 
 1) no guarantee that one block replica will be each data center (thereby 
 defeating the whole purpose!)
 2) assuming one can work out problem 1, during a network break, the NN will 
 lose contact from one half of the  DNs, causing a massive network 
 replication storm
 3) if one using MR on top of this HDFS, the shuffle will likely kill the 
 network in between (making MR performance pretty dreadful) is going to cause 
 delays for the DN heartbeats
 4) I don't even want to think about rebalancing.
 
... and I'm sure a lot of other problems I'm forgetting at the moment.  
 So don't do it.
 
If you want disaster recovery, set up two completely separate HDFSes and 
 run everything in parallel.
 


Re: Moving Files to Distributed Cache in MapReduce

2011-07-29 Thread Mohit Anchlia
Is this what you are looking for?

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

search for jobConf

On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote:
 Thanks for the response! However, I'm having an issue with this line

 Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

 because conf has private access in org.apache.hadoop.configured

 On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.comwrote:

 I hope my previous reply helps...

 On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote:

  After moving it to the distributed cache, how would I call it within my
  MapReduce program?
 
  On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com
  wrote:
 
   Did you try using -files option in your hadoop jar command as:
  
   /usr/bin/hadoop jar jar name main class name -files  absolute path
  of
   file to be added to distributed cache input dir output dir
  
  
   On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu
  wrote:
  
Slight modification: I now know how to add files to the distributed
  file
cache, which can be done via this command placed in the main or run
   class:
   
       DistributedCache.addCacheFile(new
  URI(/user/hadoop/thefile.dat),
conf);
   
However I am still having trouble locating the file in the
 distributed
cache. *How do I call the file path of thefile.dat in the distributed
   cache
as a string?* I am using Hadoop 0.20.2
   
   
On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu
   wrote:
   
 Hi all,

 Does anybody have examples of how one moves files from the local
 filestructure/HDFS to the distributed cache in MapReduce? A Google
   search
 turned up examples in Pig but not MR.

 --
 Roger Chen
 UC Davis Genome Center

   
   
   
--
Roger Chen
UC Davis Genome Center
   
  
 
 
 
  --
  Roger Chen
  UC Davis Genome Center
 




 --
 Roger Chen
 UC Davis Genome Center
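
For the archives, a rough sketch of the missing piece discussed above: the Configuration
passed to getLocalCacheFiles() is the JobConf handed to configure() in the old API (or
context.getConfiguration() in the new one), not the private field inherited from
Configured. File names below are hypothetical:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CacheExample {

  public static class CacheMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private Path cachedFile;

    public void configure(JobConf job) {
      try {
        // Local (tasktracker-side) paths of everything added to the distributed cache.
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        if (cached != null && cached.length > 0) {
          cachedFile = cached[0];   // e.g. open with java.io and load a lookup table here
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> out,
        Reporter reporter) throws IOException {
      out.collect(new Text(String.valueOf(cachedFile)), value);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheExample.class);
    // Driver side: register the HDFS file before submitting the job.
    DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), conf);
    conf.setMapperClass(CacheMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}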



Re: Replication and failure

2011-07-28 Thread Mohit Anchlia
On Thu, Jul 28, 2011 at 12:17 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 I believe Tom's book (Hadoop: The Definitive Guide) covers this
 precisely well. Perhaps others too.

 Replication is a best-effort sort of thing. If 2 nodes are all that is
 available, then two replicas are written and one is left to the
 replica monitor service to replicate later as possible (leading to an
 underreplicated write for the moment). The scenario (with default
 configs) would only fail if you have 0 DataNodes 'available' to write
 to.

Thanks Harsh. I think you answered my question. I thought that
replication of 3 is a must, and for that you really need at least 4
nodes so that if one of the nodes dies it can still write to 3 nodes. I
am assuming writes to replica nodes are always synchronous and not
eventually consistent.

 Or are you asking about what happens when a DN fails during a write operation?

I am assuming there will be some errors in this case.


 On Thu, Jul 28, 2011 at 5:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Just trying to understand what happens if there are 3 nodes with
 replication set to 3 and one node fails. Does it fail the writes too?

 If there is a link that I can look at will be great. I tried searching
 but didn't see any definitive answer.

 Thanks,
 Mohit




 --
 Harsh J



Replication and failure

2011-07-27 Thread Mohit Anchlia
Just trying to understand what happens if there are 3 nodes with
replication set to 3 and one node fails. Does it fail the writes too?

If there is a link that I can look at will be great. I tried searching
but didn't see any definitive answer.

Thanks,
Mohit


Re: No. of Map and reduce tasks

2011-05-31 Thread Mohit Anchlia
What if I had multiple files in input directory, hadoop should then
fire parallel map jobs?


On Thu, May 26, 2011 at 7:21 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 If you give really low size files, then the use of Big Block Size of Hadoop
 goes away.
 Instead try merging files.

 Hope that helps



 
 From: James Seigel ja...@tynt.com
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 6:04:07 PM
 Subject: Re: No. of Map and reduce tasks

 Set input split size really low,  you might get something.

 I'd rather you fire up some nix commands and pack together that file
 onto itself a bunch if times and the put it back into hdfs and let 'er
 rip

 Sent from my mobile. Please excuse the typos.

 On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I think I understand that by last 2 replies :)  But my question is can
 I change this configuration to say split file into 250K so that
 multiple mappers can be invoked?

 On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute accross multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in
 wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.
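
A hedged footnote for anyone hitting the same question: the number of maps follows from
the input splits, and the split size can be pushed below the block size so even one small
file yields several maps. As far as I remember, the old API treats mapred.map.tasks as a
hint that shrinks the goal split size, while the new-API FileInputFormat honours
mapred.max.split.size; whether and how Pig exposes these depends on the Pig version, so
treat the snippet as an assumption:

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeHint {
  // Hypothetical settings that coax more map tasks out of a single small input file.
  public static JobConf configure(JobConf conf) {
    conf.setNumMapTasks(8);                       // old-API hint: goal size = totalSize / 8
    conf.set("mapred.max.split.size", "262144");  // new-API FileInputFormat: cap splits at 256 KB
    return conf;
  }
}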






Using own InputSplit

2011-05-27 Thread Mohit Anchlia
I am new to hadoop and from what I understand by default hadoop splits
the input into blocks. Now this might result in splitting a line of
record into 2 pieces and getting spread accross 2 maps. For eg: Line
abcd might get split into ab and cd. How can one prevent this in
hadoop and pig? I am looking for some examples where I can see how I
can specify my own split so that it logically splits based on the
record delimiter and not the block size. For some reason I am not able
to get right examples online.


Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
thanks! Just thought it's better to post to multiple groups together
since I didn't know where it belongs :)

On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Please do not cross-post a question to multiple lists unless you're
 announcing something.

 What you describe, does not happen; and the way the splitting is done
 for Text files is explained in good detail here:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Hope this solves your doubt :)

 On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I am new to hadoop and from what I understand by default hadoop splits
 the input into blocks. Now this might result in splitting a line of
 record into 2 pieces and getting spread accross 2 maps. For eg: Line
 abcd might get split into ab and cd. How can one prevent this in
 hadoop and pig? I am looking for some examples where I can see how I
 can specify my own split so that it logically splits based on the
 record delimiter and not the block size. For some reason I am not able
 to get right examples online.




 --
 Harsh J



Re: Using own InputSplit

2011-05-27 Thread Mohit Anchlia
Actually this link confused me

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input

Clearly, logical splits based on input-size is insufficient for many
applications since record boundaries must be respected. In such cases,
the application should implement a RecordReader, who is responsible
for respecting record-boundaries and presents a record-oriented view
of the logical InputSplit to the individual task.

But it looks like the application doesn't need to do that since it's done
by default? Or am I misinterpreting this entirely?
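
As far as I understand it, that passage only applies to record formats the
stock readers don't cover: TextInputFormat's LineRecordReader already
respects line boundaries on its own. A minimal sketch follows, with an
illustrative class name, a made-up delimiter character, and a delimiter
config key that only newer Hadoop releases honour, so treat those as
assumptions to verify against your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Illustrative driver, not from the thread.
    public class RecordBoundaryDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // TextInputFormat's LineRecordReader keeps whole lines together by
        // itself: every reader (except the one for the first split) skips the
        // partial line at the start of its split and reads past the end of its
        // split to finish the last line, so a line "abcd" is never delivered
        // as "ab" to one map and "cd" to another.
        //
        // Only records that are not newline-delimited need custom handling.
        // Newer Hadoop releases let you change the delimiter with the key
        // below (check your version before relying on it); otherwise you
        // would write your own InputFormat/RecordReader pair.
        conf.set("textinputformat.record.delimiter", "\u0001"); // hypothetical delimiter
        Job job = new Job(conf, "record-boundary-demo");
        job.setInputFormatClass(TextInputFormat.class);
        // ...remaining job setup (paths, mapper, etc.) as usual...
      }
    }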

On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 thanks! Just thought it's better to post to multiple groups together
 since I didn't know where it belongs :)

 On Fri, May 27, 2011 at 10:04 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Please do not cross-post a question to multiple lists unless you're
 announcing something.

 What you describe, does not happen; and the way the splitting is done
 for Text files is explained in good detail here:
 http://wiki.apache.org/hadoop/HadoopMapReduce

 Hope this solves your doubt :)

 On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I am new to hadoop, and from what I understand, by default hadoop splits
 the input into blocks. Now this might result in a record's line being
 split into 2 pieces and spread across 2 maps. For example, the line
 abcd might get split into ab and cd. How can one prevent this in
 hadoop and pig? I am looking for some examples where I can see how I
 can specify my own split so that it logically splits based on the
 record delimiter and not the block size. For some reason I am not able
 to find the right examples online.




 --
 Harsh J




How to copy over using dfs

2011-05-27 Thread Mohit Anchlia
If I have to overwrite a file I generally use

hadoop dfs -rm file
hadoop dfs -copyFromLocal or -put file

Is there a command to overwrite/replace the file instead of doing rm first?
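
One way to skip the separate rm, sketched against the Java FileSystem API
(the class name and argument handling are only illustrative); newer fs shells
also accept a -f flag on -put/-copyFromLocal, though whether a given release
does is worth checking:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative utility, not from the thread.
    public class OverwritePut {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path local = new Path(args[0]);   // local source file
        Path remote = new Path(args[1]);  // destination path in HDFS
        // copyFromLocalFile(delSrc, overwrite, src, dst):
        // overwrite=true replaces an existing destination in one call, so no
        // prior "dfs -rm" is needed; delSrc=false keeps the local copy.
        fs.copyFromLocalFile(false, true, local, remote);
        fs.close();
      }
    }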


Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I sent this to the pig apache user mailing list but have got no response.
Not sure if that list is still active.

Thought I would post here in case someone is able to help me.

I am in the process of installing and learning pig. I have a hadoop
cluster, and when I try to run pig in mapreduce mode it errors out:

Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

Error before Pig is launched

ERROR 2999: Unexpected internal error. Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
   at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
   at org.apache.pig.PigServer.init(PigServer.java:226)
   at org.apache.pig.PigServer.init(PigServer.java:215)
   at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
   at org.apache.pig.Main.run(Main.java:452)
   at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
failed on local exception: java.io.EOFException
   at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
   at org.apache.hadoop.ipc.Client.call(Client.java:743)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
   at $Proxy0.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
   at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
   at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
   at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
   ... 9 more
Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)


Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
For some reason I don't see that reply from Jonathan in my Inbox. I'll
try to google it.

What should be my next step in that case? I can't use pig then?

On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
 I think Jonathan Coveney's reply on user@pig answered your question.
 It's basically an issue of hadoop version differences between the one
 the Pig 0.8.1 release got bundled with vs. the Hadoop 0.20.203 release,
 which is newer.

 On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I sent this to the pig apache user mailing list but have got no response.
 Not sure if that list is still active.

 Thought I would post here in case someone is able to help me.

 I am in the process of installing and learning pig. I have a hadoop
 cluster, and when I try to run pig in mapreduce mode it errors out:

 Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

 Error before Pig is launched
 
 ERROR 2999: Unexpected internal error. Failed to create DataStorage

 java.lang.RuntimeException: Failed to create DataStorage
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
       at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
       at org.apache.pig.PigServer.init(PigServer.java:226)
       at org.apache.pig.PigServer.init(PigServer.java:215)
       at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
       at org.apache.pig.Main.run(Main.java:452)
       at org.apache.pig.Main.main(Main.java:107)
 Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
 failed on local exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy0.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
       at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
       at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
       at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
       ... 9 more
 Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(DataInputStream.java:375)
       at 
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)




 --
 Harsh J



Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote:
 I'll repost it here then :)

 Here is what I had to do to get pig running with a different version of
 Hadoop (in my case, the Cloudera build, but I'd try this as well):


 Build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when
 you run pig, put the pig-withouthadoop.jar on your classpath as well as your
 hadoop jar. In my case, I found that scripts only worked if I additionally
 registered the antlr jar manually:

Thanks Jonathan! I will give it a shot.


 register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;

Is this a Windows command? Sorry, I have not used this before.


 2011/5/26 Mohit Anchlia mohitanch...@gmail.com

 For some reason I don't see that reply from Jonathan in my Inbox. I'll
 try to google it.

 What should be my next step in that case? I can't use pig then?

 On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
  I think Jonathan Coveney's reply on user@pig answered your question.
   It's basically an issue of hadoop version differences between the one
   the Pig 0.8.1 release got bundled with vs. the Hadoop 0.20.203 release,
   which is newer.
 
  On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I sent this to the pig apache user mailing list but have got no response.
  Not sure if that list is still active.

  Thought I would post here in case someone is able to help me.

  I am in the process of installing and learning pig. I have a hadoop
  cluster, and when I try to run pig in mapreduce mode it errors out:
 
  Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1
 
  Error before Pig is launched
  
  ERROR 2999: Unexpected internal error. Failed to create DataStorage
 
  java.lang.RuntimeException: Failed to create DataStorage
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.init(PigServer.java:226)
        at org.apache.pig.PigServer.init(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
  Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
  failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
        at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
  Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 
 
 
 
  --
  Harsh J
 



