Protecting NN JT UI with password

2011-10-03 Thread Shahnawaz Saifi
Hi,

I would like to know how to password-protect the Hadoop web UIs running on
ports 50030 and 50070, as well as the HBase HMaster UI on port 60010.

--
Thanks,
Shah
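A common approach is to keep these ports firewalled and put an authenticating
reverse proxy in front of them. If you want to do it inside Hadoop itself,
later versions let you plug a servlet Filter into the embedded Jetty server
through the hadoop.http.filter.initializers property (check whether your
version supports it). As a rough sketch, the filter itself is plain servlet
code; the class name and hard-coded credentials below are illustrative only:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.codec.binary.Base64;

// Hypothetical HTTP Basic auth filter; wire it in via your version's
// filter-initializer mechanism, or front the UI with a proxy instead.
public class BasicAuthFilter implements Filter {
  private static final String EXPECTED =
      "Basic " + new String(Base64.encodeBase64("admin:secret".getBytes()));

  public void init(FilterConfig conf) {}
  public void destroy() {}

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpReq = (HttpServletRequest) req;
    HttpServletResponse httpRes = (HttpServletResponse) res;
    if (EXPECTED.equals(httpReq.getHeader("Authorization"))) {
      chain.doFilter(req, res);   // credentials match, let the request through
    } else {
      httpRes.setHeader("WWW-Authenticate", "Basic realm=\"hadoop\"");
      httpRes.sendError(HttpServletResponse.SC_UNAUTHORIZED);
    }
  }
}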


Re: incremental loads into hadoop

2011-10-03 Thread Mohit Anchlia
Managing this process looks like more pain in the long term. Would it be
easier to store the data in HBase, which uses a smaller block size?

What's the avg. file size?

On Sun, Oct 2, 2011 at 7:34 PM, Vitthal Suhas Gogate
gog...@hortonworks.com wrote:
 Agree with Bejoy, although to minimize processing latency you can still
 choose to write to HDFS more frequently, which results in a larger number of
 smaller files rather than waiting to accumulate a large batch before writing.
 Since you may then end up with many small files, it is a good idea to use
 CombineFileInputFormat so that you do not get a large number of very small
 map tasks (one per file when files are smaller than a block). After you have
 processed the input, you probably do not want to leave that large number of
 small files on HDFS, so you can use the Hadoop Archive (HAR) tool to combine
 and store them as a small number of bigger files. You can run this tool
 periodically in the background to archive input that has already been
 processed; the archive tool itself is implemented as an M/R job.

 Also, to get some level of atomicity, you can copy the data to a temporary
 location on HDFS before moving it to the final partition (or directory), as
 sketched below. Existing data loading tools may already do that.
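A minimal sketch of that temp-location-then-rename pattern with the FileSystem
API (the paths and class name below are illustrative, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Stage the new batch under a temporary directory first.
    Path tmp = new Path("/data/incoming/_tmp/batch-2011-10-03-17");
    Path finalDir = new Path("/data/incoming/2011/10/03/17");
    fs.copyFromLocalFile(new Path("/local/spool/batch-2011-10-03-17"), tmp);

    // rename() is a single namespace operation on the NameNode, so readers
    // see either the whole batch or none of it.
    fs.mkdirs(finalDir.getParent());
    if (!fs.rename(tmp, finalDir)) {
      throw new java.io.IOException("rename failed: " + tmp + " -> " + finalDir);
    }
  }
}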

 --Suhas Gogate


 On Sun, Oct 2, 2011 at 11:12 AM, bejoy.had...@gmail.com wrote:

 Sam
      Your understanding is right; Hadoop definitely works great with large
 volumes of data, but not every file has to be in the giga-, tera- or petabyte
 range. When people say Hadoop processes terabytes of data, they usually mean
 the total data processed by a MapReduce job (or jobs; most use cases chain
 more than one MapReduce job). That total can easily be made up of 10K files.
 Why not a large number of small files? The overhead on the NameNode of
 keeping all of that metadata (file-to-block information) in memory would be
 huge, and there are definite limits to it. But you can pack smaller files
 together in splittable compressed formats (see the sketch below). In general
 it is better to keep your file sizes at least as large as your HDFS block
 size. By default that is 64 MB, but larger clusters use higher values in
 multiples of 64. If your HDFS block size or your file sizes are smaller than
 the MapReduce input split size, it is better to use input formats like
 CombineFileInputFormat for your MR jobs. Usually the MR input split size is
 equal to your HDFS block size. In short, as a best practice, a single file
 should be at least one HDFS block in size.
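As a rough illustration of packing many small files into one splittable,
compressed container, here is a minimal sketch that writes local files into a
single block-compressed SequenceFile keyed by file name (class and path names
are illustrative assumptions):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);              // e.g. /data/packed/batch-0001.seq

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);   // SequenceFiles stay splittable even when compressed
    try {
      for (int i = 1; i < args.length; i++) {  // remaining args: local files to pack
        File f = new File(args[i]);
        byte[] bytes = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(bytes);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}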

 As for keeping a file open for a long time to write and reading it in
 parallel with a MapReduce job, I fear that won't work; AFAIK it doesn't.
 While a write is going on, some blocks, or the file itself, would be locked
 (I am not really sure whether the whole file is locked or not). In short,
 some blocks wouldn't be available to a concurrent MapReduce program during
 its processing.
       In your case, a quick solution that comes to mind is to keep writing
 your real-time data into a Flume queue/buffer. Set it to a desired size; once
 the queue gets full, the data is dumped into HDFS. Then you can kick off your
 jobs as your requirements dictate. If you are running MR jobs at a very high
 frequency, make sure that every run has enough data to process, and choose
 your maximum numbers of mappers and reducers effectively and efficiently.
   Finally, I don't think that in normal cases you need to dump your large
 volume of data to the local file system and then do a copyFromLocal into
 HDFS. Tools like Flume are built for exactly that purpose, I believe. I'm not
 an expert on Flume, though, so you may need to do more reading before
 implementing it.

 This is what I think about your use case, but let's leave it open for the
 experts to comment.

 Hope it helps.
 Regards
 Bejoy K S

 -Original Message-
 From: Sam Seigal selek...@yahoo.com
 Sender: saurabh@gmail.com
 Date: Sat, 1 Oct 2011 15:50:46
 To: common-user@hadoop.apache.org
 Reply-To: common-user@hadoop.apache.org
 Subject: Re: incremental loads into hadoop

 Hi Bejoy,

 Thanks for the response.

 While reading about Hadoop, I have come across threads where people
 claim that Hadoop is not a good fit for a large number of small files;
 it is good for files that are gigabytes/petabytes in size.

 If I am doing incremental loads, let's say every hour, do I need to
 wait until, say, the end of the day when enough data has been
 collected to start off a MapReduce job? I am wondering if an open
 file that is continuously being written to can at the same time be
 used as an input to an M/R job ...

 Also, let's say I did not want to do a load straight off the DB. The
 service, when committing a transaction to the OLTP system, sends a
 message for that transaction to  a Hadoop Service that then writes the
 transaction into HDFS  (the services are connected to each other via a
 persisted queue, hence are eventually consistent, but that is 

Fw: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Raj V
Sending it to the hadoop mailing list - I think this is a hadoop related 
problem and not related to Cloudera distribution.

Raj


- Forwarded Message -
From: Raj V rajv...@yahoo.com
To: CDH Users cdh-u...@cloudera.org
Sent: Friday, September 30, 2011 5:21 PM
Subject: pointing mapred.local.dir to a ramdisk


Hi all


I have been trying some experiments to improve performance. One of the 
experiments involved pointing mapred.local.dir to a RAM disk. To this end I 
created a 128MB RAM disk (each of my map outputs is smaller than this), but I 
have not been able to get the task tracker to start.


I am running CDH3B3 (hadoop-0.20.2+737) and here is the error message from the 
task tracker log.


Tasktracker logs


2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added global 
filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port returned 
by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening 
the listener on 50060
2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer: 
listener.getLocalPort() returned 50060 
webServer.getConnectors()[0].getLocalPort() returned 50060
2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty bound to 
port 50060
2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
2011-09-30 16:50:02,388 INFO org.mortbay.log: Started 
SelectChannelConnector@0.0.0.0:50060
2011-09-30 16:50:02,400 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
tasktracker with owner as mapred
2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.lang.NullPointerException
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
        at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
        at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
        at 
org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
        at 
org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
        at 
org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
        at org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)


2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker: 
SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5


and here is my mapred-site.xml file


<property>
    <name>mapred.local.dir</name>
    <value>/ramdisk1</value>
  </property>


If I have a regular directory on a regular drive such as below - it works. If 
I don't mount the ramdisk - it works.


<property>
    <name>mapred.local.dir</name>
    <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
  </property>





The NullPointerException does not tell me what the error is or how to fix it.


From the logs it looks like some disk-based operation failed, but I can't guess 
which. I must also confess that this is the first time I am using an ext2 file system.


Any ideas?




Raj









Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Eric Caspole
Are you sure you have chown'd/chmod'd the ramdisk directory to be  
writeable by your hadoop user? I have played with this in the past  
and it should basically work.



On Oct 3, 2011, at 10:37 AM, Raj V wrote:

Sending it to the hadoop mailing list - I think this is a hadoop  
related problem and not related to Cloudera distribution.


Raj


- Forwarded Message -

From: Raj V rajv...@yahoo.com
To: CDH Users cdh-u...@cloudera.org
Sent: Friday, September 30, 2011 5:21 PM
Subject: pointing mapred.local.dir to a ramdisk


Hi all


I have been trying some experiments to improve performance. One of  
the experiments involved pointing mapred.local.dir to a RAM disk.  
To this end I created a 128MB RAM disk ( each of my map outputs  
are smaller than this) but I have not been able to get the task  
tracker to start.



I am running CDH3B3 ( hadoop-0.20.2+737) and here the error  
message from the task tracker log.



Tasktracker logs


2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to  
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via  
org.mortbay.log.Slf4jLog
2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer:  
Added global filtersafety (class=org.apache.hadoop.http.HttpServer 
$QuotingInputFilter)
2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer:  
Port returned by webServer.getConnectors()[0].getLocalPort()  
before open() is -1. Opening the listener on 50060
2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:  
listener.getLocalPort() returned 50060 webServer.getConnectors() 
[0].getLocalPort() returned 50060
2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer:  
Jetty bound to port 50060

2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
2011-09-30 16:50:02,388 INFO org.mortbay.log: Started  
SelectChannelConnector@0.0.0.0:50060
2011-09-30 16:50:02,400 INFO  
org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs'  
truncater with mapRetainSize=-1 and reduceRetainSize=-1
2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:  
Starting tasktracker with owner as mapred
2011-09-30 16:50:02,493 ERROR  
org.apache.hadoop.mapred.TaskTracker: Can not start task tracker  
because java.lang.NullPointerException

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
at org.apache.hadoop.fs.RawLocalFileSystem.rename 
(RawLocalFileSystem.java:253)
at org.apache.hadoop.fs.ChecksumFileSystem.rename 
(ChecksumFileSystem.java:404)
at  
org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath 
(MRAsyncDiskService.java:255)
at  
org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes 
(MRAsyncDiskService.java:311)
at org.apache.hadoop.mapred.TaskTracker.initialize 
(TaskTracker.java:618)
at org.apache.hadoop.mapred.TaskTracker.init 
(TaskTracker.java:1351)
at org.apache.hadoop.mapred.TaskTracker.main 
(TaskTracker.java:3504)



2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker:  
SHUTDOWN_MSG:

/
SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5


and here is my mapred-site.xml file


<property>
<name>mapred.local.dir</name>
<value>/ramdisk1</value>
  </property>


If I have a regular directory on a regular drive such as below -  
it works. If I don't mount the ramdisk - it works.



<property>
<name>mapred.local.dir</name>
<value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
  </property>





The NullPointerException does not tell me what the error is or how  
to fix it.



From the logs it looks like some disk based operation failed. I  
can't guess I must also confess that this is the first time I am  
using an ext2 file system.



Any ideas?




Raj












Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Raj V
Eric

Yes. The owner is hdfs and the group is hadoop, and the directory is group 
writable (775). This is the exact same configuration I have when I use real 
disks. But let me give it a try again to see if I overlooked something.
Thanks

Raj


From: Eric Caspole eric.casp...@amd.com
To: common-user@hadoop.apache.org
Sent: Monday, October 3, 2011 8:44 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

Are you sure you have chown'd/chmod'd the ramdisk directory to be writeable by 
your hadoop user? I have played with this in the past and it should basically 
work.


On Oct 3, 2011, at 10:37 AM, Raj V wrote:

 Sending it to the hadoop mailing list - I think this is a hadoop related 
 problem and not related to Cloudera distribution.
 
 Raj
 
 
 - Forwarded Message -
 From: Raj V rajv...@yahoo.com
 To: CDH Users cdh-u...@cloudera.org
 Sent: Friday, September 30, 2011 5:21 PM
 Subject: pointing mapred.local.dir to a ramdisk
 
 
 Hi all
 
 
 I have been trying some experiments to improve performance. One of the 
 experiments involved pointing mapred.local.dir to a RAM disk. To this end I 
 created a 128MB RAM disk ( each of my map outputs are smaller than this) 
 but I have not been able to get the task tracker to start.
 
 
 I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message from 
 the task tracker log.
 
 
 Tasktracker logs
 
 
 2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to 
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via 
 org.mortbay.log.Slf4jLog
 2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added 
 global filtersafety 
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
 2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port 
 returned by webServer.getConnectors()[0].getLocalPort() before open() is 
 -1. Opening the listener on 50060
 2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer: 
 listener.getLocalPort() returned 50060 
 webServer.getConnectors()[0].getLocalPort() returned 50060
 2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty bound 
 to port 50060
 2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
 2011-09-30 16:50:02,388 INFO org.mortbay.log: Started 
 SelectChannelConnector@0.0.0.0:50060
 2011-09-30 16:50:02,400 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
 Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
 2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
 tasktracker with owner as mapred
 2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
 start task tracker because java.lang.NullPointerException
         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
         at 
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
         at 
org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
         at 
org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
         at 
org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
         at 
org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
         at 
org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
         at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
 
 
 2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker: 
 SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
 
 
 and here is my mapred-site.xml file
 
 
 <property>
     <name>mapred.local.dir</name>
     <value>/ramdisk1</value>
   </property>
 
 
 If I have a regular directory on a regular drive such as below - it works. 
 If I don't mount the ramdisk - it works.
 
 
 <property>
     <name>mapred.local.dir</name>
     <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
   </property>
 
 
 
 
 
 The NullPointerException does not tell me what the error is or how to fix 
 it.
 
 
 From the logs it looks like some disk based operation failed. I can't guess 
 I must also confess that this is the first time I am using an ext2 file 
 system.
 
 
 Any ideas?
 
 
 
 
 Raj
 
 
 
 
 
 
 






Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Vinod Kumar Vavilapalli
Must be related to some kind of permissions problem.

It will help if you can paste the corresponding source code for
FileUtil.copy(); it is hard to track down otherwise with the different versions around.

Thanks,
+Vinod


On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

 Eric

 Yes. The owner is hdfs and group is hadoop and the directory is group
 writable(775).  This is tehe exact same configuration I have when I use real
 disks.But let me give it a try again to see if I overlooked something.
 Thanks

 Raj

 
 From: Eric Caspole eric.casp...@amd.com
 To: common-user@hadoop.apache.org
 Sent: Monday, October 3, 2011 8:44 AM
 Subject: Re: pointing mapred.local.dir to a ramdisk
 
 Are you sure you have chown'd/chmod'd the ramdisk directory to be
 writeable by your hadoop user? I have played with this in the past and it
 should basically work.
 
 
 On Oct 3, 2011, at 10:37 AM, Raj V wrote:
 
  Sending it to the hadoop mailing list - I think this is a hadoop related
 problem and not related to Cloudera distribution.
 
  Raj
 
 
  - Forwarded Message -
  From: Raj V rajv...@yahoo.com
  To: CDH Users cdh-u...@cloudera.org
  Sent: Friday, September 30, 2011 5:21 PM
  Subject: pointing mapred.local.dir to a ramdisk
 
 
  Hi all
 
 
  I have been trying some experiments to improve performance. One of the
 experiments involved pointing mapred.local.dir to a RAM disk. To this end I
 created a 128MB RAM disk ( each of my map outputs are smaller than this) but
 I have not been able to get the task tracker to start.
 
 
  I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
 from the task tracker log.
 
 
  Tasktracker logs
 
 
  2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
  2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
 global filtersafety
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
  2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open() is -1.
 Opening the listener on 50060
  2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50060
 webServer.getConnectors()[0].getLocalPort() returned 50060
  2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
 bound to port 50060
  2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
  2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
 SelectChannelConnector@0.0.0.0:50060
  2011-09-30 16:50:02,400 INFO
 org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
 with mapRetainSize=-1 and reduceRetainSize=-1
  2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
 Starting tasktracker with owner as mapred
  2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker: Can
 not start task tracker because java.lang.NullPointerException
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
  at
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
  at
 org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
  at
 org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
  at
 org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
  at
 org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
  at
 org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
  at
 org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
 
 
  2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker:
 SHUTDOWN_MSG:
  /
  SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
 
 
  and here is my mapred-site.xml file
 
 
  <property>
  <name>mapred.local.dir</name>
  <value>/ramdisk1</value>
</property>
 
 
  If I have a regular directory on a regular drive such as below - it
 works. If I don't mount the ramdisk - it works.
 
 
  <property>
  <name>mapred.local.dir</name>
  <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
</property>
 
 
 
 
 
  The NullPointerException does not tell me what the error is or how to
 fix it.
 
 
  From the logs it looks like some disk based operation failed. I can't
 guess I must also confess that this is the first time I am using an ext2
 file system.
 
 
  Any ideas?
 
 
 
 
  Raj
 
 
 
 
 
 
 
 
 
 
 
 



Re: Help - can't start namenode after disk full error

2011-10-03 Thread Shouguo Li
hi, Ryan

i'm trying to recover from a disk full error on the namenode as well. i can fire 
up the namenode after printf "\xff\xff\xff\xee\xff" > /var/name/current/edits,
but now it's stuck in safe mode verifying blocks for hours... is there a way to 
check progress on that?
or is there a way to speed that verification up?
thx
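For what it's worth: hadoop dfsadmin -safemode get reports whether the namenode
is still in safe mode, and the namenode web UI on port 50070 shows how many
blocks have been reported versus the threshold. Once you are confident the
datanodes have reported in, hadoop dfsadmin -safemode leave forces it out, but
be careful doing that after edit-log surgery like the printf above, since any
missing blocks will then surface as read errors. A small sketch of the same
check done programmatically, assuming the 0.20-era DistributedFileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class SafeModeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    // SAFEMODE_GET only queries the current state; it does not change it.
    boolean inSafeMode = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
    System.out.println("NameNode in safe mode: " + inSafeMode);
  }
}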

Re: incremental loads into hadoop

2011-10-03 Thread in.abdul
There are two approaches for handling the OLTP stream:

   1. HStreaming or Scribe; these are the usual methods.
   2. Otherwise, use Chukwa to stage the data, so that once you have a decent
   volume you can move it to HDFS.

Thanks and Regards,
S SYED ABDUL KATHER
9731841519


On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] 
ml-node+s472066n3383949...@n3.nabble.com wrote:

 Hi,

 I am relatively new to Hadoop and was wondering how to do incremental
 loads into HDFS.

 I have a continuous stream of data flowing into a service which is
 writing to an OLTP store. Due to the high volume of data, we cannot do
 aggregations on the OLTP store, since this starts affecting the write
 performance.

 We would like to offload this processing into a Hadoop cluster, mainly
 for doing aggregations/analytics.

 The question is how can this continuous stream of data be
 incrementally loaded and processed into Hadoop ?

 Thank you,

 Sam


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://lucene.472066.n3.nabble.com/incremental-loads-into-hadoop-tp3383949p3383949.html




-
THANKS AND REGARDS,
SYED ABDUL KATHER
--
View this message in context: 
http://lucene.472066.n3.nabble.com/incremental-loads-into-hadoop-tp3383949p3385689.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

lazy-loading of Reduce's input

2011-10-03 Thread Sami Dalouche
Hi,

My understanding is that when the reduce() method is called, the values
(Iterable<VALUEIN> values) are stored in memory.

1/ Is that actually true ?
2/ If this is true, is there a way to lazy-load the inputs to use less
memory ? (e.g. load all the items by batches of 20, and discard the
previously fetched ones)
The only related option that I could find is mapreduce.reduce.input.limit,
but it doesn't do what I need.

The problem I am trying to solve is that my input values are huge objects
(serialized lucene indices using a custom Writable implementation), and
loading them all at once seems to require way too much memory.

Thank You,
Sami Dalouche
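For context: in stock Hadoop the reduce-side values are not all materialized up
front. The Iterable streams values out of the merged, sorted map output and
reuses a single value object per iteration, so memory usage stays at roughly
one value at a time unless the reducer itself caches them (each individual
value still has to fit in memory once deserialized, though). A minimal reducer
sketch that touches one value at a time; the key/value types here are
illustrative, not the custom Writable from the question:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingReducer extends Reducer<Text, Text, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    long total = 0;
    for (Text v : values) {
      // Each iteration deserializes the next value from the merged map output.
      // Do not stash 'v' in a collection -- the framework reuses the instance,
      // and holding references is what blows up memory.
      total += v.getLength();
    }
    context.write(key, new LongWritable(total));
  }
}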


Re: incremental loads into hadoop

2011-10-03 Thread Sam Seigal
I have given HBase a fair amount of thought, and I am looking for
input. Instead of managing incremental loads myself, why not just
set up an HBase cluster? What are some of the trade-offs?
My primary use for this cluster would still be data
analysis/aggregation and not so much random access. Random access
would be a nice-to-have in case there are problems
and we want to examine the data ad hoc.


On Sat, Oct 1, 2011 at 12:31 PM, in.abdul in.ab...@gmail.com wrote:
 There is two method is there for processing OLTP

   1.  Hstremming or scibe  these are only methodes
   2. if not use chukuwa for storing the data so that when i you got a
   tesent volume then you can move to HDFS

            Thanks and Regards,
        S SYED ABDUL KATHER
                9731841519


 On Sat, Oct 1, 2011 at 4:32 AM, Sam Seigal [via Lucene] 
 ml-node+s472066n3383949...@n3.nabble.com wrote:

 Hi,

 I am relatively new to Hadoop and was wondering how to do incremental
 loads into HDFS.

 I have a continuous stream of data flowing into a service which is
 writing to an OLTP store. Due to the high volume of data, we cannot do
 aggregations on the OLTP store, since this starts affecting the write
 performance.

 We would like to offload this processing into a Hadoop cluster, mainly
 for doing aggregations/analytics.

 The question is how can this continuous stream of data be
 incrementally loaded and processed into Hadoop ?

 Thank you,

 Sam


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://lucene.472066.n3.nabble.com/incremental-loads-into-hadoop-tp3383949p3383949.html




 -
 THANKS AND REGARDS,
 SYED ABDUL KATHER
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/incremental-loads-into-hadoop-tp3383949p3385689.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Edward Capriolo
This directory can get very large; in many cases I doubt it would fit on a
RAM disk.

Also, RAM disks tend to help most with random reads/writes; since Hadoop does
mostly linear I/O, you may not see a great benefit from the RAM disk.



On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 Must be related to some kind of permissions problems.

 It will help if you can paste the corresponding source code for
 FileUtil.copy(). Hard to track it with different versions, so.

 Thanks,
 +Vinod


 On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

  Eric
 
  Yes. The owner is hdfs and group is hadoop and the directory is group
  writable(775).  This is tehe exact same configuration I have when I use
 real
  disks.But let me give it a try again to see if I overlooked something.
  Thanks
 
  Raj
 
  
  From: Eric Caspole eric.casp...@amd.com
  To: common-user@hadoop.apache.org
  Sent: Monday, October 3, 2011 8:44 AM
  Subject: Re: pointing mapred.local.dir to a ramdisk
  
  Are you sure you have chown'd/chmod'd the ramdisk directory to be
  writeable by your hadoop user? I have played with this in the past and it
  should basically work.
  
  
  On Oct 3, 2011, at 10:37 AM, Raj V wrote:
  
   Sending it to the hadoop mailing list - I think this is a hadoop
 related
  problem and not related to Cloudera distribution.
  
   Raj
  
  
   - Forwarded Message -
   From: Raj V rajv...@yahoo.com
   To: CDH Users cdh-u...@cloudera.org
   Sent: Friday, September 30, 2011 5:21 PM
   Subject: pointing mapred.local.dir to a ramdisk
  
  
   Hi all
  
  
   I have been trying some experiments to improve performance. One of
 the
  experiments involved pointing mapred.local.dir to a RAM disk. To this end
 I
  created a 128MB RAM disk ( each of my map outputs are smaller than this)
 but
  I have not been able to get the task tracker to start.
  
  
   I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
  from the task tracker log.
  
  
   Tasktracker logs
  
  
   2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
  org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
  org.mortbay.log.Slf4jLog
   2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
  global filtersafety
  (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
   2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
  returned by webServer.getConnectors()[0].getLocalPort() before open() is
 -1.
  Opening the listener on 50060
   2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
  listener.getLocalPort() returned 50060
  webServer.getConnectors()[0].getLocalPort() returned 50060
   2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
  bound to port 50060
   2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
   2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
  SelectChannelConnector@0.0.0.0:50060
   2011-09-30 16:50:02,400 INFO
  org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
  with mapRetainSize=-1 and reduceRetainSize=-1
   2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
  Starting tasktracker with owner as mapred
   2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker:
 Can
  not start task tracker because java.lang.NullPointerException
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
   at
 
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
   at
 
 org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
   at
 
 org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
   at
 
 org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
   at
  org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
   at
  org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
   at
  org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
  
  
   2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker:
  SHUTDOWN_MSG:
   /
   SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
  
  
   and here is my mapred-site.xml file
  
  
    <property>
    <name>mapred.local.dir</name>
    <value>/ramdisk1</value>
  </property>
  
  
   If I have a regular directory on a regular drive such as below - it
  works. If I don't mount the ramdisk - it works.
  
  
    <property>
    <name>mapred.local.dir</name>
    <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
  </property>
  
  
  
  
  
   The NullPointerException does not tell me what the error is or how to
  fix it.
  
  
   From the logs it looks like some disk based operation failed. I can't
  guess I must also 

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Raj V
Vinod

Carefully checked everything again. The permissions are 775 and the owner is 
hdfs:hadoop.  The task tracker creates a directory called toBeDeleted under 
/ramdisk, so things do not seem to be permission related.  The task tracker 
starts happily if I don't mount the ramdisk and leave everything else the same.



Raj




From: Vinod Kumar Vavilapalli vino...@hortonworks.com
To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com
Sent: Monday, October 3, 2011 9:07 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

Must be related to some kind of permissions problems.

It will help if you can paste the corresponding source code for
FileUtil.copy(). Hard to track it with different versions, so.

Thanks,
+Vinod


On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

 Eric

 Yes. The owner is hdfs and group is hadoop and the directory is group
 writable(775).  This is tehe exact same configuration I have when I use real
 disks.But let me give it a try again to see if I overlooked something.
 Thanks

 Raj

 
 From: Eric Caspole eric.casp...@amd.com
 To: common-user@hadoop.apache.org
 Sent: Monday, October 3, 2011 8:44 AM
 Subject: Re: pointing mapred.local.dir to a ramdisk
 
 Are you sure you have chown'd/chmod'd the ramdisk directory to be
 writeable by your hadoop user? I have played with this in the past and it
 should basically work.
 
 
 On Oct 3, 2011, at 10:37 AM, Raj V wrote:
 
  Sending it to the hadoop mailing list - I think this is a hadoop related
 problem and not related to Cloudera distribution.
 
  Raj
 
 
  - Forwarded Message -
  From: Raj V rajv...@yahoo.com
  To: CDH Users cdh-u...@cloudera.org
  Sent: Friday, September 30, 2011 5:21 PM
  Subject: pointing mapred.local.dir to a ramdisk
 
 
  Hi all
 
 
  I have been trying some experiments to improve performance. One of the
 experiments involved pointing mapred.local.dir to a RAM disk. To this end I
 created a 128MB RAM disk ( each of my map outputs are smaller than this) but
 I have not been able to get the task tracker to start.
 
 
  I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
 from the task tracker log.
 
 
  Tasktracker logs
 
 
  2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
  2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
 global filtersafety
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
  2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open() is -1.
 Opening the listener on 50060
  2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50060
 webServer.getConnectors()[0].getLocalPort() returned 50060
  2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
 bound to port 50060
  2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
  2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
 SelectChannelConnector@0.0.0.0:50060
  2011-09-30 16:50:02,400 INFO
 org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
 with mapRetainSize=-1 and reduceRetainSize=-1
  2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
 Starting tasktracker with owner as mapred
  2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker: Can
 not start task tracker because java.lang.NullPointerException
          at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
          at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
          at
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
          at
 org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
          at
 org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
          at
 org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
          at
 org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
          at
 org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
          at
 org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
 
 
  2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker:
 SHUTDOWN_MSG:
  /
  SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
 
 
  and here is my mapred-site.xml file
 
 
  <property>
      <name>mapred.local.dir</name>
      <value>/ramdisk1</value>
    </property>
 
 
  If I have a regular directory on a regular drive such as below - it
 works. If I don't mount the ramdisk - it works.
 
 
  <property>
      <name>mapred.local.dir</name>
      <value>/hadoop-dsk0/local,/hadoop-dsk1/local</value>
    </property>
 
 
 
 
 
  The NullPointerException does not tell me what the error is or how to
 fix it.
 
 
  

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Raj V
Edward

I understand the size limitations - but for my experiment the ramdisk size I 
have created is large enough. 
I think there will be substantial benefits from putting the intermediate map 
outputs on a ramdisk - size permitting, of course - but I can't provide any 
numbers to substantiate my claim given that I can't get it to run.

-best regards

Raj




From: Edward Capriolo edlinuxg...@gmail.com
To: common-user@hadoop.apache.org
Cc: Raj V rajv...@yahoo.com
Sent: Monday, October 3, 2011 10:36 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

This directory can get very large, in many cases I doubt it would fit on a
ram disk.

Also RAM Disks tend to help most with random read/write, since hadoop is
doing mostly linear IO you may not see a great benefit from the RAM disk.



On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 Must be related to some kind of permissions problems.

 It will help if you can paste the corresponding source code for
 FileUtil.copy(). Hard to track it with different versions, so.

 Thanks,
 +Vinod


 On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

  Eric
 
  Yes. The owner is hdfs and group is hadoop and the directory is group
  writable(775).  This is tehe exact same configuration I have when I use
 real
  disks.But let me give it a try again to see if I overlooked something.
  Thanks
 
  Raj
 
  
  From: Eric Caspole eric.casp...@amd.com
  To: common-user@hadoop.apache.org
  Sent: Monday, October 3, 2011 8:44 AM
  Subject: Re: pointing mapred.local.dir to a ramdisk
  
  Are you sure you have chown'd/chmod'd the ramdisk directory to be
  writeable by your hadoop user? I have played with this in the past and it
  should basically work.
  
  
  On Oct 3, 2011, at 10:37 AM, Raj V wrote:
  
   Sending it to the hadoop mailing list - I think this is a hadoop
 related
  problem and not related to Cloudera distribution.
  
   Raj
  
  
   - Forwarded Message -
   From: Raj V rajv...@yahoo.com
   To: CDH Users cdh-u...@cloudera.org
   Sent: Friday, September 30, 2011 5:21 PM
   Subject: pointing mapred.local.dir to a ramdisk
  
  
   Hi all
  
  
   I have been trying some experiments to improve performance. One of
 the
  experiments involved pointing mapred.local.dir to a RAM disk. To this end
 I
  created a 128MB RAM disk ( each of my map outputs are smaller than this)
 but
  I have not been able to get the task tracker to start.
  
  
   I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
  from the task tracker log.
  
  
   Tasktracker logs
  
  
   2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
  org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
  org.mortbay.log.Slf4jLog
   2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
  global filtersafety
  (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
   2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
  returned by webServer.getConnectors()[0].getLocalPort() before open() is
 -1.
  Opening the listener on 50060
   2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
  listener.getLocalPort() returned 50060
  webServer.getConnectors()[0].getLocalPort() returned 50060
   2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
  bound to port 50060
   2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
   2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
  SelectChannelConnector@0.0.0.0:50060
   2011-09-30 16:50:02,400 INFO
  org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
  with mapRetainSize=-1 and reduceRetainSize=-1
   2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
  Starting tasktracker with owner as mapred
   2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker:
 Can
  not start task tracker because java.lang.NullPointerException
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
           at
 
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
           at
 
 org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
           at
 
 org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
           at
 
 org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
           at
  org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:618)
           at
  org.apache.hadoop.mapred.TaskTracker.init(TaskTracker.java:1351)
           at
  org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3504)
  
  
   2011-09-30 16:50:02,497 INFO org.apache.hadoop.mapred.TaskTracker:
  SHUTDOWN_MSG:
   /
   SHUTDOWN_MSG: Shutting down TaskTracker at HADOOP52-4/10.52.1.5
 

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Joey Echeverria
Raj,

I just tried this on my CDH3u1 VM, and the ramdisk worked the first
time. So it's possible you've hit a bug in CDH3b3 that was later
fixed. Can you enable debug logging in log4j.properties and then
repost your task tracker log? I think it will print more details
that will be helpful.

-Joey

On Mon, Oct 3, 2011 at 2:18 PM, Raj V rajv...@yahoo.com wrote:
 Edward

 I understand the size limitations - but for my experiment the ramdisk size I 
 have created is large enough.
 I think there will be substantial benefits by putting the intermediate map 
 outputs on a ramdisk - size permitting, ofcourse, but I can't provide any 
 numbers to substantiate my claim  given that I can't get it to run.

 -best regards

 Raj




From: Edward Capriolo edlinuxg...@gmail.com
To: common-user@hadoop.apache.org
Cc: Raj V rajv...@yahoo.com
Sent: Monday, October 3, 2011 10:36 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

This directory can get very large, in many cases I doubt it would fit on a
ram disk.

Also RAM Disks tend to help most with random read/write, since hadoop is
doing mostly linear IO you may not see a great benefit from the RAM disk.



On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 Must be related to some kind of permissions problems.

 It will help if you can paste the corresponding source code for
 FileUtil.copy(). Hard to track it with different versions, so.

 Thanks,
 +Vinod


 On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

  Eric
 
  Yes. The owner is hdfs and group is hadoop and the directory is group
  writable(775).  This is tehe exact same configuration I have when I use
 real
  disks.But let me give it a try again to see if I overlooked something.
  Thanks
 
  Raj
 
  
  From: Eric Caspole eric.casp...@amd.com
  To: common-user@hadoop.apache.org
  Sent: Monday, October 3, 2011 8:44 AM
  Subject: Re: pointing mapred.local.dir to a ramdisk
  
  Are you sure you have chown'd/chmod'd the ramdisk directory to be
  writeable by your hadoop user? I have played with this in the past and it
  should basically work.
  
  
  On Oct 3, 2011, at 10:37 AM, Raj V wrote:
  
   Sending it to the hadoop mailing list - I think this is a hadoop
 related
  problem and not related to Cloudera distribution.
  
   Raj
  
  
   - Forwarded Message -
   From: Raj V rajv...@yahoo.com
   To: CDH Users cdh-u...@cloudera.org
   Sent: Friday, September 30, 2011 5:21 PM
   Subject: pointing mapred.local.dir to a ramdisk
  
  
   Hi all
  
  
   I have been trying some experiments to improve performance. One of
 the
  experiments involved pointing mapred.local.dir to a RAM disk. To this end
 I
  created a 128MB RAM disk ( each of my map outputs are smaller than this)
 but
  I have not been able to get the task tracker to start.
  
  
   I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
  from the task tracker log.
  
  
   Tasktracker logs
  
  
   2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
  org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
  org.mortbay.log.Slf4jLog
   2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
  global filtersafety
  (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
   2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
  returned by webServer.getConnectors()[0].getLocalPort() before open() is
 -1.
  Opening the listener on 50060
   2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
  listener.getLocalPort() returned 50060
  webServer.getConnectors()[0].getLocalPort() returned 50060
   2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
  bound to port 50060
   2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
   2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
  SelectChannelConnector@0.0.0.0:50060
   2011-09-30 16:50:02,400 INFO
  org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
  with mapRetainSize=-1 and reduceRetainSize=-1
   2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
  Starting tasktracker with owner as mapred
   2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker:
 Can
  not start task tracker because java.lang.NullPointerException
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
           at
 
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
           at
 
 org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:404)
           at
 
 org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:255)
           at
 
 org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:311)
           at
  

Re: lazy-loading of Reduce's input

2011-10-03 Thread Sami Dalouche
Just to make sure I was clear enough: is there a parameter that sets the size
of the batch of elements retrieved into memory while the reduce task iterates
over the input values?

Thanks,
Sami dalouche

On Mon, Oct 3, 2011 at 1:42 PM, Sami Dalouche sa...@hopper.com wrote:

 Hi,

 My understanding is that when the reduce() method is called, the values
 (Iterable<VALUEIN> values) are stored in memory.

 1/ Is that actually true ?
 2/ If this is true, is there a way to lazy-load the inputs to use less
 memory ? (e.g. load all the items by batches of 20, and discard the
 previously fetched ones)
 The only related option that I could find is mapreduce.reduce.input.limit,
 but it doesn't do what I need.

 The problem I am trying to solve is that my input values are huge objects
 (serialized lucene indices using a custom Writable implementation), and
 loading them all at once seems to require way too much memory.

 Thank You,
 Sami Dalouche



Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Raj V
Joey

Thanks. I will try to upgrade to a newer version and check. I will also change 
the logs to debug and see if more information is available.

Raj




From: Joey Echeverria j...@cloudera.com
To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com
Sent: Monday, October 3, 2011 11:49 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

Raj,

I just tried this on my CHD3u1 VM, and the ramdisk worked the first
time. So, it's possible you've hit a bug in CDH3b3 that was later
fixed. Can you enable debug logging in log4j.properties and then
repost your task tracker log? I think there might be more details that
it will print that will be helpful.

-Joey

On Mon, Oct 3, 2011 at 2:18 PM, Raj V rajv...@yahoo.com wrote:
 Edward

 I understand the size limitations - but for my experiment the ramdisk size I 
 have created is large enough.
 I think there will be substantial benefits by putting the intermediate map 
 outputs on a ramdisk - size permitting, ofcourse, but I can't provide any 
 numbers to substantiate my claim  given that I can't get it to run.

 -best regards

 Raj




From: Edward Capriolo edlinuxg...@gmail.com
To: common-user@hadoop.apache.org
Cc: Raj V rajv...@yahoo.com
Sent: Monday, October 3, 2011 10:36 AM
Subject: Re: pointing mapred.local.dir to a ramdisk

This directory can get very large, in many cases I doubt it would fit on a
ram disk.

Also RAM Disks tend to help most with random read/write, since hadoop is
doing mostly linear IO you may not see a great benefit from the RAM disk.



On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli 
vino...@hortonworks.com wrote:

 Must be related to some kind of permissions problems.

 It will help if you can paste the corresponding source code for
 FileUtil.copy(). Hard to track it with different versions, so.

 Thanks,
 +Vinod


 On Mon, Oct 3, 2011 at 9:28 PM, Raj V rajv...@yahoo.com wrote:

  Eric
 
  Yes. The owner is hdfs and group is hadoop and the directory is group
  writable(775).  This is tehe exact same configuration I have when I use
 real
  disks.But let me give it a try again to see if I overlooked something.
  Thanks
 
  Raj
 
  
  From: Eric Caspole eric.casp...@amd.com
  To: common-user@hadoop.apache.org
  Sent: Monday, October 3, 2011 8:44 AM
  Subject: Re: pointing mapred.local.dir to a ramdisk
  
  Are you sure you have chown'd/chmod'd the ramdisk directory to be
  writeable by your hadoop user? I have played with this in the past and it
  should basically work.
  
  
  On Oct 3, 2011, at 10:37 AM, Raj V wrote:
  
   Sending it to the hadoop mailing list - I think this is a hadoop
 related
  problem and not related to Cloudera distribution.
  
   Raj
  
  
   - Forwarded Message -
   From: Raj V rajv...@yahoo.com
   To: CDH Users cdh-u...@cloudera.org
   Sent: Friday, September 30, 2011 5:21 PM
   Subject: pointing mapred.local.dir to a ramdisk
  
  
   Hi all
  
  
   I have been trying some experiments to improve performance. One of
 the
  experiments involved pointing mapred.local.dir to a RAM disk. To this end
 I
  created a 128MB RAM disk ( each of my map outputs are smaller than this)
 but
  I have not been able to get the task tracker to start.
  
  
   I am running CDH3B3 ( hadoop-0.20.2+737) and here the error message
  from the task tracker log.
  
  
   Tasktracker logs
  
  
   2011-09-30 16:50:00,689 INFO org.mortbay.log: Logging to
  org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
  org.mortbay.log.Slf4jLog
   2011-09-30 16:50:00,930 INFO org.apache.hadoop.http.HttpServer: Added
  global filtersafety
  (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
   2011-09-30 16:50:01,000 INFO org.apache.hadoop.http.HttpServer: Port
  returned by webServer.getConnectors()[0].getLocalPort() before open() is
 -1.
  Opening the listener on 50060
   2011-09-30 16:50:01,023 INFO org.apache.hadoop.http.HttpServer:
  listener.getLocalPort() returned 50060
  webServer.getConnectors()[0].getLocalPort() returned 50060
   2011-09-30 16:50:01,024 INFO org.apache.hadoop.http.HttpServer: Jetty
  bound to port 50060
   2011-09-30 16:50:01,024 INFO org.mortbay.log: jetty-6.1.14
   2011-09-30 16:50:02,388 INFO org.mortbay.log: Started
  SelectChannelConnector@0.0.0.0:50060
   2011-09-30 16:50:02,400 INFO
  org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater
  with mapRetainSize=-1 and reduceRetainSize=-1
   2011-09-30 16:50:02,422 INFO org.apache.hadoop.mapred.TaskTracker:
  Starting tasktracker with owner as mapred
   2011-09-30 16:50:02,493 ERROR org.apache.hadoop.mapred.TaskTracker:
 Can
  not start task tracker because java.lang.NullPointerException
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:213)
           at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
           at
 
 org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
           at
 
 

Monitoring Slow job.

2011-10-03 Thread patrick sang
Hi Hadoopers,

I am writing a script to detect whether any running job has been running
longer than X hours.

So far, I use
./hadoop job -jt jobtracker:port -list all | awk '{ if($2==1) print $1 }'
to get the list of running job IDs.

I am looking for a way to get how long a job has been running from its jobId.

In the web admin page (http://job:50030), we can easily see the duration of
each jobId pretty easily from the *Started at:* and *Running for:* fields of
each running job.

How do we get such information from the command line?

Hope this makes sense.
Thank you in advance,

-P
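A rough programmatic alternative (an untested sketch; the JobTracker address is
taken from the mapred configuration on the classpath, and the class name is
illustrative) is to ask the JobTracker for incomplete jobs via JobClient and
compute the age from JobStatus.getStartTime():

import java.util.Date;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

public class LongRunningJobs {
  public static void main(String[] args) throws Exception {
    long maxMillis = Long.parseLong(args[0]) * 3600L * 1000L;   // X hours
    JobClient client = new JobClient(new JobConf());
    for (JobStatus status : client.jobsToComplete()) {          // running/pending jobs
      long age = System.currentTimeMillis() - status.getStartTime();
      if (status.getRunState() == JobStatus.RUNNING && age > maxMillis) {
        System.out.println(status.getJobID() + " started "
            + new Date(status.getStartTime())
            + ", running for " + (age / 60000) + " min");
      }
    }
  }
}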


Re: error for deploying hadoop on macbook pro

2011-10-03 Thread Jignesh Patel
Harsh thanks,
It worked out; now I am able to start the cluster. But when I tried to see the 
jobConf history at

http://localhost:50030/jobconf_history.jsp

I got the following message.
Missing 'logFile' for fetching job configuration!


On Sep 30, 2011, at 5:41 PM, Harsh J wrote:

 Since you're only just beginning, and have unknowingly issued multiple
 namenode -format commands, simply run the following and restart DN
 alone:
 
 $ rm -r /private/tmp/hadoop-hadoop-user/dfs/data
 
 (And please do not reformat namenode, lest you go out of namespace ID
 sync yet again -- You can instead `hadoop dfs -rmr /*` to rid yourself
 of all HDFS files)
 
 On Sat, Oct 1, 2011 at 2:13 AM, Jignesh Patel jign...@websoft.com wrote:
 Now I am able to make task tracker and job tracker running but I still have 
 following problem with datanode.
 
 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
 Incompatible namespaceIDs in /private/tmp/hadoop-hadoop-user/dfs/data: 
 namenode namespaceID = 798142055; datanode namespaceID = 964022125
 
 
 On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:
 
 
 
 
 
 
 
I am trying to set up a single-node cluster using hadoop-0.20.204.0, and while 
setting it up I found that my job tracker and task tracker are not starting. I am 
attaching the exception. I also don't know why, while formatting the name 
node, my IP address still doesn't show as 127.0.0.1, as follows.
 
 1/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
 STARTUP_MSG:   args = [-format]
 STARTUP_MSG:   version = 0.20.204.0
 STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch 
 branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141; 
 compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
 
 hadoop-hadoop-user-tasktracker-Jignesh-MacBookPro.local.out
 hadoop-hadoop-user-jobtracker-Jignesh-MacBookPro.local.log
 
 
 
 
 
 
 -- 
 Harsh J



Re: Monitoring Slow job.

2011-10-03 Thread Vitthal Suhas Gogate
I am not sure there is an easy way to get what you want on the command line. One
option is to use the following command, which gives you a verbose job history
where you can find the Submitted, Launched and Finished times (including the
duration on the Finished At line). I am using the hadoop-0.20.205.0 branch, so
check whether the version of Hadoop you are using has some such option...

I am pasting sample output for my wordcount program:

bin/hadoop job -history job_output_directory_on_hdfs

==

horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
Warning: $HADOOP_HOME is deprecated.


Hadoop job: 0001_1317688277686_vgogate
=
Job tracker host name: job
job tracker start time: Sun May 16 08:53:51 PDT 1976
User: vgogate
JobName: word count
JobConf:
hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
Submitted At: 3-Oct-2011 17:31:17
Launched At: 3-Oct-2011 17:31:17 (0sec)
Finished At: 3-Oct-2011 17:31:50 (32sec)
Status: SUCCESS
Counters:

|Group Name|Counter name|Map Value|Reduce Value|Total Value|
---
|Job Counters|Launched reduce tasks|0|0|1|
|Job Counters|SLOTS_MILLIS_MAPS|0|0|12,257|
|Job Counters|Total time spent by all reduces waiting after reserving slots (ms)|0|0|0|
|Job Counters|Total time spent by all maps waiting after reserving slots (ms)|0|0|0|
|Job Counters|Launched map tasks|0|0|1|
|Job Counters|Data-local map tasks|0|0|1|
|Job Counters|SLOTS_MILLIS_REDUCES|0|0|10,082|
|File Output Format Counters|Bytes Written|0|61,192|61,192|
|FileSystemCounters|FILE_BYTES_READ|0|70,766|70,766|
|FileSystemCounters|HDFS_BYTES_READ|112,056|0|112,056|
|FileSystemCounters|FILE_BYTES_WRITTEN|92,325|92,294|184,619|
|FileSystemCounters|HDFS_BYTES_WRITTEN|0|61,192|61,192|
|File Input Format Counters|Bytes Read|111,933|0|111,933|
|Map-Reduce Framework|Reduce input groups|0|2,411|2,411|
|Map-Reduce Framework|Map output materialized bytes|70,766|0|70,766|
|Map-Reduce Framework|Combine output records|2,411|0|2,411|
|Map-Reduce Framework|Map input records|2,643|0|2,643|
|Map-Reduce Framework|Reduce shuffle bytes|0|0|0|
|Map-Reduce Framework|Reduce output records|0|2,411|2,411|
|Map-Reduce Framework|Spilled Records|2,411|2,411|4,822|
|Map-Reduce Framework|Map output bytes|120,995|0|120,995|
|Map-Reduce Framework|Combine input records|5,849|0|5,849|
|Map-Reduce Framework|Map output records|5,849|0|5,849|
|Map-Reduce Framework|SPLIT_RAW_BYTES|123|0|123|
|Map-Reduce Framework|Reduce input records|0|2,411|2,411|
=

Task Summary

Kind     Total  Successful  Failed  Killed  StartTime            FinishTime
Setup    1      1           0       0       3-Oct-2011 17:31:20  3-Oct-2011 17:31:24 (4sec)
Map      1      1           0       0       3-Oct-2011 17:31:26  3-Oct-2011 17:31:30 (4sec)
Reduce   1      1           0       0       3-Oct-2011 17:31:32  3-Oct-2011 17:31:42 (10sec)
Cleanup  1      1           0       0       3-Oct-2011 17:31:44  3-Oct-2011 17:31:48 (4sec)



Analysis
=

Time taken by best performing map task task_201110031726_0001_m_00: 4sec
Average time taken by map tasks: 4sec
Worse performing map tasks:
TaskId                            Timetaken
task_201110031726_0001_m_00 4sec
The last map task task_201110031726_0001_m_00 finished at (relative to
the Job launch time): 3-Oct-2011 17:31:30 (12sec)

Time taken by best performing shuffle task task_201110031726_0001_r_00:
7sec
Average time taken by shuffle tasks: 7sec
Worse performing shuffle tasks:
TaskId                            Timetaken
task_201110031726_0001_r_00 7sec
The last shuffle task task_201110031726_0001_r_00 finished at (relative
to the Job launch time): 3-Oct-2011 17:31:39 (21sec)

Time taken by best performing reduce task task_201110031726_0001_r_00:
2sec
Average time taken by reduce tasks: 2sec
Worse performing reduce tasks:
TaskId                            Timetaken
task_201110031726_0001_r_00 2sec
The last reduce task task_201110031726_0001_r_00 finished at (relative
to the Job launch time): 3-Oct-2011 17:31:42 

Re: error for deploying hadoop on macbook pro

2011-10-03 Thread Vitthal Suhas Gogate
The steps in the following document worked for me, with two exceptions:

-- JAVA_HOME needs to be set correctly in conf/hadoop-env.sh.
-- By default on Mac OS X, sshd is not running, so you need to start it via
System Preferences > Sharing and add the users who are allowed to ssh in.

http://www.stanford.edu/class/cs246/cs246-11-mmds/hw_files/hadoop_install.pdf

--Suhas

On Mon, Oct 3, 2011 at 2:21 PM, Jignesh Patel jign...@websoft.com wrote:

 Thanks, Harsh.
 It worked out; now I am able to start the cluster, but when I tried to view
 the job configuration history at

 http://localhost:50030/jobconf_history.jsp

 I got the following message.
 Missing 'logFile' for fetching job configuration!


 On Sep 30, 2011, at 5:41 PM, Harsh J wrote:

  Since you're only just beginning, and have unknowingly issued multiple
  namenode -format commands, simply run the following and restart DN
  alone:
 
  $ rm -r /private/tmp/hadoop-hadoop-user/dfs/data
 
  (And please do not reformat namenode, lest you go out of namespace ID
  sync yet again -- You can instead `hadoop dfs -rmr /*` to rid yourself
  of all HDFS files)
 
  On Sat, Oct 1, 2011 at 2:13 AM, Jignesh Patel jign...@websoft.com
 wrote:
  Now I am able to make task tracker and job tracker running but I still
 have following problem with datanode.
 
  ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
 java.io.IOException: Incompatible namespaceIDs in
 /private/tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID = 798142055;
 datanode namespaceID = 964022125
 
 
  On Sep 30, 2011, at 3:59 PM, Jignesh Patel wrote:
 
 
 
 
 
 
 
  I am trying to setup single node cluster using hadoop-0.20.204.0 and
 while setting I found my job tracker and task tracker are not starting. I am
 attaching the exception. I also don't know why my while formatting name node
 my IP address still doesn't show 127.0.0.1 as follows.
 
  1/09/30 15:50:36 INFO namenode.NameNode: STARTUP_MSG:
  /
  STARTUP_MSG: Starting NameNode
  STARTUP_MSG:   host = Jignesh-MacBookPro.local/192.168.1.120
  STARTUP_MSG:   args = [-format]
  STARTUP_MSG:   version = 0.20.204.0
  STARTUP_MSG:   build = git://hrt8n35.cc1.ygridcore.net/ on branch
 branch-0.20-security-204 -r 65e258bf0813ac2b15bb4c954660eaf9e8fba141;
 compiled by 'hortonow' on Thu Aug 25 23:35:31 UTC 2011
 
  hadoop-hadoop-user-tasktracker-Jignesh-MacBookPro.local.out
  hadoop-hadoop-user-jobtracker-Jignesh-MacBookPro.local.log
 
 
 
 
 
 
  --
  Harsh J




Re: error for deploying hadoop on macbook pro

2011-10-03 Thread Vitthal Suhas Gogate
Sorry, a few more things:

-- localhost did not work for me; I had to use the machine name returned by
`hostname`, e.g. horton-mac.local.
-- Also change localhost to your machine name in the conf/slaves and
conf/masters files.

--Suhas




Adjusting column value size.

2011-10-03 Thread edward choi
Hi,

I have a question regarding performance and column value size.
I need to store several million integers per row (the "several million" part is
important here).
I was wondering which of the following methods would be more beneficial
performance-wise.

1) Store each integer in its own column, so that when a row is read, several
million columns are read along with it. The user would then map each column
value into some kind of container (e.g. a vector or ArrayList).
2) Store, for example, a thousand integers in a single column (by concatenating
them), so that when a row is read, only several thousand columns are read along
with it. The user would then have to split each column value into 4-byte chunks
and map the resulting integers into some kind of container (e.g. a vector or
ArrayList).

I am curious which approach would be better: 1) reads several million columns
but needs no additional processing, while 2) reads only several thousand columns
but needs an extra unpacking step.
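
For what it's worth, here is a rough, untested sketch of what I mean by option
2), assuming the standard HBase client API (HTable, Put, Get, Bytes); the "d"
column family and the "blk:<offset>" qualifier scheme are made up purely for
illustration:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PackedColumnsSketch {
  private static final byte[] FAMILY = Bytes.toBytes("d");
  private static final int INTS_PER_CELL = 1000;

  // Write all values for one row, packing 1000 ints per cell instead of one int per column.
  public static void write(HTable table, byte[] rowKey, int[] values) throws Exception {
    Put put = new Put(rowKey);
    for (int start = 0; start < values.length; start += INTS_PER_CELL) {
      int end = Math.min(start + INTS_PER_CELL, values.length);
      byte[] packed = new byte[(end - start) * Bytes.SIZEOF_INT];
      for (int i = start; i < end; i++) {
        System.arraycopy(Bytes.toBytes(values[i]), 0,
            packed, (i - start) * Bytes.SIZEOF_INT, Bytes.SIZEOF_INT);
      }
      // One qualifier per block of 1000 ints, e.g. "blk:0", "blk:1000", ...
      put.add(FAMILY, Bytes.toBytes("blk:" + start), packed);
    }
    table.put(put);
  }

  // Read one packed cell back and split it into ints again (the extra unpacking step).
  public static int[] readBlock(HTable table, byte[] rowKey, int start) throws Exception {
    byte[] qualifier = Bytes.toBytes("blk:" + start);
    Result result = table.get(new Get(rowKey).addColumn(FAMILY, qualifier));
    byte[] packed = result.getValue(FAMILY, qualifier);
    int[] out = new int[packed.length / Bytes.SIZEOF_INT];
    for (int i = 0; i < out.length; i++) {
      out[i] = Bytes.toInt(packed, i * Bytes.SIZEOF_INT);
    }
    return out;
  }
}
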
Any advice would be appreciated.

Ed