Re: Seattle / PNW Hadoop + Lucene User Group?

2009-06-03 Thread Bhupesh Bansal
Great, Bradford,

Can you post some videos if you have any?

Best
Bhupesh



On 6/3/09 11:58 AM, Bradford Stephens bradfordsteph...@gmail.com wrote:

 Hey everyone!
 I just wanted to give a BIG THANKS to everyone who came. We had over a
 dozen people, and a few got lost at UW :)  [I would have sent this update
 earlier, but I flew to Florida the day after the meeting].
 
 If you didn't come, you missed quite a bit of learning, on topics such as:
 
 -Building a Social Media Analysis company on the Apache Cloud Stack
 -Cancer detection in images using Hadoop
 -Real-time OLAP
 -Scalable Lucene using Katta and Hadoop
 -Video and Network Flow
 -Custom Ranking in Lucene
 
 I'm going to update our wiki with the topics, a few of the questions raised,
 and the lessons we've learned.
 
 The next meetup will be June 24th. Be there, or be... boring :)
 
 Cheers,
 Bradford
 
 On Thu, Apr 16, 2009 at 3:27 PM, Bradford Stephens 
 bradfordsteph...@gmail.com wrote:
 
 Greetings,
 
 Would anybody be willing to join a PNW Hadoop and/or Lucene User Group
 with me in the Seattle area? I can donate some facilities, etc. -- I
 also always have topics to speak about :)
 
 Cheers,
 Bradford
 



Re: Randomize input file?

2009-05-21 Thread Bhupesh Bansal
Hmm,

IMHO, running a mapper-only job will give you an output file
with the same ordering. You should write a custom map-reduce job
where the map emits (key: a random integer, value: line)
and the reducer outputs (key: nothing, value: line).

The framework will sort on the random key before the reduce, giving you a
random ordering of your input file.

Best
Bhupesh
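
Something like this (a rough sketch against the 0.18-era
org.apache.hadoop.mapred API; class and job names are placeholders, not
anything from this thread):

  // Rough sketch: shuffle the lines of a text file by tagging each line
  // with a random key and letting the framework's sort do the shuffling.
  import java.io.IOException;
  import java.util.Iterator;
  import java.util.Random;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class ShuffleLines {

    public static class RandomKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      private final Random rand = new Random();
      public void map(LongWritable offset, Text line,
                      OutputCollector<IntWritable, Text> out, Reporter reporter)
          throws IOException {
        out.collect(new IntWritable(rand.nextInt()), line);  // random key per line
      }
    }

    public static class DropKeyReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, NullWritable, Text> {
      public void reduce(IntWritable key, Iterator<Text> lines,
                         OutputCollector<NullWritable, Text> out, Reporter reporter)
          throws IOException {
        while (lines.hasNext()) {
          out.collect(NullWritable.get(), lines.next());  // drop the key, keep the line
        }
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(ShuffleLines.class);
      conf.setJobName("shuffle-lines");
      conf.setMapperClass(RandomKeyMapper.class);
      conf.setReducerClass(DropKeyReducer.class);
      conf.setMapOutputKeyClass(IntWritable.class);
      conf.setMapOutputValueClass(Text.class);
      conf.setOutputKeyClass(NullWritable.class);
      conf.setOutputValueClass(Text.class);
      conf.setNumReduceTasks(1);  // a single reducer gives one shuffled output file
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

With one reducer you get a single shuffled output file; with more reducers
you would concatenate the part files afterwards.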


On 5/21/09 11:15 AM, Alex Loddengaard a...@cloudera.com wrote:

 Hi John,
 
 I don't know of a built-in way to do this.  Depending on how well you want
 to randomize, you could just run a MapReduce job with at least one map (the
 more maps, the more random) and no reduces.  When you run a job with no
 reduces, the shuffle phase is skipped entirely, and the intermediate outputs
 from the mappers are stored directly to HDFS.  Though I think each mapper
 will create one HDFS file, so you'll have to concatenate all files into a
 single file.
 
 The above isn't a very good way to randomize, but it's fairly easy to
 implement and should run pretty quickly.
 
 Hope this helps.
 
 Alex
 
 On Thu, May 21, 2009 at 7:18 AM, John Clarke clarke...@gmail.com wrote:
 
 Hi,
 
 I have a need to randomize my input file before processing. I understand I
 can chain Hadoop jobs together so the first could take the input file
 randomize it and then the second could take the randomized file and do the
 processing.
 
 The input file has one entry per line and I want to mix up the lines before
 the main processing.
 
 Is there an inbuilt ability I have missed or will I have to try and write a
 Hadoop program to shuffle my input file?
 
 Cheers,
 John
 



Re: Hadoop / MySQL

2009-04-29 Thread Bhupesh Bansal
Slightly off topic, as this is a non-MySQL solution:

We have the same problem: computing about 100G of data daily and serving it
online with minimal impact during the data refresh.

We are using our in-house clone of Amazon Dynamo, a key-value distributed
hash table store (Project-Voldemort), for the serving side. Project-Voldemort
supports a ReadOnlyStore which uses file-based data/index. The interesting
part is that we compute the new data/index on Hadoop and just hot-swap it on
the Voldemort nodes. Total swap time is roughly the scp/rsync time, with the
actual service-impact time being very minimal (closing and opening file
descriptors).

Thanks a lot for the info on this thread; it has been very interesting.

Best
Bhupesh


On 4/29/09 11:48 AM, Todd Lipcon t...@cloudera.com wrote:

 On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski spo...@gmail.com wrote:
 
 If you have trouble loading your data into mysql using INSERTs or LOAD
 DATA, consider that MySQL supports CSV directly using the CSV storage
 engine. The only thing you have to do is to copy your hadoop produced
 csv file into the mysql data directory and issue a flush tables
 command to have mysql flush its caches and pick up the new file. It's
 very simple and you have the full set of sql commands available just
 as with innodb or myisam. What you don't get with the csv engine are
 indexes and foreign keys. Can't have it all, can you?
 
 
 The CSV storage engine is definitely an interesting option, but it has a
 couple downsides:
 
 - Like you mentioned, you don't get indexes. This seems like a huge deal to
 me - the reason you want to load data into MySQL instead of just keeping it
 in Hadoop is so you can service real-time queries. Not having any indexing
 kind of defeats the purpose there. This is especially true since MySQL only
 supports nested-loop joins, and there's no way of attaching metadata to a
 CSV table to say "hey look, this table is already in sorted order so you
 can use a merge join."
 
 - Since CSV is a text based format, it's likely to be a lot less compact
 than a proper table. For example, a unix timestamp is likely to be ~10
 characters vs 4 bytes in a packed table.
 
 - I'm not aware of many people actually using CSV for anything except
 tutorials and training. Since it's not in heavy use by big mysql users, I
 wouldn't build a production system around it.
 
 Here's a wacky idea that I might be interested in hacking up if anyone's
 interested:
 
 What if there were a MyISAMTableOutputFormat in hadoop? You could use this
 as a reducer output and have it actually output .frm and .myd files onto
 HDFS, then simply hdfs -get them onto DB servers for realtime serving.
 Sounds like a fun hack I might be interested in if people would find it
 useful. Building the .myi indexes in Hadoop would be pretty killer as well,
 but potentially more difficult.
 
 -Todd



Lost TaskTracker Errors

2009-04-02 Thread Bhupesh Bansal
Hey Folks, 

For the last 2-3 days I have been seeing many of these errors popping up in
our hadoop cluster.

Task attempt_200904011612_0025_m_000120_0 failed to report status for 604
seconds. Killing

JobTracker logs don't have any more info, and the tasktracker logs are
clean.

The failures occurred with these symptoms:
1. Datanodes will start timing out
2. hdfs will get extremely slow (hdfs -ls will take like 2 mins vs 1s in
normal mode)

The datanode logs on failing tasktracker nodes are filled up with
2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(172.16.216.64:50010,
storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075,
ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to
172.16.216.62:50010 got java.net.SocketTimeoutException: 48 millis
timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689
remote=/172.16.216.62:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855)
        at java.lang.Thread.run(Thread.java:619)


We are running a 10 Node cluster (hadoop-0.18.1) on Dual Quad core boxes (8G
RAM) with these properties
1. mapred.child.java.opts = -Xmx600M
2. mapred.tasktracker.map.tasks.maximum = 8
3. mapred.tasktracker.reduce.tasks.maximum = 4
4. dfs.datanode.handler.count = 10
5. dfs.datanode.du.reserved = 10240
6. dfs.datanode.max.xcievers = 512

The map jobs write a ton of data for each record; will increasing
"dfs.datanode.handler.count" help in this case?  What other configuration
changes can I try?


Best
Bhupesh




RE: can't read the SequenceFile correctly

2009-02-06 Thread Bhupesh Bansal
Hey Tom, 

I also got burned by this. Why does BytesWritable.getBytes() return
non-valid bytes? Or should we add a BytesWritable.getValidBytes() kind of
function.


Best
Bhupesh 



-Original Message-
From: Tom White [mailto:t...@cloudera.com]
Sent: Fri 2/6/2009 2:25 AM
To: core-user@hadoop.apache.org
Subject: Re: can't read the SequenceFile correctly
 
Hi Mark,

Not all the bytes stored in a BytesWritable object are necessarily
valid. Use BytesWritable#getLength() to determine how much of the
buffer returned by BytesWritable#getBytes() to use.
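
For example, something like this (a minimal sketch, reusing the value/key/
position variables from the reading loop below; Arrays is java.util.Arrays):

    BytesWritable bw = (BytesWritable) value;
    // Only the first getLength() bytes of the backing buffer are valid.
    byte[] validBytes = Arrays.copyOf(bw.getBytes(), bw.getLength());
    System.out.printf("[%s]\t%s\t%d%n", position, key, bw.getLength());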

Tom

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi,

 I have written binary files to a SequenceFile, seemingly successfully, but
 when I read them back with the code below, after the first few reads I get
 the same number of bytes for the different files. What could go wrong?

 Thank you,
 Mark

  reader = new SequenceFile.Reader(fs, path, conf);
  Writable key = (Writable)
      ReflectionUtils.newInstance(reader.getKeyClass(), conf);
  Writable value = (Writable)
      ReflectionUtils.newInstance(reader.getValueClass(), conf);
  long position = reader.getPosition();
  while (reader.next(key, value)) {
      String syncSeen = reader.syncSeen() ? "*" : "";
      byte[] fileBytes = ((BytesWritable) value).getBytes();
      System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen,
          key, fileBytes.length);
      position = reader.getPosition(); // beginning of next record
  }




Re: job management in Hadoop

2009-01-30 Thread Bhupesh Bansal
Bill, 

Currently you can kill the job from the UI.
You have to enable the config in hadoop-default.xml:

  <name>webinterface.private.actions</name> set to true
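
For reference, the full property entry looks roughly like this (XML form,
same as any other entry in that file):

  <property>
    <name>webinterface.private.actions</name>
    <value>true</value>
  </property>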

Best
Bhupesh


On 1/30/09 3:23 PM, Bill Au bill.w...@gmail.com wrote:

 Thanks.
 
 Does anyone know if there is a plan to add this functionality to the web UI,
 like job priority, which can be changed from both the command line and the
 web UI?
 
 Bill
 
 On Fri, Jan 30, 2009 at 5:54 PM, Arun C Murthy a...@yahoo-inc.com wrote:
 
 
 On Jan 30, 2009, at 2:41 PM, Bill Au wrote:
 
  Is there any way to cancel a job after it has been submitted?
 
 
 bin/hadoop job -kill <jobid>
 
 Arun
 



Re: tasktracker startup Time

2008-11-18 Thread Bhupesh Bansal
Thanks Steve, 

I will try kill -QUIT and report back.

Best
Bhupesh


On 11/18/08 5:45 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 Bhupesh Bansal wrote:
 Hey folks, 
 
 I re-started my cluster after some node failures and saw a couple of
 tasktrackers not come up (they finally did after about 20 mins).
 In the logs below, note the jump between the two timestamps.
 
 I was just curious: what do we do while starting the tasktracker that
 should take so much time?
 
 
 
 2008-11-17 10:43:04,757 INFO org.mortbay.util.Container: Started
 [EMAIL PROTECTED]
 2008-11-17 11:12:38,373 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=TaskTracker, sessionId=
 2008-11-17 11:12:38,410 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=TaskTracker, port=47601
 
 
 Off the top of my head
 -DNS lookups can introduce delays if your network's DNS is wrong, but
 that shouldn't take so long
 -The task tracker depends on the job tracker and the filesystem being
 up. If the filesystem is recovering: no task trackers
 
 Next time, get the process ID (via a jps -v call), then do kill -QUIT on
 the process. This will print out to the process's console the stack
 trace of all its threads; this could help track down where it is hanging



tasktracker startup Time

2008-11-17 Thread Bhupesh Bansal
Hey folks, 

I re-started my cluster after some node failures and saw a couple of
tasktrackers not come up (they finally did after about 20 mins).
In the logs below, note the jump between the two timestamps.

I was just curious: what do we do while starting the tasktracker that
should take so much time?


Best
Bhupesh 


2008-11-17 10:43:04,094 INFO org.apache.hadoop.mapred.TaskTracker:
STARTUP_MSG: 
/
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = 
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.1
STARTUP_MSG:   build = ...
/
2008-11-17 10:43:04,292 INFO org.mortbay.util.Credential: Checking Resource
aliases
2008-11-17 10:43:04,400 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-11-17 10:43:04,401 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-11-17 10:43:04,401 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-11-17 10:43:04,713 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-11-17 10:43:04,753 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-11-17 10:43:04,757 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50060
2008-11-17 10:43:04,757 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-11-17 11:12:38,373 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=TaskTracker, sessionId=
2008-11-17 11:12:38,410 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=TaskTracker, port=47601
2008-11-17 11:12:38,487 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2008-11-17 11:12:38,488 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 47601: starting
2008-11-17 11:12:38,490 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 47601: starting
2008-11-17 11:12:38,491 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 47601: starting



Re: Mapper settings...

2008-11-06 Thread Bhupesh Bansal
In that case,

I will try to put up a patch for this if nobody else is working on it.

Best
Bhupesh
 





On 10/31/08 4:06 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 
 On Oct 31, 2008, at 3:15 PM, Bhupesh Bansal wrote:
 
 Why do we need these setters in JobConf ??
 
 jobConf.setMapOutputKeyClass(String.class);
 
 jobConf.setMapOutputValueClass(LongWritable.class);
 
 Just historical. The Mapper and Reducer interfaces didn't use to be
 generic. (Hadoop used to run on Java 1.4 too...)
 
 It would be nice to remove the need to call them. There is an old bug
 open to check for consistency HADOOP-1683. It would be even better to
 make the setting of both the map and reduce output types optional if
 they are specified by the template parameters.
 
 -- Owen
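
To illustrate the redundancy (a sketch, not code from this thread): the map
output types already appear as type parameters on the mapper, yet still have
to be repeated on the JobConf:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class TypeRedundancyExample {

    public static class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        out.collect(value, key);  // trivial body, just for illustration
      }
    }

    public static void main(String[] args) {
      JobConf conf = new JobConf(TypeRedundancyExample.class);
      conf.setMapperClass(MyMapper.class);
      // The same types already named as the 3rd and 4th type parameters above:
      conf.setMapOutputKeyClass(Text.class);
      conf.setMapOutputValueClass(LongWritable.class);
    }
  }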



Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Hey guys, 


We at LinkedIn are trying to run some large graph analysis problems on
Hadoop. The fastest way to run would be to keep a copy of the whole graph in
RAM at all mappers (graph size is about 8G in RAM); we have a cluster of
8-core machines with 8G RAM each.

What is the best way of doing that? Is there a way for multiple mappers on
the same machine to access a RAM cache? I read about the Hadoop distributed
cache; it looks like it copies the file (hdfs / http) locally onto the
slaves, but not necessarily into RAM.

Best
Bhupesh



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Minor correction: the graph size is about 6G, not 8G.


On 10/16/08 1:52 PM, Bhupesh Bansal [EMAIL PROTECTED] wrote:

 Hey guys, 
 
 
 We at Linkedin are trying to run some Large Graph Analysis problems on
 Hadoop. The fastest way to run would be to keep a copy of whole Graph in RAM
 at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-cores
 machine with 8G on each.
 
 Whats is the best way of doing that ?? Is there a way so that multiple
 mappers on same machine can access a RAM cache ??  I read about hadoop
 distributed cache looks like it's copies the file (hdfs / http) locally on
 the slaves but not necessrily in RAM ??
 
 Best
 Bhupesh
 



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Can you elaborate here?

Let's say I want to implement a DFS on my graph. I am not able to picture
implementing it by doing the graph in pieces without putting a depth bound
(3-4) on it. Let's say we have 200M (4GB) edges to start with.

Best
Bhupesh



On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 
 On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
 
 We at Linkedin are trying to run some Large Graph Analysis problems on
 Hadoop. The fastest way to run would be to keep a copy of whole
 Graph in RAM
 at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-
 cores
 machine with 8G on each.
 
 The best way to deal with it is *not* to load the entire graph in one
 process. In the WebMap at Yahoo, we have a graph of the web that has
 roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6
  To invert the links, you process the graph in pieces and resort
 based on the target. You'll get much better performance and scale to
 almost any size.
 
 Whats is the best way of doing that ?? Is there a way so that multiple
 mappers on same machine can access a RAM cache ??  I read about hadoop
 distributed cache looks like it's copies the file (hdfs / http)
 locally on
 the slaves but not necessrily in RAM ??
 
 You could mmap the file from distributed cache using MappedByteBuffer.
 Then there will be one copy between jvms...
 
 -- Owen
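
A minimal sketch of that mmap idea (assuming the cached graph file has
already been localized on the slave, with "graph.bin" as a placeholder path;
note that a single MappedByteBuffer is capped at 2GB, so a ~6G file would
need several mappings):

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class GraphMmap {
    public static MappedByteBuffer mapGraph(String localPath) throws IOException {
      RandomAccessFile raf = new RandomAccessFile(localPath, "r");
      FileChannel channel = raf.getChannel();
      // Read-only mapping: the pages live in the OS page cache, so multiple
      // task JVMs mapping the same local file share one physical copy.
      // For a file larger than 2GB, map it in several <= 2GB chunks.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0,
                         Math.min(channel.size(), Integer.MAX_VALUE));
    }
  }

  // e.g. MappedByteBuffer graph = GraphMmap.mapGraph("graph.bin");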



Re: Distributed cache Design

2008-10-16 Thread Bhupesh Bansal
Thanks Colin/ Owen

I will try some of the ideas here and report back.

Best
Bhupesh



On 10/16/08 4:05 PM, Colin Evans [EMAIL PROTECTED] wrote:

 The trick is to amortize your computation over the whole set.  So DFS
 for a single node will always be faster on an in-memory graph, but
 Hadoop is a good tool for computing all-pairs shortest paths in one shot
 if you re-frame the algorithm as a belief propagation and message
 passing algorithm.
 
 A lot of the time, the computation still explodes into n^2 or worse, so
 you need to use a binning or blocking algorithm, like the one described
 here:  http://www.youtube.com/watch?v=1ZDybXl212Q
 
 In the case of graphs, a blocking function would be to find overlapping
 strongly connected subgraphs where each subgraph fits in a reasonable
 amount of memory.  Then within each block, you do your computation and
 you pass a summary of that computation to adjacent blocks, which gets
 factored into the next computation.
 
 When we hooked up a Very Big Graph to our Hadoop cluster, we found that
 there were a lot of scaling problems, which went away when we started
 optimizing for streaming performance.
 
 -Colin
 
 
 
 Bhupesh Bansal wrote:
 Can you elaborate here ,
 
 Lets say I want to implement a DFS in my graph. I am not able to picturise
 implementing it with doing graph in pieces without putting a depth bound to
 (3-4). Lets say we have 200M (4GB) edges to start with
 
 Best
 Bhupesh
 
 
 
 On 10/16/08 3:01 PM, Owen O'Malley [EMAIL PROTECTED] wrote:
 
   
 On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
 
 
 We at Linkedin are trying to run some Large Graph Analysis problems on
 Hadoop. The fastest way to run would be to keep a copy of whole
 Graph in RAM
 at all mappers. (Graph size is about 8G in RAM) we have cluster of 8-
 cores
 machine with 8G on each.
   
 The best way to deal with it is *not* to load the entire graph in one
 process. In the WebMap at Yahoo, we have a graph of the web that has
 roughly 1 trillion links and 100 billion nodes. See
 http://tinyurl.com/4fgok6
   . To invert the links, you process the graph in pieces and resort
 based on the target. You'll get much better performance and scale to
 almost any size.
 
 
 Whats is the best way of doing that ?? Is there a way so that multiple
 mappers on same machine can access a RAM cache ??  I read about hadoop
 distributed cache looks like it's copies the file (hdfs / http)
 locally on
 the slaves but not necessrily in RAM ??
   
 You could mmap the file from distributed cache using MappedByteBuffer.
 Then there will be one copy between jvms...
 
 -- Owen
 
 
   
 



Re: Hadoop User Group (Bay Area) Oct 15th

2008-10-15 Thread Bhupesh Bansal
Hi,

I didn't RSVP for this event. I would like to join with 2 of my colleagues.
Please let us know if we can.

Best
Bhupesh




On 10/15/08 11:56 AM, Steve Gao [EMAIL PROTECTED] wrote:

 I am excited to see the slides. Would you send me a copy? Thanks.
 
 --- On Wed, 10/15/08, Nishant Khurana [EMAIL PROTECTED] wrote:
 From: Nishant Khurana [EMAIL PROTECTED]
 Subject: Re: Hadoop User Group (Bay Area) Oct 15th
 To: core-user@hadoop.apache.org
 Date: Wednesday, October 15, 2008, 9:45 AM
 
 I would love to see the slides too. I am specially interested in
 implementing database joins with Map Reduce.
 
 On Wed, Oct 15, 2008 at 7:24 AM, Johan Oskarsson [EMAIL PROTECTED]
 wrote:
 
 Since I'm not based in San Francisco, I would love to see the slides
 from this meetup uploaded somewhere. The database join techniques talk in
 particular sounds very interesting to me.
 
 /Johan
 
 Ajay Anand wrote:
 The next Bay Area User Group meeting is scheduled for October 15th at
 Yahoo! 2821 Mission College Blvd, Santa Clara, Building 1, Training
 Rooms 3 & 4 from 6:00-7:30 pm.
 
 Agenda:
 - Exploiting database join techniques for analytics with Hadoop: Jun
 Rao, IBM
 - Jaql Update: Kevin Beyer, IBM
 - Experiences moving a Petabyte Data Center: Sriram Rao, Quantcast
 
 Look forward to seeing you there!
 Ajay
 
 
 



Mapper OutOfMemoryError Revisited !!

2008-04-11 Thread bhupesh bansal

Hi guys, I need to restart the discussion around
http://www.nabble.com/Mapper-Out-of-Memory-td14200563.html

 I saw the same OOM error in my map-reduce job in the map phase. 

1. I tried changing mapred.child.java.opts (bumped to 600M) 
2. io.sort.mb was kept at 100MB. 
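
For reference, a rough sketch of how those two knobs are set on the JobConf
("MyJob" is just a placeholder; they can equally go in the config XML):

  JobConf conf = new JobConf(MyJob.class);
  conf.set("mapred.child.java.opts", "-Xmx600m");  // heap for each child task JVM
  conf.setInt("io.sort.mb", 100);                  // map-side sort buffer, in MB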

I see the same errors still. 

I checked, with debugging, the size of keyValBuffer in collect(); it is
always less than io.sort.mb and is spilled to disk properly.

I tried changing the number of map tasks to a very high number so that the
input is split into smaller chunks. It helped for a while, as the map job got
a bit further (56% from 5%), but I still see the problem.

 I tried bumping mapred.child.java.opts to 1000M, and still got the same error.

I also tried using the -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED] value in
opts to get the gc log, but didn't get any log.

 I tried using 'jmap -histo pid' to see the heap information; it didn't give
me any meaningful or obvious problem point.

What are the other possible memory hogs during the mapper phase? Is the
input file chunk kept fully in memory?

Application: 

My map-reduce job is running with about 2G of input. In the mapper phase I
read each line and output [5-500] (key, value) pairs, so the intermediate
data is really blown up. Will that be a problem?

The error file is attached:
http://www.nabble.com/file/p16628181/error.txt
-- 
View this message in context: 
http://www.nabble.com/Mapper-OutOfMemoryError-Revisited-%21%21-tp16628181p16628181.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.