scaling experiments on a static cluster?

2008-03-12 Thread Chris Dyer
Hi Hadoop mavens-
I'm hoping someone out there will have a quick solution for me.  I'm
trying to run some very basic scaling experiments for a rapidly
approaching paper deadline on a Hadoop 0.16.0 cluster that has ~20 nodes
with 2 procs/node.  Ideally, I would want to run my code on clusters
of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
The problem is that I am not able to reconfigure the cluster (in the
long run, i.e., before a final version of the paper, I assume this
will be possible, but for now it's not).  Setting the number of
mappers/reducers does not seem to be a viable option, at least not in
the trivial way, since the physical layout of the input files makes
Hadoop run a different number of tasks than I request (most of my
jobs consist of multiple MR steps; the initial one always runs on a
relatively small data set that fits into a single block, so the Hadoop
framework does honor my task-count request on the first job, but on
the later ones it does not).
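
(For reference, a minimal sketch of how those task-count hints are set with
the old JobConf API; the driver class and the 512 MB split size are
illustrative only.  The map count is just a hint, since the actual number of
map tasks follows the input splits, whereas the reduce count is honored
exactly.)

  import org.apache.hadoop.mapred.JobConf;

  public class TaskCountDemo {
    public static void main(String[] args) {
      JobConf job = new JobConf(TaskCountDemo.class);
      job.setNumMapTasks(16);     // a hint: the real map count follows the input splits
      job.setNumReduceTasks(16);  // this one is honored exactly
      // One way to force fewer maps on a multi-block input is a larger minimum split:
      job.setLong("mapred.min.split.size", 512L * 1024 * 1024);  // illustrative 512 MB
    }
  }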

My questions:
1) Can I get around this limitation programmatically?  I.e., is there
a way to tell the framework to use only a subset of the nodes for DFS
/ mapping / reducing?
2) If not, what statistics would be good to report if I can only have
two data points -- a legacy single-core implementation of the
algorithms and a MapReduce version running on the full cluster?

Thanks for any suggestions!
Chris


Re: scaling experiments on a static cluster?

2008-03-12 Thread Chris Dyer
Thanks-- that should work.  I'll follow up with the cluster
administrators to see if I can get this to happen.  To rebalance the
file storage, can I just set the replication factor using hadoop dfs?
Chris
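
(A hedged sketch of doing the same thing programmatically rather than from
the shell; the path and replication factors are made up, and whether bumping
replication up and back down spreads the blocks evenly enough is worth
verifying on the namenode web UI.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class Rereplicate {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path file = new Path("/user/chris/input/part-00000");  // hypothetical file
      fs.setReplication(file, (short) 6);  // push extra replicas onto the new nodes
      // ... wait for re-replication to complete, then trim back down:
      fs.setReplication(file, (short) 3);
    }
  }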

On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning [EMAIL PROTECTED] wrote:

  What about just taking down half of the nodes and then loading your data
  into the remainder?  Should take about 20 minutes each time you remove nodes
  but only a few seconds each time you add some.  Remember that you need to
  reload the data each time (or rebalance it if growing the cluster) to get
  realistic numbers.

  My suggested procedure would be to take all but 2 nodes down, and then

  - run test
  - double number of nodes
  - rebalance file storage
  - lather, rinse, repeat



Re: runtime exceptions not killing job

2008-03-17 Thread Chris Dyer
I've noticed this behavior as well in 0.16.0 with RuntimeExceptions in general.

Chris

On Mon, Mar 17, 2008 at 6:14 PM, Matt Kent [EMAIL PROTECTED] wrote:
 I recently upgraded from Hadoop 0.14.1 to 0.16.1. Previously in 0.14.1,
  if a map or reduce task threw a runtime exception such as an NPE, the
  task, and ultimately the job, would fail in short order. I was running
  a job on my local 0.16.1 cluster today, and when the reduce tasks
  started throwing NPEs, the tasks just hung. Eventually they timed out
  and were killed, but is this expected behavior in 0.16.1? I'd prefer the
  job to fail quickly if NPEs are being thrown.

  Matt

  --
  Matt Kent
  Co-Founder
  Persai
  1221 40th St #113
  Emeryville, CA 94608
  [EMAIL PROTECTED]
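
(For what it's worth, the hang-then-kill behavior described above is bounded
by the per-job task timeout and retry settings; a hedged sketch using the old
configuration keys, with purely illustrative values.)

  import org.apache.hadoop.mapred.JobConf;

  public class FailFastConf {
    public static void main(String[] args) {
      JobConf job = new JobConf(FailFastConf.class);
      job.setInt("mapred.task.timeout", 60 * 1000);  // kill unresponsive tasks after 1 minute
      job.setInt("mapred.map.max.attempts", 1);      // don't re-run a failed map attempt
      job.setInt("mapred.reduce.max.attempts", 1);   // likewise for reduces
    }
  }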




using a set of MapFiles - getting the right partition

2008-03-20 Thread Chris Dyer
Hi all--
I would like to have a reducer generate a MapFile so that in later
processes I can look up the values associated with a few keys without
processing an entire sequence file.  However, if I have N reducers, I
will generate N different map files, so to pick the right map file I
will need to use the same partitioner as was used when partitioning
the keys to reducers (the reducer I have running emits one value for
each key it receives and no others).  Should this be done manually, i.e.,
something like readers[partitioner.getPartition(...)], or is there
another recommended method?
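
(If memory serves, MapFileOutputFormat in the old mapred API ships helpers
that do exactly this lookup; a hedged sketch, with the output directory and
key made up for illustration.)

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapFileOutputFormat;
  import org.apache.hadoop.mapred.Partitioner;
  import org.apache.hadoop.mapred.lib.HashPartitioner;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MapFileLookup.class);
      Path dir = new Path("/user/chris/mapfile-output");  // hypothetical reducer output dir
      MapFile.Reader[] readers =
          MapFileOutputFormat.getReaders(FileSystem.get(conf), dir, conf);

      // Must be the same partitioner class the job used when writing the MapFiles.
      Partitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
      Text key = new Text("some-key");
      Text value = new Text();
      // Internally this does readers[partitioner.getPartition(key, value, readers.length)].get(...)
      Writable result = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
      System.out.println(key + " -> " + result);
    }
  }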

Eventually, I'm going to migrate to using HBase to store the key/value
pairs (since I'd like to take advantage of HBase's ability to cache common
pairs in memory for faster retrieval), but I'm interested in seeing
what the performance is like just using MapFiles.

Thanks,
Chris


Re: Getting map output as final output by setting number of reduce to zero

2008-04-30 Thread Chris Dyer
Setting the number of reducers to zero has the advantage that no
sorting of the intermediate values will occur, which can save
considerable amounts of time if the output from the mapper is large.

Chris
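
(To make the difference concrete, a minimal sketch of the two configurations
with the old JobConf API; the driver class name is a placeholder, and only
the first option skips the sort and shuffle.)

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class ReducerChoice {
    public static void main(String[] args) {
      JobConf job = new JobConf(ReducerChoice.class);

      boolean skipReduce = true;  // flip to compare the two behaviors
      if (skipReduce) {
        // No reduce phase at all: map output goes straight to the output path,
        // unsorted, one part file per map task.
        job.setNumReduceTasks(0);
      } else {
        // Pass-through reducer: output is still partitioned, shuffled, and
        // sorted before being written, one part file per reduce task.
        job.setReducerClass(IdentityReducer.class);
      }
    }
  }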

On Wed, Apr 30, 2008 at 6:47 PM, jkupferman [EMAIL PROTECTED] wrote:

  You actually don't need to do that; you can just use an IdentityReducer. The
  IdentityReducer simply acts as a pass-through, so whatever is written out
  from your mapper is passed through it and then written out to a file.


  vibhooti wrote:
  
   Has anyone tried setting the number of reduces to zero and getting the
   map output as the final output?
   I tried doing the same, but my map output does not appear in the
   specified output path for mapred.
   Let me know if someone has already done that. I am not able to find out
   where my map outputs are written.
   http://hadoop.apache.org/core/docs/r0.16.3/mapred_tutorial.html says the
   following:
  
   Reducer NONE
  
   It is legal to set the number of reduce-tasks to *zero* if no reduction
   is desired.
  
   In this case the outputs of the map-tasks go directly to the FileSystem,
   into the output path set by setOutputPath(Path)
   (http://hadoop.apache.org/core/docs/r0.16.3/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath%28org.apache.hadoop.fs.Path%29).
   The framework does not sort the map-outputs before writing them out to
   the FileSystem.
  
  
   --
   cheers,
   Vibhooti
  
  

  --
  View this message in context: 
 http://www.nabble.com/Getting-map-ouput-as-final-output-by-setting-number-of-reduce-to-zero-tp16847897p16992742.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Know how many records remain?

2008-08-21 Thread Chris Dyer
Qin's question actually raises an issue-- it seems that having a close()
method that neither throws IOException nor gives the user access to the
OutputCollector object makes this important piece of functionality (from
a client's perspective) hard to use.  Does anyone feel strongly about
altering the contract so that close() throws IOException and provides
the implementer with the OutputCollector object?

On Wed, Aug 20, 2008 at 1:43 PM, Qin Gao [EMAIL PROTECTED] wrote:
 Thanks Chris, that's exactly what I am trying to do. It solves my problem.

 On Wed, Aug 20, 2008 at 4:36 PM, Chris Dyer [EMAIL PROTECTED] wrote:

 Qin, since I can guess what you're trying to do with this (emit a
 bunch of expected counts at the end of EM?), you can write output
 during the call to close().  It involves having to store the output
 collector object as a member of the class, but this is a way to do a
 final flush on the object before it is destroyed.

 Chris
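
(A minimal sketch of that pattern against the old mapred API; the counting
logic and types are invented, the point being to stash the collector in map()
and flush it in close(), wrapping the IOException since close() does not
declare it.)

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class FlushingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private OutputCollector<Text, LongWritable> out;  // stashed for the final flush
    private long count = 0;                           // whatever partial state you accumulate

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter) {
      this.out = output;  // same collector instance on every call
      count++;
    }

    public void close() {
      try {
        if (out != null) {
          out.collect(new Text("TOTAL"), new LongWritable(count));  // final flush
        }
      } catch (IOException e) {
        throw new RuntimeException(e);  // per the discussion above, close() can't rethrow this
      }
    }
  }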

 On Wed, Aug 20, 2008 at 7:02 PM, Qin Gao [EMAIL PROTECTED] wrote:
  Hi mailing,
 
  Is there any way to know whether the mapper is processing the last record
  assigned to this node, or to know how many records remain to be processed
  on this node?
 
 
  Qin
 




Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-17 Thread Chris Dyer
Hi all-
One more question.

I'm looking for a lightweight way to serve data stored as key-value
pairs in a series of MapFiles or SequenceFiles.  HBase/Hypertable
offer a very robust, powerful solution to this problem with a bunch of
extra features like updates and column types, etc., that I don't need
at all.  But, I'm wondering if there might be something
ultra-lightweight that someone has come up with for a very restricted
(but important!) set of use cases.  Basically, I'd like to be able to
load the entire contents of a key-value map file in DFS into
memory across many machines in my cluster so that I can access any of
it with ultra-low latencies.  I don't need updates--I just need
ultra-fast queries into a very large hash map (actually, just an array
would be sufficient).  This would correspond, approximately, to the
SSTable functionality that BigTable is implemented on top of, but
which is also useful for many, many things directly (refer to the
BigTable paper or
http://www.techworld.com/storage/features/index.cfm?featureid=3183).

This question may be better targeted to the HBase community, if so,
please let me know.  Has anyone else tried to deal with this?

Thanks--
Chris


Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Chris Dyer
Memcached looks like it would be a reasonable solution for my problem,
although it's not optimal since it doesn't support an easy way of
initializing itself at startup, but I can work around that.  This may
be wishful thinking, but does anyone have any experience using the
Hadoop job/task framework to launch supporting tasks like memcached
processes?  Is there anyone else thinking about issues of scheduling
other kinds of tasks (other than mappers and reducers) in Hadoop?

Thanks--
Chris

On Fri, Sep 19, 2008 at 2:53 PM, Alex Feinberg [EMAIL PROTECTED] wrote:
 Do any of CouchDB/Cassandra/other frameworks specifically do in-memory 
 serving?
 I haven't found any that do this explicitly. For now I've been using
 memcached for that
 functionality (with the usual memcached caveats).

 Ehcache may be another memcache-like solution
 (http://ehcache.sourceforge.net/), but
 it also provides on-disk storage in addition to in-memory (thus avoiding
 the "if a machine goes down, data is lost" issue of memcached).

 On Fri, Sep 19, 2008 at 10:54 AM, James Moore [EMAIL PROTECTED] wrote:
 On Wed, Sep 17, 2008 at 10:05 PM, Chris Dyer [EMAIL PROTECTED] wrote:
 I'm looking for a lightweight way to serve data stored as key-value
 pairs in a series of MapFiles or SequenceFiles.

 Might be worth taking a look at CouchDB as well.  Haven't used it
 myself, so can't comment on how it might work for what you're
 describing.

 --
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com




 --
 Alex Feinberg
 Platform Engineer, SocialMedia Networks



streaming silently failing when executing binaries with unresolved dependencies

2008-10-02 Thread Chris Dyer
Hi all-
I am using streaming with some c++ mappers and reducers.  One of the
binaries I attempted to run this evening had a dependency on a shared
library that did not exist on my cluster, so it failed during
execution.  However, the streaming framework didn't appear to
recognize this failure, and the job tracker indicated that the mapper
returned success, but did not produce any results.  Has anyone else
encountered this issue?  Should I open a JIRA issue about this?  I'm
using Hadoop 0.17.2.
Thanks-
Chris


Re: Can anyone recommend me a inter-language data file format?

2008-11-03 Thread Chris Dyer
I've been using protocol buffers to serialize the data and then
encoding them in base64 so that I can treat them like text.  This
obviously isn't optimal, but I'm assuming that this is only a short
term solution which won't be necessary when non-Java clients become
first class citizens of the Hadoop world.

Chris
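
(A hedged sketch of the base64-over-text trick, assuming a protoc-generated
message class named MyRecord -- hypothetical -- and Apache commons-codec on
the classpath; any other base64 codec would do.  Standard, unchunked base64
emits no tabs or newlines, so the encoded records pass safely through
TextOutputFormat and streaming.)

  import org.apache.commons.codec.binary.Base64;

  public class ProtoAsText {
    // MyRecord is a stand-in for your protoc-generated message class.
    static String encode(MyRecord record) throws Exception {
      return new String(Base64.encodeBase64(record.toByteArray()), "US-ASCII");
    }

    static MyRecord decode(String line) throws Exception {
      return MyRecord.parseFrom(Base64.decodeBase64(line.getBytes("US-ASCII")));
    }
  }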

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:

 Protocol buffers, thrift?


 On 11/3/08 4:07 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 Zhou, Yunqing wrote:
 An embedded database cannot handle large-scale data; it's not very
 efficient.  I have about 1 billion records, and these records should be
 passed through some modules.  I mean a data exchange format similar to
 XML but more flexible and efficient.


 JSON
 CSV
 erlang-style records (name,value,value,value)
 RDF-triples in non-XML representations

 For all of these, you need to test with data that includes things like
 high unicode characters, single and double quotes, to see how well they
 get handled.

 you can actually append with XML by not having opening/closing root tags;
 just stream out the entries to the tail of the file:

   <entry>...</entry>

 To read this in an XML parser, include it inside another XML file:

   <?xml version="1.0"?>
   <!DOCTYPE log [
     <!ENTITY log SYSTEM "log.xml">
   ]>

   <file>
   &log;
   </file>

 I've done this for very big files; as long as you aren't trying to load
 it in-memory into a DOM, things should work.

 --
 Steve Loughran  http://www.1060.org/blogxter/publish/5
 Author: Ant in Action   http://antbook.org/





Re: Counters missing from jobdetails.jsp page?

2008-11-22 Thread Chris Dyer
I've noticed that the counters sometimes disappear from the jobdetails
page, but the job context object that is returned at the end of the
job always seems to have the correct values for my counters, which has
been my primary concern...
-Chris
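
(For completeness, a minimal sketch of reading the counters off the finished
job rather than trusting the web page; the counter enum is hypothetical and
the job setup is elided.)

  import org.apache.hadoop.mapred.Counters;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RunningJob;

  public class CounterCheck {
    enum MyCounters { RECORDS_SEEN }  // hypothetical; incremented via reporter.incrCounter(...)

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(CounterCheck.class);
      // ... set mapper, reducer, input and output paths here ...
      RunningJob job = JobClient.runJob(conf);  // blocks until the job completes
      Counters counters = job.getCounters();
      System.out.println("records seen: " + counters.getCounter(MyCounters.RECORDS_SEEN));
    }
  }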

On Sat, Nov 22, 2008 at 3:26 PM, Arthur van Hoff [EMAIL PROTECTED] wrote:
 Does the absence of any response mean that I am the only one experiencing 
 this?

 On Thu, Nov 20, 2008 at 5:24 PM, Arthur van Hoff [EMAIL PROTECTED] wrote:
 Hi,

 My map job has some user-defined counters; these are displayed
 correctly after the job is finished, but while the job is running they
 only show up intermittently on the jobdetails.jsp page.  Occasionally I
 do see them, but often they are missing once the first map job is
 finished.

 Is this is a known issue?

 Is there a way to programmatically flush the counters every now and
 then, so that the jobtracker page displays intermediate results for
 long running jobs?

 Thanks.

 --
 Arthur van Hoff - Grand Master of Alphabetical Order
 The Ellerdale Project, Menlo Park, CA
 [EMAIL PROTECTED], 650-283-0842




 --
 Arthur van Hoff - Grand Master of Alphabetical Order
 The Ellerdale Project, Menlo Park, CA
 [EMAIL PROTECTED], 650-283-0842



Re: Run Map-Reduce multiple times

2008-12-07 Thread Chris Dyer
Hey Delip-
MapReduce doesn't really have any particular support for iterative
algorithms.  You just have to put a loop in the control program and
set the output path of the previous iteration to be the input path
in the next iteration.  This at least lets you control whether you
decide to keep around results of intermediate iterations or erase
them...
-Chris
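
(A minimal sketch of that control-program loop with the old mapred API; the
paths, iteration count, and job setup are placeholders.)

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class IterativeDriver {
    public static void main(String[] args) throws Exception {
      String base = "/user/delip/em";          // hypothetical working directory
      Path input = new Path(base + "/iter0");  // initial data lives here
      for (int i = 1; i <= 10; i++) {
        JobConf conf = new JobConf(IterativeDriver.class);
        // ... set mapper, reducer, and input/output formats here ...
        Path output = new Path(base + "/iter" + i);
        FileInputFormat.setInputPaths(conf, input);
        FileOutputFormat.setOutputPath(conf, output);
        JobClient.runJob(conf);                // blocks until this iteration finishes
        input = output;                        // next iteration reads what this one wrote
        // optionally delete the previous iteration's output here to reclaim space
      }
    }
  }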

On Mon, Dec 8, 2008 at 1:25 AM, Delip Rao [EMAIL PROTECTED] wrote:
 Hi,

 I need to run my map-reduce routines for several iterations so that
 the output of an iteration becomes the input to the next iteration. Is
 there a standard pattern to do this instead of calling
 JobClient.runJob() in a loop?

 Thanks,
 Delip