huh, this is interesting .. obviously I am not thinking about this whole thing right ..

so in your mapper you parse the line into tokens and set the appropriate values on your writable by constructor or setters .. and let hadoop do all the serialization and deserialization .. and you tell hadoop how to do that via the read and write methods .. okay, that makes more sense .. one last thing I still don't understand is what the proper implementation of the read and write methods is .. if I have a bunch of strings in my writable, then what should the read method implementation be .. I really appreciate the help from all you guys ..
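For the "bunch of strings" case, a minimal sketch of the symmetric pair (assuming plain Java String fields and Hadoop's WritableUtils helper; the field names are hypothetical):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class StringsWritable implements Writable {
    // hypothetical String fields
    private String user;
    private String url;

    public void write(DataOutput out) throws IOException {
        // WritableUtils length-prefixes each string on the stream
        WritableUtils.writeString(out, user);
        WritableUtils.writeString(out, url);
    }

    public void readFields(DataInput in) throws IOException {
        // read back in exactly the order write() wrote
        user = WritableUtils.readString(in);
        url = WritableUtils.readString(in);
    }
}

The stream carries no field names; the only contract is that readFields() consumes values in the same order write() produced them.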
On Wed, Feb 2, 2011 at 12:52 PM, David Sinclair <dsincl...@chariotsolutions.com> wrote:

> So create your writable as normal, and hadoop takes care of the
> serialization/deserialization between mappers and reducers.
>
> For example, MyWritable is the same as you had previously; then in your
> mapper, output that writable:
>
> class MyMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable> {
>
>     private MyWritable writable = new MyWritable();
>
>     protected void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         // parse text
>         writable.setCounter(parseddata);
>         writable.setTimestamp(parseddata);
>
>         // don't know what your key is
>         context.write(key, writable);
>     }
> }
>
> and make sure you set the key/value output:
>
> job.setMapOutputKeyClass(LongWritable.class);
> job.setMapOutputValueClass(MyWritable.class);
>
> dave
>
> On Wed, Feb 2, 2011 at 1:39 PM, Adeel Qureshi <adeelmahm...@gmail.com> wrote:
>
> > I'm reading text data and outputting text data, so yeah, it's all text ..
> > the reason why I wanted to use custom writable classes is not for the
> > mapper purposes .. you are right .. the easiest thing for me is to receive
> > the LongWritable and Text input in the mapper ... parse the text .. and
> > deal with it .. but where I am having trouble is in passing the parsed
> > information to the reducer .. right now I am putting a bunch of things as
> > text and sending the same LongWritable and Text output to the reducer, but
> > my text includes a bunch of things, e.g. several fields separated by a
> > delimiter .. this is the part that I am trying to improve .. instead of
> > sending a bunch of delimited text I want to send an actual object to my
> > reducer
> >
> > On Wed, Feb 2, 2011 at 12:33 PM, David Sinclair <dsincl...@chariotsolutions.com> wrote:
> >
> > > Are you storing your data as text or binary?
> > >
> > > If you are storing as text, your mapper is going to get keys of type
> > > LongWritable and values of type Text. Inside your mapper you would parse
> > > out the strings and wouldn't be using your custom writable; that is,
> > > unless you wanted your mapper/reducer to produce these.
> > >
> > > If you are storing as binary, e.g. SequenceFiles, you use the
> > > SequenceFileInputFormat and the sequence file reader will create the
> > > writables according to the mapper.
> > >
> > > dave
> > >
> > > On Wed, Feb 2, 2011 at 1:16 PM, Adeel Qureshi <adeelmahm...@gmail.com> wrote:
> > >
> > > > okay, so then the main question is how do I get the input line .. so
> > > > that I could parse it .. I am assuming it will then be passed to me via
> > > > a data input stream ..
> > > >
> > > > So in my readFields function .. I am assuming I will get the whole
> > > > line .. then I can parse it out and set my params .. something like this:
> > > >
> > > > readFields() {
> > > >     String line = in.readLine(); // read the whole line
> > > >
> > > >     // now apply the regular expression to parse it out
> > > >     data = pattern.group(1);
> > > >     time = pattern.group(2);
> > > >     user = pattern.group(3);
> > > > }
> > > >
> > > > Is that right???
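(As the replies below spell out, readFields() never sees the raw text line; the parsing belongs in map(). A sketch that fills in the "// parse text" step from David's example above, with hypothetical field positions and setters on MyWritable:)

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogLineMapper extends Mapper<LongWritable, Text, LongWritable, MyWritable> {

    private final MyWritable record = new MyWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // the raw line arrives here as Text, not inside readFields()
        String[] fields = value.toString().split(" "); // or a regular expression
        record.setUser(fields[3]);    // hypothetical positions and setters
        record.setUrl(fields[4]);
        context.write(key, record);   // the framework now calls write()/readFields()
    }
}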
> > > > On Wed, Feb 2, 2011 at 12:11 PM, Vijay <tec...@gmail.com> wrote:
> > > >
> > > > > Hadoop is not going to parse the line for you. Your mapper will take
> > > > > the line, parse it, and then turn it into your Writable so the next
> > > > > phase can just work with your object.
> > > > >
> > > > > Thanks,
> > > > > Vijay
> > > > >
> > > > > On Feb 2, 2011 9:51 AM, "Adeel Qureshi" <adeelmahm...@gmail.com> wrote:
> > > > >
> > > > > > thanks for your reply .. so let's say my input files are formatted
> > > > > > like this .. each line looks like this:
> > > > > >
> > > > > > DATE TIME SERVER USER URL QUERY PORT ...
> > > > > >
> > > > > > so to read this I would create a writable mapper
> > > > > >
> > > > > > public class MyMapper implements Writable {
> > > > > >     Date date;
> > > > > >     long time;
> > > > > >     String server;
> > > > > >     String user;
> > > > > >     String url;
> > > > > >     String query;
> > > > > >     int port;
> > > > > >
> > > > > >     readFields() {
> > > > > >         date = readDate(in); // not concerned with the actual date reading function
> > > > > >         time = readLong(in);
> > > > > >         server = readText(in);
> > > > > >         .....
> > > > > >     }
> > > > > > }
> > > > > >
> > > > > > but I still don't understand how hadoop is gonna know to parse my
> > > > > > line into these tokens .. instead of the map using the whole line as
> > > > > > one token
> > > > > >
> > > > > > On Wed, Feb 2, 2011 at 11:42 AM, Harsh J <qwertyman...@gmail.com> wrote:
> > > > > >
> > > > > > > See it this way:
> > > > > > >
> > > > > > > readFields(...) provides a DataInput stream that reads bytes from
> > > > > > > a binary stream, and write(...) provides a DataOutput stream that
> > > > > > > writes bytes to a binary stream.
> > > > > > >
> > > > > > > Now your data structure may be a complex one, perhaps an array of
> > > > > > > items or a mapping of some, or just a set of different types of
> > > > > > > objects. All you need to do is to think about how you would
> > > > > > > _serialize_ your data structure into a binary stream, so that you
> > > > > > > may _de-serialize_ it back from the same stream when required.
> > > > > > >
> > > > > > > About what goes where, I think looking up the definition of
> > > > > > > 'serialization' will help. It is all in the ordering. If you wrote
> > > > > > > A before B, you read A before B - simple as that.
> > > > > > >
> > > > > > > This, or you could use a neat serialization library like Apache
> > > > > > > Avro (http://avro.apache.org) and solve it in a simpler way with a
> > > > > > > schema. I'd recommend learning/using Avro for all
> > > > > > > serialization/de-serialization needs. Especially for Hadoop
> > > > > > > use-cases.
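(A concrete instance of that ordering rule for the "array of items" case Harsh mentions: write the count before the elements, so readFields() knows how many to read back. A sketch with a hypothetical tags field:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class TaggedWritable implements Writable {
    private List<String> tags = new ArrayList<String>();

    public void write(DataOutput out) throws IOException {
        out.writeInt(tags.size());                   // A: the count first
        for (String tag : tags) {
            WritableUtils.writeString(out, tag);     // B: then each element
        }
    }

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();                        // read A first ...
        tags = new ArrayList<String>(n);
        for (int i = 0; i < n; i++) {
            tags.add(WritableUtils.readString(in));  // ... then B, same order
        }
    }
}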
> > > > > > > On Wed, Feb 2, 2011 at 10:51 PM, Adeel Qureshi <adeelmahm...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I have been trying to understand how to write a simple custom
> > > > > > > > writable class, and I find the documentation available very vague
> > > > > > > > and unclear about certain things. okay, so here is the sample
> > > > > > > > writable implementation in the javadoc of the Writable interface:
> > > > > > > >
> > > > > > > > public class MyWritable implements Writable {
> > > > > > > >     // Some data
> > > > > > > >     private int counter;
> > > > > > > >     private long timestamp;
> > > > > > > >
> > > > > > > >     public void write(DataOutput out) throws IOException {
> > > > > > > >         out.writeInt(counter);
> > > > > > > >         out.writeLong(timestamp);
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     public void readFields(DataInput in) throws IOException {
> > > > > > > >         counter = in.readInt();
> > > > > > > >         timestamp = in.readLong();
> > > > > > > >     }
> > > > > > > >
> > > > > > > >     public static MyWritable read(DataInput in) throws IOException {
> > > > > > > >         MyWritable w = new MyWritable();
> > > > > > > >         w.readFields(in);
> > > > > > > >         return w;
> > > > > > > >     }
> > > > > > > > }
> > > > > > > >
> > > > > > > > so in the readFields function we are simply saying: read an int
> > > > > > > > from the datainput and put that in counter .. and then read a long
> > > > > > > > and put that in the timestamp variable .. what doesn't make sense
> > > > > > > > to me is what the format of DataInput is here .. what if there are
> > > > > > > > multiple ints and multiple longs .. how is the correct int gonna
> > > > > > > > go in counter .. what if the data I am reading in my mapper is a
> > > > > > > > string line .. and I am using a regular expression to parse the
> > > > > > > > tokens .. how do I specify which field goes where .. simply saying
> > > > > > > > readInt or readText .. how does that get connected to the right
> > > > > > > > stuff ..
> > > > > > > >
> > > > > > > > so in my case, like I said, I am reading from iis log files where
> > > > > > > > my mapper input is a log line which contains the usual log
> > > > > > > > information like date, time, user, server, url, qry, responseTime
> > > > > > > > etc .. I want to parse these into an object that can be passed to
> > > > > > > > the reducer instead of dumping all that information as text ..
> > > > > > > >
> > > > > > > > I would appreciate any help.
> > > > > > > > Thanks
> > > > > > > > Adeel
> > > > > > >
> > > > > > > --
> > > > > > > Harsh J
> > > > > > > www.harshj.com
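(To pull the thread together for the iis log case: the mapper parses the line, e.g. with the regular expression, and fills the object through setters; write() and readFields() then only have to agree on field order, exactly as in the minimal sketches above. A fuller sketch with assumed types for each field:)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class IisLogWritable implements Writable {
    private long timestamp;      // date + time collapsed to epoch millis (an assumption)
    private String server;
    private String user;
    private String url;
    private String query;
    private int port;
    private long responseTime;

    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        WritableUtils.writeString(out, server);
        WritableUtils.writeString(out, user);
        WritableUtils.writeString(out, url);
        WritableUtils.writeString(out, query);
        out.writeInt(port);
        out.writeLong(responseTime);
    }

    public void readFields(DataInput in) throws IOException {
        // same fields, same order -- that is the whole contract
        timestamp = in.readLong();
        server = WritableUtils.readString(in);
        user = WritableUtils.readString(in);
        url = WritableUtils.readString(in);
        query = WritableUtils.readString(in);
        port = in.readInt();
        responseTime = in.readLong();
    }

    // setters for the mapper to call after parsing (getters omitted)
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public void setServer(String server) { this.server = server; }
    public void setUser(String user) { this.user = user; }
    public void setUrl(String url) { this.url = url; }
    public void setQuery(String query) { this.query = query; }
    public void setPort(int port) { this.port = port; }
    public void setResponseTime(long responseTime) { this.responseTime = responseTime; }
}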