Re: Using own InputSplit
You can add that sometimes the input file is too small and you don't get the desired parallelism.

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 27, 2011, at 12:25 PM, Harsh J wrote:

> Mohit,
>
> On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia wrote:
>> Actually this link confused me:
>>
>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
>>
>> "Clearly, logical splits based on input-size is insufficient for many
>> applications since record boundaries must be respected. In such cases,
>> the application should implement a RecordReader, who is responsible
>> for respecting record-boundaries and presents a record-oriented view
>> of the logical InputSplit to the individual task."
>>
>> But it looks like the application doesn't need to do that since it's
>> done by default? Or am I misinterpreting this entirely?
>
> For any InputFormat that Hadoop itself provides (text files, say
> \n-terminated, SequenceFiles, Avro data files), this is already handled
> for you. If you have a custom file format that defines its own record
> delimiter character(s), you would need to write your own InputFormat
> that handles splits properly (the wiki still helps on how to manage
> reads across the first split and the subsequent ones).
>
> --
> Harsh J
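[Editor's note] To make the custom-delimiter case above concrete, here is a minimal sketch of the boundary rule such a record reader typically follows. This is a plain-Java simulation, not actual Hadoop code, and the class and method names (`SplitReaderSketch`, `readSplit`) are hypothetical: a split that does not begin at offset 0 skips ahead past the first delimiter (the previous split's reader owns that partial record), and every reader finishes the record it started even if that runs past the split's end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a custom-delimiter RecordReader assigns
// records to splits without tearing any record in two.
public class SplitReaderSketch {

    static List<String> readSplit(byte[] data, int start, int length, byte delim) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Rule 1: a non-initial split discards bytes up to and including
        // the first delimiter; the previous split's reader owns that tail.
        if (start != 0) {
            while (pos < data.length && data[pos] != delim) pos++;
            pos++; // step over the delimiter itself
        }
        int end = start + length;
        // Rule 2: keep reading any record that *starts* inside the split,
        // even if it finishes beyond 'end'.
        while (pos < data.length && pos <= end) {
            int recStart = pos;
            while (pos < data.length && data[pos] != delim) pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++; // past the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa|bbbb|cc|dd".getBytes();
        // Two splits; the boundary at byte 5 falls inside "bbbb".
        System.out.println(readSplit(data, 0, 5, (byte) '|')); // prints [aa, bbbb]
        System.out.println(readSplit(data, 5, 8, (byte) '|')); // prints [cc, dd]
    }
}
```

Note that "bbbb" straddles the byte boundary, yet it arrives whole in exactly one split. The same two rules applied on both sides of every boundary guarantee each record is read exactly once.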
Re: Using own InputSplit
Mohit,

On Fri, May 27, 2011 at 10:44 PM, Mohit Anchlia wrote:
> Actually this link confused me:
>
> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input
>
> "Clearly, logical splits based on input-size is insufficient for many
> applications since record boundaries must be respected. In such cases,
> the application should implement a RecordReader, who is responsible
> for respecting record-boundaries and presents a record-oriented view
> of the logical InputSplit to the individual task."
>
> But it looks like the application doesn't need to do that since it's
> done by default? Or am I misinterpreting this entirely?

For any InputFormat that Hadoop itself provides (text files, say \n-terminated, SequenceFiles, Avro data files), this is already handled for you. If you have a custom file format that defines its own record delimiter character(s), you would need to write your own InputFormat that handles splits properly (the wiki still helps on how to manage reads across the first split and the subsequent ones).

--
Harsh J
Re: Using own InputSplit
Actually this link confused me:

http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Job+Input

"Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, the application should implement a RecordReader, who is responsible for respecting record-boundaries and presents a record-oriented view of the logical InputSplit to the individual task."

But it looks like the application doesn't need to do that since it's done by default? Or am I misinterpreting this entirely?

On Fri, May 27, 2011 at 10:08 AM, Mohit Anchlia wrote:
> Thanks! Just thought it was better to post to multiple groups together
> since I didn't know where it belongs :)
>
> On Fri, May 27, 2011 at 10:04 AM, Harsh J wrote:
>> Mohit,
>>
>> Please do not cross-post a question to multiple lists unless you're
>> announcing something.
>>
>> What you describe does not happen; the way splitting is done for text
>> files is explained in good detail here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> Hope this clears up your doubt :)
>>
>> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia wrote:
>>> I am new to Hadoop, and from what I understand, by default Hadoop
>>> splits the input into blocks. This might result in a record's line
>>> being split into two pieces and spread across two maps. For example,
>>> the line "abcd" might get split into "ab" and "cd". How can one
>>> prevent this in Hadoop and Pig? I am looking for examples of how to
>>> specify my own split so that it splits logically on the record
>>> delimiter and not on the block size. For some reason I am not able
>>> to find the right examples online.
>>
>> --
>> Harsh J
Re: Using own InputSplit
Just to clarify: the query did fit mapreduce-user, since it primarily dealt with how Map/Reduce operates over data :)

On Fri, May 27, 2011 at 10:38 PM, Mohit Anchlia wrote:
> Thanks! Just thought it was better to post to multiple groups together
> since I didn't know where it belongs :)
>
> On Fri, May 27, 2011 at 10:04 AM, Harsh J wrote:
>> Mohit,
>>
>> Please do not cross-post a question to multiple lists unless you're
>> announcing something.
>>
>> What you describe does not happen; the way splitting is done for text
>> files is explained in good detail here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> Hope this clears up your doubt :)
>>
>> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia wrote:
>>> I am new to Hadoop, and from what I understand, by default Hadoop
>>> splits the input into blocks. This might result in a record's line
>>> being split into two pieces and spread across two maps. For example,
>>> the line "abcd" might get split into "ab" and "cd". How can one
>>> prevent this in Hadoop and Pig? I am looking for examples of how to
>>> specify my own split so that it splits logically on the record
>>> delimiter and not on the block size. For some reason I am not able
>>> to find the right examples online.
>>
>> --
>> Harsh J

--
Harsh J
Re: Using own InputSplit
Thanks! Just thought it was better to post to multiple groups together since I didn't know where it belongs :)

On Fri, May 27, 2011 at 10:04 AM, Harsh J wrote:
> Mohit,
>
> Please do not cross-post a question to multiple lists unless you're
> announcing something.
>
> What you describe does not happen; the way splitting is done for text
> files is explained in good detail here:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> Hope this clears up your doubt :)
>
> On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia wrote:
>> I am new to Hadoop, and from what I understand, by default Hadoop
>> splits the input into blocks. This might result in a record's line
>> being split into two pieces and spread across two maps. For example,
>> the line "abcd" might get split into "ab" and "cd". How can one
>> prevent this in Hadoop and Pig? I am looking for examples of how to
>> specify my own split so that it splits logically on the record
>> delimiter and not on the block size. For some reason I am not able
>> to find the right examples online.
>
> --
> Harsh J
Re: Using own InputSplit
Mohit,

Please do not cross-post a question to multiple lists unless you're announcing something.

What you describe does not happen; the way splitting is done for text files is explained in good detail here:
http://wiki.apache.org/hadoop/HadoopMapReduce

Hope this clears up your doubt :)

On Fri, May 27, 2011 at 10:25 PM, Mohit Anchlia wrote:
> I am new to Hadoop, and from what I understand, by default Hadoop
> splits the input into blocks. This might result in a record's line
> being split into two pieces and spread across two maps. For example,
> the line "abcd" might get split into "ab" and "cd". How can one
> prevent this in Hadoop and Pig? I am looking for examples of how to
> specify my own split so that it splits logically on the record
> delimiter and not on the block size. For some reason I am not able
> to find the right examples online.

--
Harsh J
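[Editor's note] The rule the wiki page describes is why "abcd" is never torn into "ab" and "cd": the reader for a block that starts mid-line skips forward to the next newline (the previous block's reader finishes that line), and a reader whose block ends mid-line keeps reading until its last line is complete. The following is a plain-Java simulation of that rule, with hypothetical names (`WholeLineDemo`, `linesForBlock`), not actual Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of how block readers cooperate so that a line straddling a
// block boundary is delivered whole to exactly one mapper.
public class WholeLineDemo {

    static List<String> linesForBlock(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        // A non-initial block skips the tail of a line that the previous
        // block's reader will have read in full.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        // Read every line that starts inside the block, finishing the last
        // one even if it runs past 'end'.
        while (pos < data.length && pos <= end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "xy\nabcd\nz\n".getBytes();
        // Pretend the block boundary falls at byte 5, inside "abcd".
        System.out.println(linesForBlock(data, 0, 5));           // prints [xy, abcd]
        System.out.println(linesForBlock(data, 5, data.length)); // prints [z]
    }
}
```

Even though the boundary cuts "abcd" in half at the byte level, the first block's reader finishes the line and the second block's reader skips it, so no mapper ever sees "ab" or "cd".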
Using own InputSplit
I am new to Hadoop, and from what I understand, by default Hadoop splits the input into blocks. This might result in a record's line being split into two pieces and spread across two maps. For example, the line "abcd" might get split into "ab" and "cd". How can one prevent this in Hadoop and Pig? I am looking for examples of how to specify my own split so that it splits logically on the record delimiter and not on the block size. For some reason I am not able to find the right examples online.