On Wed, Feb 22, 2012 at 12:23 PM, <bejoy.had...@gmail.com> wrote:

> Hi Mohit
>
> AFAIK there is no default mechanism available for this in Hadoop. A file
> is split into blocks based only on the configured block size during the
> HDFS copy. When the file is later processed with MapReduce, the record
> reader takes care of the newlines, even when a line spans multiple
> blocks.
>
> Could you explain more about the use case that demands such a requirement
> during the hdfs copy itself?
>
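To make the record-reader rule concrete, here is a toy, self-contained sketch (plain Java, not Hadoop's actual LineRecordReader; the data and split points are made up): a reader whose split does not start at byte 0 skips its first, possibly partial, line, and every reader may read past the end of its split to finish the last line it started.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of how a line-oriented record reader handles splits.
// This is NOT Hadoop source code; it just mimics the two rules that
// keep lines intact across block boundaries.
public class SplitLineDemo {

    static List<String> readSplit(byte[] data, int start, int length) {
        List<String> lines = new ArrayList<String>();
        int pos = start;
        int end = start + length;
        if (start != 0) {
            // Rule 1: skip the (possibly partial) first line; the reader
            // of the previous split owns the line straddling the boundary.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step over the '\n'
        }
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            // Rule 2: keep consuming bytes until a newline, even if that
            // takes us past 'end' into the next split's territory.
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "one\ntwo two\nthree\n".getBytes(StandardCharsets.UTF_8);
        // Cut the "file" at byte 7, in the middle of the second line.
        System.out.println(readSplit(data, 0, 7));                // [one, two two]
        System.out.println(readSplit(data, 7, data.length - 7));  // [three]
    }
}

Between the two rules, every line is read exactly once no matter where the block boundary happens to fall.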
I am using Pig's XMLLoader in piggybank to read XML documents concatenated
in a text file. But the Pig script doesn't work when the file is big enough
that Hadoop splits it. Any suggestions on how I can make it work? Below is
my simple script, which I would like to enhance once it works at all.
Please note that it does work for small files.

register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar';
raw = LOAD '/examples/testfile5.txt' using org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
dump raw;

> ------Original Message------
> From: Mohit Anchlia
> To: common-user@hadoop.apache.org
> ReplyTo: common-user@hadoop.apache.org
> Subject: Splitting files on new line using hadoop fs
> Sent: Feb 23, 2012 01:45
>
> How can I copy large text files using "hadoop fs" so that splits occur
> based on blocks + new lines instead of blocks alone? Is there a way to
> do this?
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>
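For what it's worth, one common workaround when records must not be cut at block boundaries (a sketch under assumptions, not a fix confirmed anywhere in this thread) is to make the input non-splittable, so each file reaches a single mapper in one piece. The class name below is made up; the isSplitable override itself is the standard hook in Hadoop's new MapReduce API.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name; the isSplitable hook is real.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each file becomes exactly one input split, so a
        // record (e.g. an XML document) can never be cut at a block edge.
        return false;
    }
}

The trade-off is one mapper per file and no parallelism within a file, so this only pays off when individual files are of manageable size.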