Thanks, I did post this question to that group. All XML documents are separated by a newline, so that shouldn't be the issue, I think.
On Wed, Feb 22, 2012 at 12:44 PM, <bejoy.had...@gmail.com> wrote:

> Hi Mohit
>
> I'm not an expert in Pig, and it'd be better to use the Pig user group for
> Pig-specific queries. I'll try to help you with some basic troubleshooting.
>
> It sounds strange that Pig's XMLLoader can't load larger XML files that
> consist of multiple blocks. Or is it that Pig is not able to load the
> concatenated files you are trying? If that is the case, it could be because
> you are just appending the contents of multiple XML files into a single file.
>
> Pig users can give you some workarounds for how they deal with loading
> small XML files efficiently.
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
> ------------------------------
> *From:* Mohit Anchlia <mohitanch...@gmail.com>
> *Date:* Wed, 22 Feb 2012 12:29:26 -0800
> *To:* <common-user@hadoop.apache.org>; <bejoy.had...@gmail.com>
> *Subject:* Re: Splitting files on new line using hadoop fs
>
> On Wed, Feb 22, 2012 at 12:23 PM, <bejoy.had...@gmail.com> wrote:
>
>> Hi Mohit
>>
>> AFAIK there is no default mechanism for this in Hadoop. A file is split
>> into blocks based solely on the configured block size during the HDFS
>> copy. When the file is processed with MapReduce, the record reader takes
>> care of the newlines, even when a line spans multiple blocks.
>>
>> Could you explain more about the use case that requires this during the
>> HDFS copy itself?
>
> I am using Pig's XMLLoader in piggybank to read XML files concatenated
> into a text file. But the Pig script doesn't work when the file is big
> enough that Hadoop splits it.
>
> Any suggestions on how I can make it work? Below is my simple script that
> I would like to enhance, once it starts working. Please note it works
> for small files.
> register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
>
> raw = LOAD '/examples/testfile5.txt' USING
> org.apache.pig.piggybank.storage.XMLLoader('<abc>') AS (document:chararray);
>
> dump raw;
>
>> ------Original Message------
>> From: Mohit Anchlia
>> To: common-user@hadoop.apache.org
>> ReplyTo: common-user@hadoop.apache.org
>> Subject: Splitting files on new line using hadoop fs
>> Sent: Feb 23, 2012 01:45
>>
>> How can I copy large text files using "hadoop fs" such that splits occur
>> based on blocks + newlines instead of blocks alone? Is there a way to do
>> this?
>>
>> Regards
>> Bejoy K S
>>
>> From handheld, Please excuse typos.
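[Editor's note: a minimal cleaned-up sketch of the script quoted above, under two assumptions worth flagging: documented piggybank usage passes the tag name to XMLLoader without angle brackets (e.g. 'abc', not '<abc>'), and the commented SET line is a commonly suggested workaround, not a confirmed fix, using the Hadoop 1.x property mapred.min.split.size to discourage input splitting so each file lands in a single map task. Paths and the tag name are the poster's own.]

```pig
-- Register piggybank so XMLLoader is available (poster's path).
register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar';

-- Optional workaround (assumption, Hadoop 1.x property name): force a
-- very large minimum split size so the input is not split mid-document.
-- SET mapred.min.split.size 1073741824;

-- Assumption: XMLLoader's documented usage takes the bare tag name,
-- so each <abc>...</abc> element becomes one record.
raw = LOAD '/examples/testfile5.txt'
      USING org.apache.pig.piggybank.storage.XMLLoader('abc')
      AS (document:chararray);

dump raw;
```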