Hi Steve,

When you want to read XML, you should provide your own custom InputFormat that
extends FileInputFormat and overrides the isSplitable method so that files are
not split. That means one XML file per mapper, so no XML element can be cut at
a chunk boundary.

    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

Best Regards,
Jeff zhang

On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <steve....@yahoo.com> wrote:

> Does anybody have the similar issue? If you store XML files in HDFS, how
> can you make sure a chunk reads by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
> <author>me</author>
> <year>2009</year>
> <book>book3</book>
> <author>me</author>
> <year>2009</year>
> <title>
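
To make the boundary problem in the question concrete, here is a minimal plain-Java sketch. It is not Hadoop code and does not use the Hadoop API: the class name SplitBoundaryDemo, the helpers fixedSizeChunks and wholeFile, and the 37-byte chunk size are all hypothetical, chosen only so that the cut falls between <author> and <year> as in the example. It shows that cutting by byte offset separates a <year> from its <book>, while keeping the file as one chunk (the effect of isSplitable returning false) keeps the record together.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Hadoop.
final class SplitBoundaryDemo {

    // Cut the data into fixed-size byte ranges with no knowledge of XML
    // structure. Assumes ASCII input, so no multi-byte character is cut.
    static List<String> fixedSizeChunks(String data, int chunkSize) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - start);
            chunks.add(new String(bytes, start, len, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    // Keep the whole file as a single chunk, which mirrors what a
    // non-splittable input gives to one mapper.
    static List<String> wholeFile(String data) {
        List<String> chunks = new ArrayList<>();
        chunks.add(data);
        return chunks;
    }

    public static void main(String[] args) {
        String xml = "<book>book1</book><author>me</author><year>2009</year>";

        // The assumed 37-byte boundary falls right after </author>.
        for (String chunk : fixedSizeChunks(xml, 37)) {
            System.out.println("split chunk: " + chunk);
        }
        for (String chunk : wholeFile(xml)) {
            System.out.println("whole file:  " + chunk);
        }
    }
}
```

The trade-off is that an unsplittable file is processed by a single mapper, so very large XML files lose parallelism.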