Hi Steve,

To read XML, provide a custom InputFormat that extends FileInputFormat
and override its isSplitable method to return false. The file is then
never split, so each XML file is processed by exactly one mapper.


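  // Returning false tells the framework never to split this file,
  // so the whole file goes to a single mapper.
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }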
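
To make the effect concrete, here is a rough sketch in plain Java of how the
split decision changes the byte ranges handed to mappers. This is a simplified
model, not Hadoop's actual split code: the class SplitSketch, the method
planSplits, and the 200 MB / 64 MB sizes are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of split planning. NOT Hadoop's real FileInputFormat logic;
// SplitSketch and planSplits are hypothetical names used only for illustration.
class SplitSketch {

    // Returns {offset, length} pairs, one per mapper.
    static List<long[]> planSplits(long fileLength, long splitSize, boolean splitable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (fileLength <= 0) {
            return splits; // empty file: nothing to hand out in this model
        }
        if (!splitable) {
            // isSplitable == false: one split covering the whole file,
            // so no XML element can be cut at a chunk boundary.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // isSplitable == true: fixed-size byte ranges that ignore XML structure,
        // so a boundary may fall in the middle of an element.
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 200L * 1024 * 1024; // assumed 200 MB XML file
        long splitSize = 64L * 1024 * 1024;   // assumed 64 MB split size
        System.out.println("splitable:     " + planSplits(fileLength, splitSize, true).size() + " splits");
        System.out.println("not splitable: " + planSplits(fileLength, splitSize, false).size() + " split");
    }
}
```

The trade-off is that a single large XML file is no longer processed in
parallel, so this approach suits many moderately sized files better than one
huge file.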



Best Regards,

Jeff zhang



On Thu, Oct 29, 2009 at 12:32 PM, Steve Gao <steve....@yahoo.com> wrote:

>
> Does anybody have a similar issue? If you store XML files in HDFS, how
> can you make sure a chunk read by a mapper does not contain partial data of
> an XML segment?
>
> For example:
>
> <title>
> <book>book1</book>
> <author>me</author>
> ..............what if this is the boundary of a chunk?...................
> <year>2009</year>
> <book>book2</book>
>
> <author>me</author>
>
> <year>2009</year>
> <book>book3</book>
>
> <author>me</author>
>
> <year>2009</year>
> </title>
