Has anybody run into a similar issue? If you store XML files in HDFS, how can you make sure that a chunk read by a mapper does not contain only part of an XML segment?
For example:

<title>
  <book>book1</book> <author>me</author>
  .............. what if this is the boundary of a chunk? ...................
  <year>2009</year>
  <book>book2</book> <author>me</author> <year>2009</year>
  <book>book3</book> <author>me</author> <year>2009</year>
</title>
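To make the boundary problem concrete, here is a minimal sketch in plain Python (not Hadoop code) of one common rule for records that cross a chunk boundary: a reader owns every record whose start tag begins inside its chunk, and it keeps reading past the end of the chunk to finish that record. The next reader skips forward to the first start tag inside its own chunk. The names `read_records` and `split_records` are hypothetical, and treating `<book>` as the start of a record and `</year>` as its end is an assumption based on the example above. In a real job the reader would pull extra bytes from the file stream rather than slice an in-memory buffer.

```python
def read_records(data: bytes, start: int, end: int,
                 start_tag: bytes, end_tag: bytes):
    """Yield complete records whose start tag begins inside [start, end)."""
    pos = start
    while True:
        begin = data.find(start_tag, pos)
        # A record that begins at or after `end` belongs to the next chunk.
        if begin == -1 or begin >= end:
            return
        # Read past `end` if necessary to reach the closing tag.
        close = data.find(end_tag, begin)
        if close == -1:
            return  # truncated record at the end of the file
        stop = close + len(end_tag)
        yield data[begin:stop]
        pos = stop


def split_records(data: bytes, split_size: int,
                  start_tag: bytes = b"<book>", end_tag: bytes = b"</year>"):
    """Simulate one reader per fixed-size chunk and collect all records."""
    records = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        records.extend(read_records(data, start, end, start_tag, end_tag))
    return records


# Sample data mirroring the example above (record tags are an assumption).
data = (
    b"<title>"
    b"<book>book1</book><author>me</author><year>2009</year>"
    b"<book>book2</book><author>me</author><year>2009</year>"
    b"<book>book3</book><author>me</author><year>2009</year>"
    b"</title>"
)

# A 40-byte chunk size cuts through the middle of the records.
for record in split_records(data, 40):
    print(record.decode())
```

With this rule every record is emitted exactly once and always whole, no matter where the chunk boundaries fall. The cost is that a reader may have to fetch a few bytes that physically live in the next chunk.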