On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:
I've been looking long and hard for a good way to process XML. I've looked at the Streaming XML Record reader, and frankly - it doesn't look good.
Agreed, the Streaming XML record reader is a hack. My personal opinion is that the current design is broken enough to be problematic. I think the best approach would be to use a SAX parser and process each file as a single file split.
I've been using a StAX parser (the one that comes with J2EE 5). DOM and SAX don't cut it because the RecordReader interface needs the ability to "pull" record by record.
I don't understand the problem. You should be able to implement the RecordReader interface with a SAX parser.
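One way to bridge SAX's push model and a pull-style next() call is to let the SAX handler queue up completed records and hand them out one at a time. The following is only a rough sketch under some assumptions: the class name SaxRecordSource and the recordTag parameter are made up for illustration, it does not implement Hadoop's actual RecordReader interface, and it buffers every record of the file in memory, which may not be acceptable for large inputs.

```java
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Queue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Hadoop: parses one whole XML stream with SAX
// up front, then hands the collected records out one at a time, pull-style.
class SaxRecordSource {
    private final Queue<String> records = new ArrayDeque<>();

    // recordTag is an assumed element name that delimits one record.
    SaxRecordSource(InputStream in, final String recordTag) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a record

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals(recordTag)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals(recordTag) && text != null) {
                    // Store the trimmed character content of the finished record.
                    records.add(text.toString().trim());
                    text = null;
                }
            }
        };
        // The whole stream is parsed here, before the first call to next().
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    // Returns the next record's text, or null when no records remain.
    String next() {
        return records.poll();
    }
}
```

A real RecordReader could wrap something like this and fill in the key and value objects from each next() call. Running the SAX parser in a separate thread with a bounded queue would avoid holding the whole file in memory, at the cost of more complexity.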
1. FileSplit - I'm not sure if I should even try to implement this capability. I'm working off the LineRecordReader example, and the low-level manipulation of bytes seems really tricky. With StAX, I'm not able to track where in the file I've read up to, so I'm unable to figure out when to stop parsing a section of the file. The only way that I can see this working is to "extend" my own version of BufferedInputStream to track how many bytes have been read.
With XML, you can't really start reading in the middle of the file. So I don't see any way to handle file splits that are less than a full file.
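For reference, the byte-tracking stream wrapper described in point 1 could be sketched roughly as below. This is an illustration only: the class name CountingInputStream and the getBytesRead() method are made up here, and mark/reset are not accounted for. Note also that an XML parser typically reads ahead in large chunks, so the count says how much the parser has consumed from the stream, not where the current record ends.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical wrapper: counts the bytes that pass through it.
class CountingInputStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            bytesRead++; // one byte consumed; -1 means end of stream
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // FilterInputStream.read(byte[]) delegates here, so it is counted too.
        int n = super.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        bytesRead += skipped; // skipped bytes also advance the position
        return skipped;
    }

    // Total number of bytes consumed from the underlying stream so far.
    long getBytesRead() {
        return bytesRead;
    }
}
```

Such a wrapper would be placed between the file stream and the parser. Since the file has to be processed as a single split anyway, the count is probably more useful for progress reporting than for deciding where to stop.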
2. Should I even bother with JAXB? If it's cumbersome, then I'd rather not use it. Alternatively, when calling "next", the application returns a single record represented by XML.
I think JAXB would be overkill. A simple SAX parser should be fine, I think...
4. Am I re-inventing the wheel - has someone else done this? Please let me know.
I don't think anyone has done it yet. If you can make a generally useful InputFormat, it would be nice to contribute it back.
-- Owen
