How would you use a regular SAX parser to implement the "next" method in the RecordReader?

Regards,
Alan Ho

----- Original Message ----
From: Owen O'Malley <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, November 11, 2007 11:50:11 PM
Subject: Re: Building good XML parsing library Hadoop

On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:

> After looking long and hard for a good way to process XML, I've
> looked at the Streaming XML Record reader, and frankly - it doesn't
> look good.

Agreed, the Streaming XML record reader is a hack. My personal opinion is that the current design is broken enough to be problematic. I think the best approach would be to use a SAX parser and process each file as a single file split.

> I've been using a StAX parser (the one that comes with J2EE 5). DOM
> and SAX don't cut it because the RecordReader interface needs the
> ability to "pull" record by record.

I don't understand the problem. You should be able to implement the RecordReader interface with a SAX parser.

> 1. FileSplit - I'm not sure if I should even try to implement this
> capability. I'm working off the LineRecordReader example, and the
> low level manipulation of bytes seems really tricky. With StAX, I'm
> not able to track where in the file I've read up to, so I'm unable
> to figure out when to stop parsing a section of the file. The only
> way that I can see this work is to "extend" my own version of
> BufferInputStream to track how many bytes have been read.

With XML, you can't really start reading in the middle of the file. So I don't see any way to handle file splits that are less than a full file.

> 2. Should I even bother with JAXB? If it's cumbersome, then I'd
> rather not use it. Alternatively, when calling "next", the
> application returns a single record represented by XML.

I think JAXB would be overkill. A simple SAX parser should be fine, I think...

> 4. Am I re-inventing the wheel - has someone else done this?
> Please let me know.

I don't think anyone has done it yet.
If you can make a generally useful InputFormat, it would be nice to contribute it back.

-- Owen
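
On the question at the top of this thread (how a push-style SAX parser could sit behind a pull-style "next"), one possible approach is to run the SAX parse on a background thread and hand finished records to the caller through a bounded queue. The sketch below is only an illustration of that idea, not something proposed in the thread. SaxPullBridge, its constructor arguments, and the choice to return a record's text content as a String are all hypothetical. It uses only the JDK, so a real Hadoop RecordReader would wrap it and copy each record into its key and value objects instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: adapts SAX "push" callbacks to a pull-style next(). */
class SaxPullBridge {
    // End-of-input sentinel, compared by identity so it cannot clash with a real record.
    private static final String EOF = new String("EOF");

    // Bounded queue: the parser thread blocks when the consumer falls behind.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    private volatile Exception failure;
    private boolean done;

    /** Starts parsing in the background; recordTag is the element that delimits one record. */
    SaxPullBridge(InputStream in, String recordTag) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(in, new Handler(recordTag));
            } catch (Exception e) {
                failure = e;              // reported to the caller once the queue drains
            } finally {
                try {
                    queue.put(EOF);       // always signal the end, even after a failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "sax-producer");
        producer.setDaemon(true);         // do not keep the JVM alive if the reader is abandoned
        producer.start();
    }

    /** Returns the text of the next record, or null once the input is exhausted. */
    String next() throws IOException {
        if (done) {
            return null;
        }
        try {
            String record = queue.take();
            if (record == EOF) {
                done = true;
                if (failure != null) {
                    throw new IOException("XML parse failed", failure);
                }
                return null;
            }
            return record;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for a record", e);
        }
    }

    /** Collects the character data of each recordTag element (nested records are not handled). */
    private final class Handler extends DefaultHandler {
        private final String recordTag;
        private StringBuilder text;       // non-null only while inside a record element

        Handler(String recordTag) {
            this.recordTag = recordTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals(recordTag)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (qName.equals(recordTag) && text != null) {
                try {
                    queue.put(text.toString());   // blocks until the consumer has room
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new SAXException(e);
                }
                text = null;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String xml = "<log><entry>first</entry><entry>second</entry></log>";
        SaxPullBridge reader = new SaxPullBridge(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), "entry");
        for (String rec = reader.next(); rec != null; rec = reader.next()) {
            System.out.println(rec);
        }
    }
}
```

In the old org.apache.hadoop.mapred API, RecordReader.next(key, value) returns a boolean, so a wrapper would presumably return false where this sketch returns null. The bridge always reads the stream from the beginning, which matches the suggestion above to treat each file as a single split. The cost is an extra thread per reader; a StAX parser avoids that because it is pull-based already.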
