[ https://issues.apache.org/jira/browse/CRUNCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christian Tzolov updated CRUNCH-491:
------------------------------------
    Attachment: CRUNCH-491f.patch

Apparently the assumption that one can't re-compute the original (raw) byte size of a given character is not correct. Following [~champgm]'s suggestion, using CharsetEncoder seems to do the trick.

{code}
public static class XmlRecordReader extends RecordReader<LongWritable, Text> {
  ...
  private int calculateCharacterByteLength(final char character) {
    try {
      // Encode the single char; the limit of the returned buffer is its byte length.
      return charsetEncoder.encode(CharBuffer.wrap(new char[] { character })).limit();
    } catch (final CharacterCodingException e) {
      throw new RuntimeException(inputEncoding);
    }
  }
}
{code}

> Add an Xml File Source
> ----------------------
>
>                 Key: CRUNCH-491
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-491
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491-1.patch, CRUNCH-491.patch, CRUNCH-491b.patch, CRUNCH-491c.patch, CRUNCH-491d.patch, CRUNCH-491f.patch
>
>
> Large XML documents that are composed of repetitive XML elements can be
> broken into chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract the XML between the
> pre-configured start / end tags.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
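
For readers who want to try the CharsetEncoder idea from the comment above outside of Crunch, here is a minimal standalone sketch. The class name `CharByteLengthSketch` and the method `byteLength` are hypothetical and are not part of the CRUNCH-491 patch; the sketch also creates a fresh encoder per call for simplicity, whereas the patch appears to reuse a `charsetEncoder` field.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not taken from the CRUNCH-491 patch.
class CharByteLengthSketch {

  // Returns the number of bytes a single char occupies when encoded in the given charset.
  static int byteLength(char character, Charset charset) {
    try {
      // encode(CharBuffer) returns a buffer ready for reading, so remaining()
      // is the number of encoded bytes.
      return charset.newEncoder()
          .encode(CharBuffer.wrap(new char[] { character }))
          .remaining();
    } catch (CharacterCodingException e) {
      // Keep the original exception as the cause so the failure can be diagnosed.
      throw new IllegalArgumentException("Cannot encode character in " + charset.name(), e);
    }
  }

  public static void main(String[] args) {
    System.out.println(byteLength('a', StandardCharsets.UTF_8));           // 1
    System.out.println(byteLength('\u00e9', StandardCharsets.UTF_8));      // 2
    System.out.println(byteLength('\u20ac', StandardCharsets.UTF_8));      // 3
    System.out.println(byteLength('\u00e9', StandardCharsets.ISO_8859_1)); // 1
  }
}
```

One caveat with any per-char approach: a char that is one half of a surrogate pair cannot be encoded on its own, so this sketch reports it as an error rather than returning a length. Whether that matters depends on the input data.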