[ 
https://issues.apache.org/jira/browse/CRUNCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307241#comment-14307241
 ] 

Christian Tzolov edited comment on CRUNCH-491 at 2/5/15 2:12 PM:
-----------------------------------------------------------------

Tracking the raw bytes read from the input file is difficult when the 
high-level InputStreamReader is used to read the decoded characters. The raw 
byte count is needed by the FileSplit boundary logic (e.g. when the reader is 
in the middle of an XML element, it keeps reading until the end tag even if 
that crosses the split's end limit). 
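
As an illustration of the boundary rule described above, here is a minimal 
Java sketch (not the code from the attached patches). It works on an in-memory 
byte array instead of a Hadoop FileSplit, and the names SplitBoundarySketch, 
indexOf and recordsForSplit are hypothetical. The assumed rule: a record 
belongs to the split in which its start tag begins, and it is read through to 
its end tag even when that lies past the split end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    /** Index of the first occurrence of pattern in data at or after from, or -1. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the records whose start tag begins inside [splitStart, splitEnd).
     * A record that starts inside the split is read to its end tag, even if
     * the end tag lies beyond splitEnd.
     */
    static List<String> recordsForSplit(byte[] data, String startTag, String endTag,
                                        int splitStart, int splitEnd) {
        byte[] open = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] close = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(data, open, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // the next record, if any, belongs to a later split
            }
            int end = indexOf(data, close, begin + open.length);
            if (end < 0) {
                break; // unterminated element: nothing more to emit
            }
            int recordEnd = end + close.length; // may be greater than splitEnd
            records.add(new String(data, begin, recordEnd - begin, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<r>1</r><r>2</r><r>3</r>".getBytes(StandardCharsets.UTF_8);
        System.out.println(recordsForSplit(data, "<r>", "</r>", 0, 12));
        System.out.println(recordsForSplit(data, "<r>", "</r>", 12, 24));
    }
}
```

All positions here are byte offsets, which is why the reader needs an exact 
raw byte count and not a character count.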

Some observations:
- The number of decoded characters read from the InputStreamReader cannot be 
used to compute the number of raw bytes read from the input file, because in 
a multi-byte encoding a character may occupy more than one byte. 
- The InputStreamReader and the underlying StreamDecoder don't expose the 
number of raw bytes read from the input data. 
- The InputStreamReader wraps the FSDataInputStream, which provides a getPos() 
method. Unfortunately this method doesn't help, because the InputStreamReader 
reads the input data (FSDataInputStream) in chunks (8KB). The 
FSDataInputStream's position advances by 8KB before a single character has 
been read from the InputStreamReader. 
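
The observations above can be reproduced without Hadoop. The following is a 
small sketch in which a hypothetical CountingInputStream stands in for 
FSDataInputStream and its count field stands in for getPos(); the class and 
method names are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadAheadDemo {

    /** Counts the bytes that a consumer pulls from the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long count;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) {
                count++;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) {
                count += n;
            }
            return n;
        }
    }

    /** Bytes the reader pulled from the source in order to return one character. */
    static long bytesPulledForOneChar(byte[] data) throws IOException {
        CountingInputStream counting = new CountingInputStream(new ByteArrayInputStream(data));
        InputStreamReader reader = new InputStreamReader(counting, StandardCharsets.UTF_8);
        reader.read(); // ask for a single character only
        return counting.count;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("<a>\u00e9</a>"); // 8 characters, 9 bytes in UTF-8
        }
        String text = sb.toString();
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("chars=" + text.length() + " bytes=" + data.length);
        System.out.println("bytes pulled for one char=" + bytesPulledForOneChar(data));
    }
}
```

The character count and the byte count differ as soon as the input contains a 
non-ASCII character, and the source stream's position runs well ahead of the 
single character that was requested.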

It seems that Pig's XMLLoader implementation doesn't support encodings 
either. 

To resolve this I've hacked the JDK InputStreamReader and StreamDecoder by 
exposing the count of the bytes processed (see 
CrunchInputStreamReader#readBytesCount(), CrunchStreamDecoder#readBytesCount()).
(Does this violate any JDK license agreements?)
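
For comparison only, and not what the attached patches do: one possible 
alternative that avoids copying JDK classes is to drive a 
java.nio.charset.CharsetDecoder directly and feed it one byte at a time, so 
the byte count never runs ahead of the characters handed out. The sketch 
below makes several assumptions: the class name ByteCountingDecoder is 
hypothetical, malformed input is replaced instead of reported, and the small 
buffer sizes are chosen with UTF-8 in mind.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class ByteCountingDecoder {
    private final InputStream in;
    private final CharsetDecoder decoder;
    // Bytes of a not yet complete sequence; kept in "write" mode between calls.
    private final ByteBuffer pending = ByteBuffer.allocate(16);
    // Decoded but not yet returned characters; kept in "read" mode between calls.
    private final CharBuffer decoded = CharBuffer.allocate(4);
    private long bytesConsumed;
    private boolean eof;

    public ByteCountingDecoder(InputStream in, Charset charset) {
        this.in = in;
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        decoded.flip(); // start empty, in read mode
    }

    /** Raw bytes taken from the stream so far. */
    public long bytesConsumed() {
        return bytesConsumed;
    }

    /** Returns the next UTF-16 code unit, or -1 at end of input. */
    public int read() throws IOException {
        while (!decoded.hasRemaining()) {
            if (eof) {
                return -1;
            }
            int b = in.read();
            decoded.clear();
            if (b < 0) {
                // End of stream: let the decoder deal with any trailing bytes.
                eof = true;
                pending.flip();
                decoder.decode(pending, decoded, true);
                decoder.flush(decoded);
                pending.compact();
            } else {
                bytesConsumed++;
                pending.put((byte) b);
                pending.flip();
                // An incomplete sequence is left in pending until more bytes arrive.
                decoder.decode(pending, decoded, false);
                pending.compact();
            }
            decoded.flip();
        }
        return decoded.get();
    }
}
```

Because the count is kept by the decoder itself, the source could be wrapped 
in a BufferedInputStream to avoid single-byte reads on the file system stream 
without affecting the reported position. Whether the per-byte decoding cost is 
acceptable would need to be measured.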

The new XmlRecordReaderTest verifies the correctness of the raw byte position 
and the FileSplit range cases (e.g. reading past the split end when in the 
middle of an XML element). 

[~jwills], [~champgm] I'd appreciate it if you could review these changes. 
Do you think the encoding capabilities justify the additional complexity? Or 
should we revert to the original Mahout XmlInputFormat (i.e. no encoding 
support)? 









> Add an Xml File Source
> ----------------------
>
>                 Key: CRUNCH-491
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-491
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Christian Tzolov
>            Assignee: Christian Tzolov
>            Priority: Minor
>              Labels: inputformat, source, xml
>         Attachments: CRUNCH-491-1.patch, CRUNCH-491.patch, CRUNCH-491b.patch, 
> CRUNCH-491c.patch, CRUNCH-491d.patch
>
>
> Large XML documents that are composed of repetitive XML elements can be 
> broken into chunks delimited by the start and end tags of those elements.
> The XmlSource should process XML files and extract out the XML between the 
> pre-configured start / end tags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
