[
https://issues.apache.org/jira/browse/PIG-4617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072508#comment-15072508
]
Rohini Palaniswamy commented on PIG-4617:
-----------------------------------------
This is most likely due to PIG-3865 not handling multi-line XML correctly. We
also encountered the issue of all newlines in the data being stripped. PIG-4242
did fix the problems with data being missed, and replaced newlines with an
empty string to avoid data loss. Both issues are logic errors and are easily
fixed in the code. But I see a major problem with using LineReader to do the
job instead of the old approach of reading data through the loader's own
buffer: it will skip records that contain a newline if they occur at the end
of the split. So XMLLoader needs to switch back to the buffered reading used
before PIG-3865, while still retaining the regex matching, which is a good
feature.
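To illustrate the split-boundary point, here is a minimal sketch of record
reading that is not line-oriented. It is not the XMLLoader code: the class
SplitXmlReader and its methods are hypothetical, the file is modelled as an
in-memory byte array rather than a Hadoop input stream, and it assumes tags
without attributes and no nested elements of the same name. The rule it shows
is that a record belongs to the split in which its opening tag starts, and is
then read through to its closing tag even when that lies past the split end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Pig or piggybank.
class SplitXmlReader {

    // Returns every <tag>...</tag> element whose opening tag starts at a byte
    // offset in [splitStart, splitEnd). A record that starts inside the split
    // is read in full, newlines included, even if it ends after splitEnd.
    static List<String> readRecords(byte[] data, int splitStart, int splitEnd, String tag) {
        byte[] open = ("<" + tag + ">").getBytes(StandardCharsets.UTF_8);
        byte[] close = ("</" + tag + ">").getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int begin = indexOf(data, open, pos);
            if (begin < 0 || begin >= splitEnd) {
                break; // no more records, or the next one belongs to a later split
            }
            int endTag = indexOf(data, close, begin + open.length);
            if (endTag < 0) {
                break; // unterminated element; nothing complete left to emit
            }
            int stop = endTag + close.length;
            records.add(new String(data, begin, stop - begin, StandardCharsets.UTF_8));
            pos = stop;
        }
        return records;
    }

    // Naive byte-pattern search starting at offset 'from'; returns -1 if absent.
    private static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }
}
```

With two adjacent splits over the same data, each record is returned by
exactly one of them, so a record whose embedded newline sits right at the
split boundary is neither dropped nor duplicated.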
> XML loader is not working fine with pig 0.14 version
> ----------------------------------------------------
>
> Key: PIG-4617
> URL: https://issues.apache.org/jira/browse/PIG-4617
> Project: Pig
> Issue Type: Bug
> Components: piggybank, UI
> Reporter: vijayalakshmi karasani
> Priority: Blocker
>
> My old Pig script (to load and parse XML files), which ran successfully
> with Pig 0.13, does not run with Pig 0.14 and throws
> java.lang.IndexOutOfBoundsException: start 4, end 2, s.length() 2.
> Out of my 10 XML files, 2 run fine and the remaining 8 fail. All of these
> XML files ran successfully with Pig 0.13. Maybe the new version added
> more validation of XML well-formedness.
> My Code:
> REGISTER '/usr/hdp/current/pig-client/lib/piggybank.jar';
> C = LOAD '/common/data/dia/stepxml/*' using
> org.apache.pig.piggybank.storage.XMLLoader('Product') as (x:chararray);
> STORE C into '/common/data/dia/intermediate_xmls/Imn_Unique_both2';
> ERROR:
> 2015-06-30 13:12:28,409 FATAL [IPC Server handler 3 on 34318]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task:
> attempt_1434729076270_34899_m_000015_0 - exited :
> java.lang.IndexOutOfBoundsException: start 4, end 2, s.length() 2
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:476)
> at java.lang.StringBuffer.append(StringBuffer.java:309)
> Input(s):
> Failed to read data from "/common/data/dia/stepxml/*"
> Output(s):
> Failed to produce result in
> "/common/data/dia/intermediate_xmls/Imn_Unique_both2"