[
https://issues.apache.org/jira/browse/SQOOP-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124387#comment-14124387
]
Peter Hannam edited comment on SQOOP-1495 at 9/6/14 7:55 AM:
-------------------------------------------------------------
I have attached a patch that checks whether the enclosing and escaping
parameters are left at their default (\000) and, if they are, ignores them.
was (Author: petehannam):
diff --git a/src/java/org/apache/sqoop/lib/RecordParser.java b/src/java/org/apache/sqoop/lib/RecordParser.java
index 7c29151..377d07b 100644
--- a/src/java/org/apache/sqoop/lib/RecordParser.java
+++ b/src/java/org/apache/sqoop/lib/RecordParser.java
@@ -230,7 +230,10 @@ public RecordParser(final com.cloudera.sqoop.lib.DelimiterSet delimitersIn) {
char recordDelim = delimiters.getLinesTerminatedBy();
char escapeChar = delimiters.getEscapedBy();
boolean enclosingRequired = delimiters.isEncloseRequired();
-
+ boolean enclosingAllowed = enclosingChar != com.cloudera.sqoop.lib.DelimiterSet.NULL_CHAR;
+ boolean escapeAllowed = escapeChar != com.cloudera.sqoop.lib.DelimiterSet.NULL_CHAR;
+
+
for (int pos = 0; pos < len; pos++) {
curChar = input.get();
switch (state) {
@@ -242,10 +245,10 @@ public RecordParser(final com.cloudera.sqoop.lib.DelimiterSet delimitersIn) {
}
sb = new StringBuilder();
- if (enclosingChar == curChar) {
+ if (enclosingAllowed && enclosingChar == curChar) {
// got an opening encloser.
state = ParseState.ENCLOSED_FIELD;
- } else if (escapeChar == curChar) {
+ } else if (escapeAllowed && escapeChar == curChar) {
state = ParseState.UNENCLOSED_ESCAPE;
} else if (fieldDelim == curChar) {
// we have a zero-length field. This is a no-op.
@@ -267,7 +270,7 @@ public RecordParser(final com.cloudera.sqoop.lib.DelimiterSet delimitersIn) {
break;
case ENCLOSED_FIELD:
- if (escapeChar == curChar) {
+ if (escapeAllowed && escapeChar == curChar) {
// the next character is escaped. Treat it literally.
state = ParseState.ENCLOSED_ESCAPE;
} else if (enclosingChar == curChar) {
@@ -282,7 +285,7 @@ public RecordParser(final com.cloudera.sqoop.lib.DelimiterSet delimitersIn) {
break;
case UNENCLOSED_FIELD:
- if (escapeChar == curChar) {
+ if (escapeAllowed && escapeChar == curChar) {
// the next character is escaped. Treat it literally.
state = ParseState.UNENCLOSED_ESCAPE;
} else if (fieldDelim == curChar) {
diff --git a/src/test/com/cloudera/sqoop/lib/TestRecordParser.java b/src/test/com/cloudera/sqoop/lib/TestRecordParser.java
index 8b11d39..ab76ed5 100644
--- a/src/test/com/cloudera/sqoop/lib/TestRecordParser.java
+++ b/src/test/com/cloudera/sqoop/lib/TestRecordParser.java
@@ -409,5 +409,13 @@ public void testRepeatedParse() throws RecordParser.ParseError {
assertListsEqual(null, list(strings2),
parser.parseRecord("foo,\"bar\""));
}
+
+ public void testTwoFieldsWithQuoteBeforeDelim() throws RecordParser.ParseError {
+ char[] input = new char[] {'A', (char) 0, '|', 'B'};
+
+ RecordParser parser = new RecordParser(new DelimiterSet('|', '\n', '\0', '\0', false));
+ String[] strings = {"A\u0000", "B"};
+ assertListsEqual(null, list(strings), parser.parseRecord(input));
+ }
}
> EnclosedBy and EscapedBy set to \000 are not ignored
> ----------------------------------------------------
>
> Key: SQOOP-1495
> URL: https://issues.apache.org/jira/browse/SQOOP-1495
> Project: Sqoop
> Issue Type: Bug
> Affects Versions: 1.4.5
> Reporter: Peter Hannam
> Priority: Minor
> Attachments: patch.diff
>
>
> In {{DelimiterSet}} there is the following comment above two option variables:
> {code:java}
> // If these next two fields are '\000', then they are ignored.
> private char enclosedBy;
> private char escapedBy;
> {code}
> We just found a problem with this whilst doing a Sqoop export, without
> setting the parameters for enclosing or escaping (i.e. they're left as
> default \000). Looking at the code in {{RecordParser}} it appears that
> although the comment says they would be ignored if set to \000 they actually
> aren't.
> For some reason some of the records we're trying to export have \000 in a
> column. This is fine as long as the \000 isn't just before the delimiter.
> This is fine {{foo\000bar|moo}} - two columns are exported.
> This isn't fine {{foo\000|bar}} - only one column is exported.
> Looking through {{RecordParser}}, the problem is that our \000 character is
> being treated as an enclosing character, so the parser then assumes the
> delimiter is part of a value. {{enclosedBy}} defaults to \000, which is
> meant to signal "ignore this setting", but when a \000 is encountered in the
> data it is still picked up.
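
The symptom described above can be reproduced outside Sqoop with a much
smaller parser. The following is only a rough sketch, not the real
RecordParser: the class name NullCharDelimiterSketch, the parse method and
its ignoreNullChar flag are made up for illustration, and enclosing is left
out entirely. It assumes, based on the UNENCLOSED_FIELD branch shown in the
diff, that a \000 in the middle of a field is consumed as an escape
character, which makes the following delimiter a literal part of the value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the parsing loop; not Sqoop code.
final class NullCharDelimiterSketch {
  static final char NULL_CHAR = '\000';

  // ignoreNullChar == false models the reported behaviour,
  // ignoreNullChar == true models the guard added by the patch.
  static List<String> parse(String input, char fieldDelim, char escapeChar,
      boolean ignoreNullChar) {
    boolean escapeAllowed = !ignoreNullChar || escapeChar != NULL_CHAR;
    List<String> fields = new ArrayList<>();
    StringBuilder sb = new StringBuilder();
    boolean escaped = false;
    for (int pos = 0; pos < input.length(); pos++) {
      char curChar = input.charAt(pos);
      if (escaped) {
        // the previous character was an escape: take this one literally.
        sb.append(curChar);
        escaped = false;
      } else if (escapeAllowed && escapeChar == curChar) {
        escaped = true;
      } else if (fieldDelim == curChar) {
        // end of the current field.
        fields.add(sb.toString());
        sb = new StringBuilder();
      } else {
        sb.append(curChar);
      }
    }
    fields.add(sb.toString());
    return fields;
  }

  public static void main(String[] args) {
    String input = "foo\000|bar";
    // Without the guard the \000 escapes the '|', giving one field.
    System.out.println(parse(input, '|', NULL_CHAR, false).size()); // 1
    // With the guard the \000 is ordinary data, giving two fields.
    System.out.println(parse(input, '|', NULL_CHAR, true).size()); // 2
  }
}
```

With the guard enabled the first field keeps its trailing \000, which matches
the {"A\u0000", "B"} expectation in the test added by the patch.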