[
https://issues.apache.org/jira/browse/SQOOP-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308252#comment-14308252
]
Veena Basavaraj edited comment on SQOOP-1579 at 2/5/15 11:33 PM:
-----------------------------------------------------------------
After offline discussions with [~abec], we agreed that there are 2 issues to
resolve here.
1. IDF handling of special characters/bytes has an issue.
One example:
If a string has the byte 0x5C (the \ character), we encode it as \\ as per
the IDF spec
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation
When the text is converted back to an object, the reverse should happen, i.e.
if the TO side uses a getObjectData(..) method, it should get back the byte
0x5C in the given string.
The {code}toText{code} method in SqoopIDFUtils, which converts the CSV back to
object format, does not reverse the special characters to their byte
encodings; it tends to replace \\n with \n. This is a very important issue
and will be fixed in this JIRA.
2. The second part of the ticket is to enhance the HDFS Connector design. A new
JIRA will be added to support it so that loading into Hive is easier.
This JIRA will address the IDF issue, and all relevant unit tests will be
modified to reflect the correct expected behaviour:
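{code}
@Test
public void testExample4ToString() {
  // CSV text: single-quoted, with the double quote escaped as \"
  String test = "'test,\\\"test1'";
  // Expected: enclosing quotes stripped and \" unescaped to "
  String expectedString = "test,\"test1";
  String toString = toText(test);
  // JUnit convention: expected value first, actual value second
  assertEquals(expectedString, toString);
}
{code}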
It would also be good to have unit tests for all special character handling in
toText, not just for newline.
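To make the expected round trip concrete, here is a minimal sketch of escaping and unescaping in a single pass. It is not the SqoopIDFUtils code: the class IdfEscapeSketch and its escape/unescape helpers are hypothetical, and the set of escaped characters (backslash, newline, carriage return, double quote, single quote) is only an illustrative assumption, so the IDF spec linked above remains the reference for the full list.

```java
public class IdfEscapeSketch {

    // Hypothetical helper: escape raw characters and enclose the result in single quotes.
    // The characters handled here are an assumed subset, not the full IDF spec.
    public static String escape(String raw) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break; // 0x5C becomes \\
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '"':  sb.append("\\\""); break;
                case '\'': sb.append("\\'"); break;
                default:   sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    // Hypothetical helper: strip the enclosing quotes and reverse the escapes.
    // Assumes the input is at least two characters long and single-quoted.
    // Reading left to right means an escaped backslash followed by 'n'
    // is never mistaken for an escaped newline.
    public static String unescape(String csv) {
        String body = csv.substring(1, csv.length() - 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    default:  sb.append(next); // covers \\, \" and \'
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Raw value: a, backslash, n, b, comma, "c", then a real newline
        String raw = "a\\nb,\"c\"\n";
        String csv = escape(raw);
        System.out.println(csv);
        System.out.println(raw.equals(unescape(csv)));
    }
}
```

A unit test per special character could assert that unescape(escape(x)) returns x, which is the behaviour the TO side should see from getObjectData(..).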
> Sqoop2: Data transfer to load into Hive does not work
> -----------------------------------------------------
>
> Key: SQOOP-1579
> URL: https://issues.apache.org/jira/browse/SQOOP-1579
> Project: Sqoop
> Issue Type: Bug
> Components: sqoop2-hdfs-connector
> Reporter: Shakun Grover
> Assignee: Abraham Elmahrek
> Fix For: 1.99.5
>
> Attachments: SQOOP-1579.0.patch, SQOOP-1579.1.patch,
> SQOOP-1579.2.patch
>
>
> When we import many columns (say >20 columns) from an RDBMS to HDFS, Sqoop2
> inserts a new line in the output file. The newline appears at the end of
> certain columns; it doesn't seem to appear for every single column.
> When we try to view this data in Hive, it shows NULL at the newline
> separator.
> As per Abraham, this looks like a problem with unescaping the data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)