[ 
https://issues.apache.org/jira/browse/SQOOP-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308252#comment-14308252
 ] 

Veena Basavaraj edited comment on SQOOP-1579 at 2/5/15 11:33 PM:
-----------------------------------------------------------------

After offline discussions with [~abec], we agreed that there are 2 issues to 
resolve here.

1. IDF handling of special characters/bytes has an issue.
One example:

If a string has a byte 0x5C (which is the \ character), we encode it to \\ as 
per the IDF spec:
https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+CSV+Intermediate+representation

When the text is converted back to an object, the reverse should happen, i.e. 
if a TO side uses a getObjectData(..) method, then it should get back the byte 
0x5C in the given string.
The {code}toText{code} method in SqoopIDFUtils, which converts the CSV back to 
object format, does not reverse the escaped special characters to their byte 
encodings; it only seems to replace \\n with \n. This is a very important issue 
and will be fixed in this JIRA.
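
To illustrate the round trip described above, here is a minimal sketch, not the actual SqoopIDFUtils code. The class name {{IdfEscapeSketch}}, its method names, and the set of escaped characters are assumptions chosen for illustration; the IDF spec linked above is the authority on the real mapping. The point it tries to show is that unescaping has to scan left to right and consume each backslash together with the character after it, otherwise an escaped backslash followed by the letter n would be misread as a newline.

```java
// Hypothetical sketch: NOT the real SqoopIDFUtils implementation.
final class IdfEscapeSketch {

  // Escape a raw string; the character set here is an assumed subset.
  public static String escape(String raw) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      switch (c) {
        case '\\': out.append("\\\\"); break; // 0x5C -> \\
        case '\n': out.append("\\n");  break; // newline -> \n
        case '\r': out.append("\\r");  break; // carriage return -> \r
        case '\'': out.append("\\'");  break; // single quote -> \'
        case '"':  out.append("\\\""); break; // double quote -> \"
        default:   out.append(c);
      }
    }
    return out.toString();
  }

  // Reverse of escape(): single left-to-right pass over the text.
  public static String unescape(String escaped) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < escaped.length(); i++) {
      char c = escaped.charAt(i);
      if (c != '\\' || i + 1 == escaped.length()) {
        out.append(c); // ordinary character, or a trailing lone backslash
        continue;
      }
      char next = escaped.charAt(++i); // consume the escaped character too
      switch (next) {
        case '\\': out.append('\\'); break;
        case 'n':  out.append('\n'); break;
        case 'r':  out.append('\r'); break;
        case '\'': out.append('\''); break;
        case '"':  out.append('"');  break;
        default:   out.append('\\').append(next); // unknown escape: keep as-is
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String raw = "a\\nb"; // four characters: a, backslash, n, b
    String roundTrip = unescape(escape(raw));
    System.out.println(raw.equals(roundTrip));
  }
}
```

A chain of independent string replacements would be shorter, but it can turn the escaped form of a backslash followed by n into a newline, which is the kind of mismatch this comment describes.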

2. The second part of the ticket is to enhance the HDFS Connector design so 
that loading into Hive is easier. A new JIRA will be added to cover it.

This JIRA will address the IDF issue, and all relevant unit tests will be 
modified to reflect the correct expected behaviour, for example:

{code}
  @Test
  public void testExample4ToString() {
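    // Escaped CSV text: a quoted string containing \" (backslash + double quote)
    String test = "'test,\\\"test1'";
    // After unescaping, the enclosing quotes and the escape backslash are gone
    String expectedString = "test,\"test1";
    String toString = toText(test);
    // JUnit convention: expected value first, actual value second
    assertEquals(expectedString, toString);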
  }
{code}

It would also be good to have unit tests for all of the special character 
handling in toText, not just newline.



> Sqoop2: Data transfer to load into Hive does not work
> -----------------------------------------------------
>
>                 Key: SQOOP-1579
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1579
>             Project: Sqoop
>          Issue Type: Bug
>          Components: sqoop2-hdfs-connector
>            Reporter: Shakun Grover
>            Assignee: Abraham Elmahrek
>             Fix For: 1.99.5
>
>         Attachments: SQOOP-1579.0.patch, SQOOP-1579.1.patch, 
> SQOOP-1579.2.patch
>
>
> When we import many columns (say, more than 20) from an RDBMS to HDFS, Sqoop2 
> inserts a new line in the output file. The newline appears at the end of 
> certain columns; it does not seem to appear for every single column.
> When we try to view this data in Hive, it shows NULL at the newline 
> separator.
> As per Abraham, this looks like a problem with unescaping the data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
