[ 
https://issues.apache.org/jira/browse/DRILL-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated DRILL-3712:
------------------------------------
    Description: 
We are unable to process files that OSX identifies as character set UTF16LE.  
After unzipping and converting to UTF8, we are able to process them fine.  There 
are CONVERT_TO and CONVERT_FROM commands that appear to address the issue, but 
we were unable to make them work on a gzipped or unzipped version of the UTF16 
file.  We were able to use CONVERT_FROM ok, but when we tried to wrap the 
results of that to cast as a date, or anything else, it failed.  Trying to work 
with it natively caused the double-byte nature to appear (a substring 1,4 
returns only the first two characters).
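
The substring symptom can be reproduced outside Drill. The snippet below is only an illustration, written for Python 3. It is not Drill's code path: it assumes a reader that treats every byte as one character, and the sample value is hypothetical.

```python
# Illustration only: mimic a reader that treats each byte as one character.
text = "2015-08-26"  # hypothetical sample value, not taken from the real data
raw = text.encode("utf-16-le")  # every ASCII character becomes 2 bytes

# Latin-1 maps each byte to exactly one character, so the NUL high bytes survive.
as_single_byte = raw.decode("latin-1")

# The first four "characters" are really '2', NUL, '0', NUL.
first_four = as_single_byte[0:4]
visible = first_four.replace("\x00", "")  # -> '20'

# Decoding with the correct charset first gives the expected substring.
decoded_properly = raw.decode("utf-16-le")[0:4]  # -> '2015'

print(repr(first_four))
print(visible)
print(decoded_properly)
```

With the bytes decoded as UTF-16-LE first, the four-character substring comes back whole, which is consistent with the file processing fine after conversion to UTF8.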

I cannot post the data because it is proprietary in nature, but I am posting 
this code that might be useful in re-creating an issue:

{code}
#!/usr/bin/env python
""" Generates a test psv file with some text fields encoded as UTF-16-LE. """
def write_utf16le_encoded_psv():
        total_lines = 10
        encoded = "Encoded B".encode("utf-16-le")
        with open("test.psv","wb") as csv_file:
                csv_file.write("header 1|header 2|header 3\n")
                for i in xrange(total_lines):
                        csv_file.write("value A"+str(i)+"|"+encoded+"|value C"+str(i)+"\n")

if __name__ == "__main__":
        write_utf16le_encoded_psv()
{code}
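
The script above targets Python 2 (`xrange`, and concatenating `str` with encoded bytes). For anyone reproducing on Python 3, a possible equivalent is sketched below. The `path` and `total_lines` parameters are additions for illustration and are not part of the original script.

```python
#!/usr/bin/env python3
"""Generates a test psv file with one text field encoded as UTF-16-LE (Python 3 sketch)."""


def write_utf16le_encoded_psv(path="test.psv", total_lines=10):
    # Only the middle field is UTF-16-LE; everything else is plain ASCII,
    # matching what the original Python 2 script writes.
    encoded = "Encoded B".encode("utf-16-le")
    with open(path, "wb") as psv_file:
        psv_file.write(b"header 1|header 2|header 3\n")
        for i in range(total_lines):
            prefix = ("value A%d|" % i).encode("ascii")
            suffix = ("|value C%d\n" % i).encode("ascii")
            psv_file.write(prefix + encoded + suffix)


if __name__ == "__main__":
    write_utf16le_encoded_psv()
```

Run with no arguments it writes `test.psv` to the current directory, so the `tar` step that follows applies unchanged.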


then:

tar zcvf test.psv.tar.gz test.psv




  was:
We are unable to process files that OSX identifies as character set UTF16LE.  
After unzipping and converting to UTF8, we are able to process them fine.  There 
are CONVERT_TO and CONVERT_FROM commands that appear to address the issue, but 
we were unable to make them work on a gzipped or unzipped version of the UTF16 
file.  We were able to use CONVERT_FROM ok, but when we tried to wrap the 
results of that to cast as a date, or anything else, it failed.  Trying to work 
with it natively caused the double-byte nature to appear (a substring 1,4 
returns only the first two characters).

I cannot post the data because it is proprietary in nature, but I am posting 
this code that might be useful in re-creating an issue:


#!/usr/bin/env python
""" Generates a test psv file with some text fields encoded as UTF-16-LE. """
def write_utf16le_encoded_psv():
        total_lines = 10
        encoded = "Encoded B".encode("utf-16-le")
        with open("test.psv","wb") as csv_file:
                csv_file.write("header 1|header 2|header 3\n")
                for i in xrange(total_lines):
                        csv_file.write("value A"+str(i)+"|"+encoded+"|value C"+str(i)+"\n")

if __name__ == "__main__":
        write_utf16le_encoded_psv()


then:

tar zcvf test.psv.tar.gz test.psv





> Drill does not recognize UTF-16-LE encoding
> -------------------------------------------
>
>                 Key: DRILL-3712
>                 URL: https://issues.apache.org/jira/browse/DRILL-3712
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.1.0
>         Environment: OSX, likely Linux. 
>            Reporter: Edmon Begoli
>             Fix For: Future
>
>
> We are unable to process files that OSX identifies as character set UTF16LE.  
> After unzipping and converting to UTF8, we are able to process them fine.  
> There are CONVERT_TO and CONVERT_FROM commands that appear to address the 
> issue, but we were unable to make them work on a gzipped or unzipped version 
> of the UTF16 file.  We were able to use CONVERT_FROM ok, but when we tried 
> to wrap the results of that to cast as a date, or anything else, it failed.  
> Trying to work with it natively caused the double-byte nature to appear (a 
> substring 1,4 returns only the first two characters).
> I cannot post the data because it is proprietary in nature, but I am posting 
> this code that might be useful in re-creating an issue:
> {code}
> #!/usr/bin/env python
> """ Generates a test psv file with some text fields encoded as UTF-16-LE. """
> def write_utf16le_encoded_psv():
>       total_lines = 10
>       encoded = "Encoded B".encode("utf-16-le")
>       with open("test.psv","wb") as csv_file:
>               csv_file.write("header 1|header 2|header 3\n")
>               for i in xrange(total_lines):
>                       csv_file.write("value A"+str(i)+"|"+encoded+"|value C"+str(i)+"\n")
> if __name__ == "__main__":
>       write_utf16le_encoded_psv()
> {code}
> then:
> tar zcvf test.psv.tar.gz test.psv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
