[ https://issues.apache.org/jira/browse/DRILL-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949503#comment-14949503 ]
Steven Phillips commented on DRILL-3712:
----------------------------------------

I think one solution would be to write a UDF to convert from UTF-16 to UTF-8. We already have a function that does the reverse: CastVarCharVar16Char.

> Drill does not recognize UTF-16-LE encoding
> -------------------------------------------
>
>                 Key: DRILL-3712
>                 URL: https://issues.apache.org/jira/browse/DRILL-3712
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.1.0
>         Environment: OSX, likely Linux.
>            Reporter: Edmon Begoli
>             Fix For: Future
>
>
> We are unable to process files that OSX identifies as character set UTF16LE.
> After unzipping and converting to UTF8, we are able to process it fine.
> There are CONVERT_TO and CONVERT_FROM commands that appear to address the
> issue, but we were unable to make them work on a gzipped or unzipped version
> of the UTF16 file. We were able to use CONVERT_FROM ok, but when we tried
> to wrap the results of that to cast as a date, or anything else, it failed.
> Trying to work with it natively caused the double-byte nature to appear (a
> substring 1,4 only returns the first two characters).
> I cannot post the data because it is proprietary in nature, but I am posting
> this code that might be useful in re-creating the issue:
> {noformat}
> #!/usr/bin/env python
> """ Generates a test psv file with some text fields encoded as UTF-16-LE. """
> def write_utf16le_encoded_psv():
>     total_lines = 10
>     encoded = "Encoded B".encode("utf-16-le")
>     with open("test.psv","wb") as csv_file:
>         csv_file.write("header 1|header 2|header 3\n")
>         for i in xrange(total_lines):
>             csv_file.write("value A"+str(i)+"|"+encoded+"|value C"+str(i)+"\n")
> if __name__ == "__main__":
>     write_utf16le_encoded_psv()
> {noformat}
> then:
> tar zcvf test.psv.tar.gz test.psv

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
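
For anyone trying to reproduce this outside Drill: the reporter's script is Python 2 (it uses xrange and writes str objects to a binary file). Below is a rough Python 3 sketch of the same mixed-encoding row, of the "substring 1,4 returns two characters" symptom, and of the UTF-16-LE to UTF-8 conversion that the proposed UDF would perform. The function names (build_row, utf16le_to_utf8, normalize_row) are made up for illustration and are not Drill APIs. Splitting on the raw "|" byte is an assumption that only holds because this sample's UTF-16 field contains no 0x7C byte.

```python
#!/usr/bin/env python3
"""Sketch: mixed ASCII / UTF-16-LE rows and their conversion to UTF-8."""


def build_row(i: int) -> bytes:
    """Build one data row like the test file: ASCII fields around a UTF-16-LE field."""
    encoded = "Encoded B".encode("utf-16-le")
    return (
        ("value A%d|" % i).encode("ascii")
        + encoded
        + ("|value C%d\n" % i).encode("ascii")
    )


def utf16le_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16-LE bytes as UTF-8 (what a conversion UDF would do)."""
    return raw.decode("utf-16-le").encode("utf-8")


def normalize_row(row: bytes) -> str:
    """Decode a row whose middle field is UTF-16-LE and whose outer fields are ASCII.

    Assumption: the UTF-16 field holds no 0x7C byte, so a byte-level split is safe.
    """
    first, middle, last = row.rstrip(b"\n").split(b"|")
    return "|".join(
        [first.decode("ascii"), middle.decode("utf-16-le"), last.decode("ascii")]
    )


if __name__ == "__main__":
    encoded = "Encoded B".encode("utf-16-le")
    # Each character takes two bytes, so the first four bytes hold two characters.
    print(encoded[:4].decode("utf-16-le"))  # En
    print(utf16le_to_utf8(encoded))
    print(normalize_row(build_row(0)))
```

Taking the first four bytes of the UTF-16-LE field yields only "En", which is consistent with the substring behaviour described in the issue if the function is operating on raw bytes rather than decoded characters.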