Edmon Begoli created DRILL-3712:
-----------------------------------
Summary: Drill does not recognize UTF-16-LE encoding
Key: DRILL-3712
URL: https://issues.apache.org/jira/browse/DRILL-3712
Project: Apache Drill
Issue Type: Bug
Components: Storage - Text & CSV
Affects Versions: 1.1.0
Environment: OSX, likely Linux.
Reporter: Edmon Begoli
Assignee: Steven Phillips
We are unable to process files that OSX identifies as having the character set
UTF-16LE. After unzipping such a file and converting it to UTF-8, we are able
to process it fine. There are CONVERT_TO and CONVERT_FROM functions that appear
to address the issue, but we were unable to make them work end to end on a
gzipped or unzipped version of the UTF-16 file. We were able to call
CONVERT_FROM on its own, but when we tried to wrap its result in a cast to a
date, or anything else, it failed. Working with the data natively exposed its
double-byte nature (a substring 1,4 returns only the first two characters).
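The double-byte behaviour and the UTF-8 workaround described above can be
sketched in Python. This is only an illustration under assumptions: the helper
name convert_utf16le_to_utf8 is hypothetical, the sketch assumes the whole
file is UTF-16-LE, and it assumes that Drill's substring was counting bytes
rather than characters, which is what the symptom suggests.

```python
import io

text = u"Encoded B"
raw = text.encode("utf-16-le")  # two bytes per character for this text

# Nine characters become eighteen bytes.
print(len(text), len(raw))  # prints: 9 18

# Slicing the first four *bytes* covers only two characters once decoded,
# which matches the "substring 1,4 returns two characters" symptom.
two_chars = raw[:4].decode("utf-16-le")
print(two_chars)  # prints: En


def convert_utf16le_to_utf8(src_path, dst_path):
    """Re-encode a file that is entirely UTF-16-LE as UTF-8 (hypothetical helper)."""
    # newline="" keeps the original line endings untouched.
    with io.open(src_path, "r", encoding="utf-16-le", newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)
```

A call such as convert_utf16le_to_utf8("input.psv", "input_utf8.psv") (both
file names are placeholders) would produce a file of the kind we were able to
query.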
I cannot post the data because it is proprietary in nature, but I am posting
this code that might be useful in re-creating the issue:
#!/usr/bin/env python
""" Generates a test psv file with some text fields encoded as UTF-16-LE. """

def write_utf16le_encoded_psv():
    total_lines = 10
    encoded = "Encoded B".encode("utf-16-le")
    with open("test.psv","wb") as csv_file:
        csv_file.write("header 1|header 2|header 3\n")
        for i in xrange(total_lines):
            csv_file.write("value A"+str(i)+"|"+encoded+"|value C"+str(i)+"\n")

if __name__ == "__main__":
    write_utf16le_encoded_psv()
then:
tar zcvf test.psv.tar.gz test.psv
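For anyone reproducing this on Python 3, the same test data could be generated
and compressed in one step. This is a rough sketch, not the script used for the
report: the function names build_test_psv and write_gzipped_psv, the output
name test.psv.gz, and the use of the gzip module in place of tar are all
assumptions made for illustration.

```python
import gzip


def build_test_psv(total_lines=10):
    """Return the bytes of a pipe-separated file whose middle field is UTF-16-LE."""
    encoded = "Encoded B".encode("utf-16-le")
    rows = [b"header 1|header 2|header 3\n"]
    for i in range(total_lines):
        n = str(i).encode("ascii")
        # ASCII fields on either side of one UTF-16-LE encoded field.
        rows.append(b"value A" + n + b"|" + encoded + b"|value C" + n + b"\n")
    return b"".join(rows)


def write_gzipped_psv(path="test.psv.gz", total_lines=10):
    """Write the test data gzip-compressed to `path` (a hypothetical file name)."""
    with gzip.open(path, "wb") as gz_file:
        gz_file.write(build_test_psv(total_lines))


if __name__ == "__main__":
    write_gzipped_psv()
```

Building the rows as bytes keeps the ASCII and UTF-16-LE parts explicit, which
Python 3 requires when writing to a binary file.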