[ https://issues.apache.org/jira/browse/DRILL-3712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949489#comment-14949489 ]
Deneche A. Hakim edited comment on DRILL-3712 at 10/8/15 10:13 PM:
-------------------------------------------------------------------

[~ebegoli] I did the following using the latest master:
- I used your script to create a test.psv file
- I created a gzipped version of the file (just .gz, not tar.gz)
- I updated the "psv" definition in my dfs storage plugin like this:

{noformat}
"psv": {
  "type": "text",
  "extensions": [
    "tbl",
    "psv"
  ],
  "skipFirstLine": true,
  "delimiter": "|"
}
{noformat}

Here are the results I get when I query the file:

{noformat}
0: jdbc:drill:zk=local> select * from dfs.data.`test.psv.gz`;
+--------------------------------------------------------------------------------------------+
|                                          columns                                           |
+--------------------------------------------------------------------------------------------+
| ["value A0","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C0"]  |
| ["value A1","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C1"]  |
| ["value A2","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C2"]  |
| ["value A3","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C3"]  |
| ["value A4","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C4"]  |
| ["value A5","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C5"]  |
| ["value A6","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C6"]  |
| ["value A7","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C7"]  |
| ["value A8","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C8"]  |
| ["value A9","E\u0000n\u0000c\u0000o\u0000d\u0000e\u0000d\u0000 \u0000B\u0000","value C9"]  |
+--------------------------------------------------------------------------------------------+
10 rows selected (0.136 seconds)
{noformat}

{noformat}
0: jdbc:drill:zk=local> select columns[0], columns[1], columns[2] from dfs.data.`test.psv.gz`;
+-----------+---------------------+-----------+
|  EXPR$0   |       EXPR$1        |  EXPR$2   |
+-----------+---------------------+-----------+
| value A0  | Encoded B           | value C0  |
| value A1  | Encoded B           | value C1  |
| value A2  | Encoded B           | value C2  |
| value A3  | Encoded B           | value C3  |
| value A4  | Encoded B           | value C4  |
| value A5  | Encoded B           | value C5  |
| value A6  | Encoded B           | value C6  |
| value A7  | Encoded B           | value C7  |
| value A8  | Encoded B           | value C8  |
| value A9  | Encoded B           | value C9  |
+-----------+---------------------+-----------+
10 rows selected (0.194 seconds)
{noformat}

Do you have more details about how to reproduce the issues you are seeing?


> Drill does not recognize UTF-16-LE encoding
> -------------------------------------------
>
>                 Key: DRILL-3712
>                 URL: https://issues.apache.org/jira/browse/DRILL-3712
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Text & CSV
>    Affects Versions: 1.1.0
>        Environment: OSX, likely Linux.
>           Reporter: Edmon Begoli
>            Fix For: Future
>
>
> We are unable to process files that OSX identifies as character set UTF-16-LE.
> After unzipping and converting to UTF-8, we are able to process them fine.
> There are CONVERT_TO and CONVERT_FROM functions that appear to address the
> issue, but we were unable to make them work on a gzipped or unzipped version
> of the UTF-16 file.
> We were able to use CONVERT_FROM ok, but when we tried to wrap the results of
> that in a cast to a date, or anything else, it failed.
> Trying to work with it natively caused the double-byte nature to appear (a
> substring(1,4) only returns the first two characters).
> I cannot post the data because it is proprietary in nature, but I am posting
> this code that might be useful in re-creating the issue:
> {noformat}
> #!/usr/bin/env python
> """ Generates a test psv file with some text fields encoded as UTF-16-LE. """
>
> def write_utf16le_encoded_psv():
>     total_lines = 10
>     encoded = "Encoded B".encode("utf-16-le")
>     with open("test.psv", "wb") as csv_file:
>         csv_file.write("header 1|header 2|header 3\n")
>         for i in xrange(total_lines):
>             csv_file.write("value A" + str(i) + "|" + encoded + "|value C" + str(i) + "\n")
>
> if __name__ == "__main__":
>     write_utf16le_encoded_psv()
> {noformat}
> then:
> tar zcvf test.psv.tar.gz test.psv


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
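For reference, the double-byte behavior above can be reproduced without the proprietary data. The following is a minimal sketch, not the reporter's original script: it follows the same file layout (ASCII columns with one UTF-16-LE column, written to test.psv), but targets Python 3 instead of Python 2 and adds a plain-gzip step (.gz, not tar.gz, matching what the commenter tested); the function names are my own.

```python
import gzip

def write_mixed_encoding_psv(path="test.psv"):
    # Middle column is UTF-16-LE; the rest of each line is ASCII,
    # mirroring the layout of the reporter's script.
    encoded = "Encoded B".encode("utf-16-le")
    with open(path, "wb") as f:
        f.write(b"header 1|header 2|header 3\n")
        for i in range(10):
            f.write(b"value A" + str(i).encode("ascii") + b"|" + encoded
                    + b"|value C" + str(i).encode("ascii") + b"\n")

def gzip_file(path):
    # Plain gzip (not tar.gz), which Drill reads transparently.
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        dst.write(src.read())

write_mixed_encoding_psv("test.psv")
gzip_file("test.psv")

# A reader that treats the bytes as single-byte characters sees every
# UTF-16-LE code unit as "char + NUL", which is exactly the
# E\u0000n\u0000c\u0000... pattern in the query output above:
# "Encoded B".encode("utf-16-le").decode("latin-1")
# == 'E\x00n\x00c\x00o\x00d\x00e\x00d\x00 \x00B\x00'
```

The interleaved NUL bytes explain both symptoms in the description: a single-byte text reader renders each character as two, and substring(1,4) over the raw bytes covers only the first two logical characters.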