Khurram Faraaz wrote:
... It looks like Drill processes
non-printable characters in both cases, with and without the new text
reader (exec.storage.enable_new_text_reader)
Should we throw an error, since these are non-printable characters?
No, I don't think so. Does there seem to be any need to reject non-printable
characters?
...
Content from the csv file used in test
1,^A
2,^B
3,^C
4,^D
5,^E
6,^F
0: jdbc:drill:schema=dfs.tmp> select * from `nonPrintables.csv`;
+-----------------+
| columns         |
+-----------------+
| ["1","\u0001"]  |
| ["2","\u0002"]  |
| ["3","\u0003"]  |
| ["4","\u0004"]  |
| ["5","\u0005"]  |
| ["6","\u0006"]  |
+-----------------+
6 rows selected (0.521 seconds)
0: jdbc:drill:schema=dfs.tmp> select columns[1] from `nonPrintables.csv`;
+---------+
| EXPR$0  |
+---------+
|        |
|        |
|        |
|        |
|        |
|        |
+---------+
6 rows selected (0.382 seconds)
Note what's going on there (re the difference between those two outputs):
In the first case, the strings with unprintable characters go through Drill's
conversion of a value of a complex type (e.g., VARCHAR ARRAY) to a JSON string
(in order to have a string to return through the JDBC API). That conversion
encodes string (VARCHAR) values as JSON string tokens, using JSON's escape
sequences for the unprintable characters. Finally, the resultant JSON string
(the whole string of JSON, not the JSON string token) is displayed by SQLLine
or the web UI or whatever. (And don't forget the step of your copying and
pasting into your message.)
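
To make that first case concrete, here is a minimal sketch of JSON string
escaping. It is not Drill's actual conversion code; the class and method names
(JsonEscapeSketch, toJsonToken, toJsonArray) are made up for illustration, and
it only assumes that the conversion follows JSON's usual escaping rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT Drill's implementation: shows how JSON escaping
// turns control characters into visible escape sequences.
class JsonEscapeSketch {

    // Encode one string value as a JSON string token (with surrounding quotes).
    static String toJsonToken(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters get the six-character
                        // escape form: backslash, 'u', four hex digits.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Encode a list of strings as one JSON array (the "whole string of JSON").
    static String toJsonArray(List<String> values) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(toJsonToken(values.get(i)));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Second element is the single control character U+0001 (Ctrl-A).
        System.out.println(toJsonArray(Arrays.asList("1", "\u0001")));
    }
}
```

Running main prints ["1","\u0001"], the same visible form as the first row of
the first query's output: the control character has been replaced by printable
characters before SQLLine ever sees it.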
In the second case, the core part of Drill is directly returning the character
strings from the data through the JDBC API. Then, SQLLine or the web UI or
whatever is deciding how to display those strings--including how to handle any
special, e.g., unprintable, characters. Evidently, SQLLine doesn't render
unprintable characters into some visible form. It probably just writes them to
your terminal's output stream. Since your terminal doesn't render them
specially either, the characters still aren't visible, and when you copied and
pasted to compose your e-mail message, there was nothing from those special
characters to copy.
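
One side effect of writing the characters through unchanged is on column
alignment. The sketch below assumes (I have not checked SQLLine's source) that
a client pads each cell by counting characters in the string; the names
CellPaddingSketch, cell, and visibleLength are hypothetical. An unprintable
character counts toward the padding but occupies no column on the terminal.

```java
// Hypothetical illustration, NOT SQLLine's code: padding by String.length()
// treats a control character as one column even though nothing is drawn.
class CellPaddingSketch {

    // Format one table cell, padding the value with spaces up to 'width'.
    static String cell(String value, int width) {
        StringBuilder sb = new StringBuilder("| ").append(value);
        for (int i = value.length(); i < width; i++) {
            sb.append(' ');
        }
        return sb.append(" |").toString();
    }

    // Count the characters a typical terminal would actually draw,
    // i.e., everything except ISO control characters.
    static int visibleLength(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isISOControl(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String header = cell("EXPR$0", 7);
        String row = cell("\u0001", 7);   // value is the single character Ctrl-A
        System.out.println(header.length() + " " + visibleLength(header));
        System.out.println(row.length() + " " + visibleLength(row));
    }
}
```

Both cells are 11 characters long, but only 10 characters of the second one
are visible, so its closing bar lands one column to the left.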
(Actually, the non-printable characters are slightly visible--note how the six lines with visually
blank values have terminating vertical-bar characters that don't line up with the other terminating
"+" or "|" characters.)
From the point of view of the core part of Drill, it's up to the client of the
JDBC API to decide how to display values, including character strings with
unprintable characters. (The JDBC API returns the Java representations (String
objects) of the VARCHAR values.)
However, from the point of view of users, SQLLine (and Drill's web UI too)
should render all values visibly, including character strings with unprintable
characters.
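
As a rough idea of what "render visibly" could mean on the client side, here is
a small sketch that rewrites control characters in caret notation (the same
^A, ^B, ... form used in the CSV listing above). The class and method names are
hypothetical, and caret notation is just one possible choice; a client could
equally well use escape sequences.

```java
// Hypothetical client-side helper, not part of SQLLine or Drill: makes
// control characters visible before a value is written to the terminal.
class VisibleTextSketch {

    static String toVisible(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c < 0x20) {
                // C0 controls: U+0001 becomes ^A, U+0002 becomes ^B, etc.
                sb.append('^').append((char) (c + 0x40));
            } else if (c == 0x7f) {
                sb.append("^?");               // DEL, by convention
            } else if (Character.isISOControl(c)) {
                // C1 controls (U+0080..U+009F): fall back to a hex escape.
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A client would apply this to each String it gets from the JDBC API.
        System.out.println(toVisible("\u0001"));   // prints ^A
    }
}
```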
(They should also render byte strings competently, e.g., rendering in hex the
bytes themselves rather than displaying in hex the hash code of the Java byte
array object that contains (a specific copy of) the bytes of the byte
string(!).)
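
To illustrate the byte-string complaint: calling toString() on a Java byte
array gives the array's type tag and identity hash code, not its contents. The
sketch below contrasts that with a hex rendering of the bytes; ByteRenderSketch
and toHex are made-up names, not anything in Drill or SQLLine.

```java
// Hypothetical illustration: default byte[] rendering versus hex of the bytes.
class ByteRenderSketch {

    // Render each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes don't sign-extend.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x01, (byte) 0xab, 0x7f};
        // Object.toString(): "[B@" followed by a hash code in hex; the value
        // varies from run to run and says nothing about the bytes.
        System.out.println(data.toString());
        // The bytes themselves.
        System.out.println(toHex(data));   // prints 01ab7f
    }
}
```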
Daniel
--
Daniel Barclay
MapR Technologies