[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

2017-10-17 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/936
  
@paul-rogers 
agree with you that charsets used in saffron properties should be defaulted 
in Drill to `UTF-8` since Drill can read UTF-8 data and it's strange that it 
would fail by default when Calcite will attempt to parse string into literal 
used in query.

I have looked into Calcite code and there is no option to hard-code charset 
values for Calcite but charset can be changed using properties.
There are two options of setting saffron properties:
1. as system property;
2. using `saffron.properties` file.

I don't really like passing them as `-D` when starting the drillbit 9since 
there are at least two), so I am more inclined to use `saffron.properties` 
file. Unfortunately, in Calcite code `saffron.properties` location is expected 
to be working folder [1], i.e. the place where java process was started. I have 
created Jira and pull request in Calcite to allow `saffron.properties` to be 
present in classpath since it's more convenient [2]. I'll keep you updated on 
Calcite community feedback.

[1] 
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/util/SaffronProperties.java#L113

[2] https://issues.apache.org/jira/browse/CALCITE-2014


---


[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

2017-09-13 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/936
  
@arina-ielchiieva, my comments boil down to two points:

1. What happens if UTF-8 support in Drill is disabled? Do we have 
consistent handling of character data? Do we revert to ASCII? The character set 
on the Drillbit? How is this reconciled with the character set on the Web and 
the client?
2. On the other hand, if Drill handles only UTF-8 (with bugs), does it make 
sense to disable that support if the alternative is undefined or more broken 
than the UTF-8 support?

In short: shouldn't UTF-8 be the hard-coded by Drill when working with 
Calcite to avoid these ambiguities? 


---


[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

2017-09-12 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/936
  
@arina-ielchiieva, thanks for the explanation. Drill's runtime framework 
assumes that data is either:

1. ASCII (or, at least, single-byte character set based on ASCII), or
2. UTF-8 (data is converted to/from UTF-8 when converting VarChar to a Java 
String.)

Since Drill's code seems to assume ASCII (when it cares about character 
format), then one could claim that Drill does not have an encoding: it just 
treats characters as bytes. But, things such as determining string length, 
doing pattern matching, and so on must be aware of the character set -- if only 
to know which bytes are continuations of a multi-byte character. (That is, a 
three-byte sequence in UTF-8 might be one, two or three characters, depending.)

Now, if the planner assumes ISO-8859-1, but the Drill execution engine 
assumes UTF-8, then string constants passed from one to the other can become 
corrupted in the case where a particular byte sequence in ISO-8859-1 represents 
a different character than that same byte sequence in UTF-8.

Where would this occur? Look at the 
[ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) definition. ISO-8859 
is a single-byte character set with meanings associated to the bytes in the 
range 0x40 to 0x7f. But, in UTF-8, the high bit indicates a prefix character. 
So, 0xF7 is a valid single-byte character in ISO-8859, but is a lead-in 
character in UTF-8.

The point here is that setting the character set would seem to be a global 
setting. If the Saffron setting is purely for the parser (how to interpret 
incoming text), and the parser always produces Java strings in UTF-16 (which 
are then encoded into UTF-8 for execution), then we're fine.

But, if the parser encoding is written as bytes sent to the execution 
engine, we're in trouble.

Further, Drill has a web UI. The typical web character set is UTF-8, so 
queries coming from the web UI are encoded in UTF-8.

All this suggests two things:

1. Drill should either always accept UTF-8 (the Saffron property should 
always be set.) or
2. The property is specified by the client and used to decode the bytes 
within a Protobuf message to produce a character stream given to the parser.

It appears that UTF-8 is the default Protobuf String type encoding; sender 
and receiver would have to agree on another format. Does Drill have such an RPC 
property? I've not seen it, but I'm not an expert.

In short, if this change ensures that the parser *always* uses UTF-8, then 
this is good. If the character encoding is an option, then we have to consider 
all the above issues to have a fully working, end-to-end solution.


---


[GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...

2017-09-12 Thread arina-ielchiieva
Github user arina-ielchiieva commented on the issue:

https://github.com/apache/drill/pull/936
  
@paul-rogers, the unit tests just show which influence saffron property has 
on Drill (since people in the community where asking how to enable support 
UTF-8 in Drill), a long with this PR we'll also add description to Drill 
documentation. 

So far Drill relies on Calcite saffron property to determine which charset 
it supports. By default, it's ISO-8859-1. So currently if customer wants to 
query UTF-8 data in Drill, he will get an error (one of the unit tests shows 
it) and needs to override saffron property to proceed. I guess, we don't 
support UTF-8 by default since there are some issues and Drill did not fully 
tested UTF-8 support.


---