[jira] [Created] (HIVE-5506) Hive SPLIT function does not return array correctly
John Omernik created HIVE-5506: -- Summary: Hive SPLIT function does not return array correctly Key: HIVE-5506 URL: https://issues.apache.org/jira/browse/HIVE-5506 Project: Hive Issue Type: Bug Components: SQL, UDF Affects Versions: 0.11.0, 0.10.0, 0.9.0 Environment: Hive Reporter: John Omernik Hello all, I think I have outlined a bug in the hive split function: Summary: When calling split on a string of data, it will only return all array items if the the last array item has a value. For example, if I have a string of text delimited by tab with 7 columns, and the first four are filled, but the last three are blank, split will only return a 4 position array. If any number of "middle" columns are empty, but the last item still has a value, then it will return the proper number of columns. This was tested in Hive 0.9 and hive 0.11. Data: (Note \t represents a tab char, \x09 the line endings should be \n (UNIX style) not sure what email will do to them). Basically my data is 7 lines of data with the first 7 letters separated by tab. On some lines I've left out certain letters, but kept the number of tabs exactly the same. input.txt a\tb\tc\td\te\tf\tg a\tb\tc\td\te\t\tg a\tb\t\td\t\tf\tg \t\t\td\te\tf\tg a\tb\tc\td\t\t\t a\t\t\t\te\tf\tg a\t\t\td\t\t\tg I then created a table with one column from that data: DROP TABLE tmp_jo_tab_test; CREATE table tmp_jo_tab_test (message_line STRING) STORED AS TEXTFILE; LOAD DATA LOCAL INPATH '/tmp/input.txt' OVERWRITE INTO TABLE tmp_jo_tab_test; Ok just to validate I created a python counting script: #!/usr/bin/python import sys for line in sys.stdin: line = line[0:-1] out = line.split("\t") print len(out) The output there is : $ cat input.txt |./cnt_tabs.py 7 7 7 7 7 7 7 Based on that information, split on tab should return me 7 for each line as well: hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;" 7 7 7 7 4 7 7 However it does not. It would appear that the line where only the first four letters are filled in(and blank is passed in on the last three) only returns 4 splits, where there should technically be 7, 4 for letters included, and three blanks. a\tb\tc\td\t\t\t -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive
[ https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667142#comment-13667142 ] John Omernik commented on HIVE-4070: A couple of points to consider: 1. " If we change the default that would change the results current users are getting." I am in absolute agreement here. I do not wish to change the default, the horse has left the gate. If I have referenced that in my previous posts, it's in the context of other ideas etc. My issue with the current default isn't the choice of what it is, as much as there is no way to change it on a case by base basis. 2. Paragraph 2: Those are solid points based on risks to the ongoing maintenance of Hive. That said, most Relational Databases, including MySQL have the option to change collation for given databases. True it may not be global setting (although the collation default is a global setting), but rather a per database setting. I.e. in MySQL you can have one database use a case sensitive collation and another use a case insensitive collation. Perhaps hive-site isn't the place for this, but metadata is (Set it as a per database or table setting?) 3. For the most part, we model functionality in hive based on what mysql does - Because of Point 1, this is moot, but by default, MySQL is case insensitive. 4. Most users do not want to have to heavily test before upgrade, they want consistent behaviour between versions. - This is exactly why I think we need an option for administrators of Hive to be able to set the case sensitivity on database/table level, and the default case sensitivity at a global level. Most users do not test. From a philosophical (data philosopher?) point of view, users will not test, therefore if failure is to happen because of lack of testing, please fail with pomp and circumstance. I.e. Both lines of users (MySQL/MSSQL Migrating to Hive and Oracle/Postgres Migrating to Hive) are going to make assumptions when they write their queries that will cause a failure to happen. If LIKE is Case Sensitive as it is, Oracle/Postgres users will assume correctly and all will be well. MySQL/Users will assume wrong, but they won't KNOW they assumed wrong, they will just assume no results (given a lack of testing). The converse is actually the preferred model: If LIKE is not case sensitive, then Oracle/Postgres users will assume incorrectly, they will run a query, and they will get their expected results, but will also get extra results allowing them to understand they assumed wrong. MySQL/MSSQL users will assume correctly and get their results. Going back to Point 1, the horse has left the building, we can't change default because of what it would do to the current user base, thus the next best option is to allow administrators to set that so users don't have to deal with it, document it so good administrators can handle it out of the gate, and to find a way to do it like '%MysQL%' (i.e. similar to collation settings on databases rather then a global setting as described previously). I do see the challenges with a global setting affecting how a function works; no precedence for that, and thus it could introduce risk, but I still hold that the current risk, especially on a system that is touted as modeled after MySQL (that's how it was explained to me, thus I assumed LIKE was case insensitive), to users is high. As I was typing this novelette, I realized I tossed out an idea related to per database settings. Thoughts on that? That is similar to how MySQL handles it, and thus follows the models mentioned while avoiding a global setting that affects the behavior of a UDF. (The setting doesn't change the UDF, the "collation equivalent" setting on the databases does, and thus the global setting is just the default collation equivalent. > Like operator in Hive is case sensitive while in MySQL (and most likely other > DBs) it's case insensitive > > > Key: HIVE-4070 > URL: https://issues.apache.org/jira/browse/HIVE-4070 > Project: Hive > Issue Type: Bug > Components: UDF >Affects Versions: 0.10.0 >Reporter: Mark Grover >Assignee: Mark Grover >Priority: Trivial > > Hive's like operator seems to be case sensitive. > See > https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164 > However, MySQL's like operator is case insensitive. I don't have other DB's > (like PostgreSQL) installed and handy but I am guessing their LIKE is case > insensitive as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA
[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive
[ https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666844#comment-13666844 ] John Omernik commented on HIVE-4070: Edward and Mark: Can you please explain risky (from a programming PoV)in the situations you are talking about? I am not a programmer, so I am looking at this from both and end-user and from the point of view of an organization that is going to implement Hive to assist with business decisions. When I see the two risks, i.e. my understanding of the programming risk (Which I am admitting I have very little understanding, hence the asking for the explanation) vs. the implementation risk i.e. organizations that are migrating to Hive, I worry that those decisions, which are based on false negatives because of the current choices (case sensitivity), could be very bad from a business/implementation risk standpoint. Please do not take my persistent questioning as being combative, I am just trying to look at this purely from a user/organization point of view. I.e. training people to do things different, the results from poor training or employee turnover, etc causing these false negatives on data. There has to be an elegant way to give the organization control over this... even documenting such a setting for administrators is nice, but it still relies on user training/choices to ensure business decisions are not based on false negatives. If we had a setting we could change, we could point out the difference, and give users/org and option that fits with each org's use case. As I said, just trying to understand about the risk on the programming side, I see the risk, as a power user, administrator etc as having a high likelihood of occurring, and the impact of said risk (false negatives) as being high as well. That said, I defer to you guys on the programming side. > Like operator in Hive is case sensitive while in MySQL (and most likely other > DBs) it's case insensitive > > > Key: HIVE-4070 > URL: https://issues.apache.org/jira/browse/HIVE-4070 > Project: Hive > Issue Type: Bug > Components: UDF >Affects Versions: 0.10.0 >Reporter: Mark Grover >Assignee: Mark Grover >Priority: Trivial > > Hive's like operator seems to be case sensitive. > See > https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164 > However, MySQL's like operator is case insensitive. I don't have other DB's > (like PostgreSQL) installed and handy but I am guessing their LIKE is case > insensitive as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive
[ https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666713#comment-13666713 ] John Omernik commented on HIVE-4070: They may be a bit scary, but from what perspective? The programming of the system or the users/operators of the data warehouse that need consistency in results? The scary thing about the LIKE operator and it not being able to be controlled in it's behavior is ensuring that assumptions by users don't lead to false negative hits. I know personally, I came from MySQl/MSSQL and I had that happened to me until I figured out the difference. It isn't well documented, I can't control it, it bit me. Ok, no problem, how do I fix it now in my environment, most of my users are from that same background, so now I have to do training to change this? How effective will that be? Will people forget? Will I get false negatives? A "mlike" may be an option, however, I am not sure; it still requires the user to remember there is a change, is mlike better than other work arounds? Not if the user forgets to use it and in the case of like, the end result is false negatives(user forgets, no results, moves on not realizing they are missing something). If a global option isn't available, could like be case insensitive and an olike function (oracle like) be added? I will say that while false positives are a pain, at least you can look at the results and say "wait, why isn't this case sensitive?" and you know something is different than what you are used to (i.e. Oracle, hence olike). You see and then understand there is a discrepancy. With mlike and case sensitivity being the default, you don't have any indicator, because unexpected case sensitivity produces false negatives, unexpected case insensitivity produces false positives. In analytics false positives are far preferred than false negatives. Just brainstorming here, what if we implemented mlike AND olike (my/ms SQL like and oracle like) (or cslike and cilike (case sensitive and case insensitive)) Then we have the like function just be a link to mlike or olike. Since Hive started out with case sensitivity, we could have like link to olike, but it can be changed to mlike in the hive-site. The point being, absence of specification, default to normal (LIKE = OLIKE Case sensitive) hive.like.insensitive = true make it MLIKE. Simple, doesn't change the default hive setting/behavior (out of the box), gives the option of global setting, and from a programing standpoint we are doing an if test on what LIKE links to, and we'd have olike and mlike available as UDFs when a user wanted to do the opposite of what the default setting is. > Like operator in Hive is case sensitive while in MySQL (and most likely other > DBs) it's case insensitive > > > Key: HIVE-4070 > URL: https://issues.apache.org/jira/browse/HIVE-4070 > Project: Hive > Issue Type: Bug > Components: UDF >Affects Versions: 0.10.0 >Reporter: Mark Grover >Assignee: Mark Grover >Priority: Trivial > > Hive's like operator seems to be case sensitive. > See > https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164 > However, MySQL's like operator is case insensitive. I don't have other DB's > (like PostgreSQL) installed and handy but I am guessing their LIKE is case > insensitive as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive
[ https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666278#comment-13666278 ] John Omernik commented on HIVE-4070: What about this? Leave it case sensitive as it is, but provide a HIVE Variable that allows it to be set globally. That way, an administrator can set the behaivior of LIKE based on what their users are migrating in from. I.e. if MSSQL or MYSQL then set it to be case insensitive, else use the default of case sensitive. The issue here is one of transitioning and potential false negatives because of the assumption. Correct, you can't be consistent with every SQL implementation out there, however, I think there is some precedence for being LIKE '%mySql%'(see what I did there) which is case insensitive by default. > Like operator in Hive is case sensitive while in MySQL (and most likely other > DBs) it's case insensitive > > > Key: HIVE-4070 > URL: https://issues.apache.org/jira/browse/HIVE-4070 > Project: Hive > Issue Type: Bug > Components: UDF >Affects Versions: 0.10.0 >Reporter: Mark Grover >Assignee: Mark Grover >Priority: Trivial > > Hive's like operator seems to be case sensitive. > See > https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164 > However, MySQL's like operator is case insensitive. I don't have other DB's > (like PostgreSQL) installed and handy but I am guessing their LIKE is case > insensitive as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (HIVE-3979) Provide syntax for unescaped regex on rlike, and other regexp_* functions
John Omernik created HIVE-3979: -- Summary: Provide syntax for unescaped regex on rlike, and other regexp_* functions Key: HIVE-3979 URL: https://issues.apache.org/jira/browse/HIVE-3979 Project: Hive Issue Type: New Feature Components: UDF Affects Versions: 0.10.0 Environment: All Queries, JDBC and cli Reporter: John Omernik Fix For: 0.11.0 To make hive more intuitive, can we provide a syntax for hive functions that when encompassing a regex, that regex doesn't have to the standard hive double escape? For example column rlike '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}' could become column rlike '/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/' In this case, just using PERL's regex brackets, but could be anything... The point is to be able to take a standard JAVA/Perl type regex and cut and paste it into a query. This would allow for faster query development, more mainstream adoption by those with regex experience, and intuitive transition to hive query writing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira