[jira] [Created] (HIVE-5506) Hive SPLIT function does not return array correctly

2013-10-09 Thread John Omernik (JIRA)
John Omernik created HIVE-5506:
--

 Summary: Hive SPLIT function does not return array correctly
 Key: HIVE-5506
 URL: https://issues.apache.org/jira/browse/HIVE-5506
 Project: Hive
  Issue Type: Bug
  Components: SQL, UDF
Affects Versions: 0.11.0, 0.10.0, 0.9.0
 Environment: Hive
Reporter: John Omernik


Hello all, I think I have outlined a bug in the hive split function:

Summary: When calling split on a string of data, it will only return all array 
items if the the last array item has a value. For example, if I have a string 
of text delimited by tab with 7 columns, and the first four are filled, but the 
last three are blank, split will only return a 4 position array. If  any number 
of "middle" columns are empty, but the last item still has a value, then it 
will return the proper number of columns.  This was tested in Hive 0.9 and hive 
0.11. 

Data:
(Note \t represents a tab char, \x09 the line endings should be \n (UNIX style) 
not sure what email will do to them).  Basically my data is 7 lines of data 
with the first 7 letters separated by tab.  On some lines I've left out certain 
letters, but kept the number of tabs exactly the same.  

input.txt
a\tb\tc\td\te\tf\tg
a\tb\tc\td\te\t\tg
a\tb\t\td\t\tf\tg
\t\t\td\te\tf\tg
a\tb\tc\td\t\t\t
a\t\t\t\te\tf\tg
a\t\t\td\t\t\tg

I then created a table with one column from that data:


DROP TABLE tmp_jo_tab_test;
CREATE table tmp_jo_tab_test (message_line STRING)
STORED AS TEXTFILE;
 
LOAD DATA LOCAL INPATH '/tmp/input.txt'
OVERWRITE INTO TABLE tmp_jo_tab_test;

Ok just to validate I created a python counting script:

#!/usr/bin/python
 
import sys
 
 
for line in sys.stdin:
line = line[0:-1]
out = line.split("\t")
print len(out)

The output there is : 
$ cat input.txt |./cnt_tabs.py
7
7
7
7
7
7
7

Based on that information, split on tab should return me 7 for each line as 
well:

hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
 
7
7
7
7
4
7
7

However it does not.  It would appear that the line where only the first four 
letters are filled in(and blank is passed in on the last three) only returns 4 
splits, where there should technically be 7, 4 for letters included, and three 
blanks.  

a\tb\tc\td\t\t\t 




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive

2013-05-25 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667142#comment-13667142
 ] 

John Omernik commented on HIVE-4070:


A couple of points to consider:
1. " If we change the default that would change the results current users are 
getting." I am in absolute agreement here.  I do not wish to change the 
default, the horse has left the gate. If I have referenced that in my previous 
posts, it's in the context of other ideas etc.  My issue with the current 
default isn't the choice of what it is, as much as there is no way to change it 
on a case by base basis. 

2. Paragraph 2: Those are solid points based on risks to the ongoing 
maintenance of Hive. That said, most Relational Databases, including MySQL have 
the option to change collation for given databases. True it may not be global 
setting (although the collation default is a global setting), but rather a per 
database setting. I.e. in MySQL you can have one database use a case sensitive 
collation and another use a case insensitive collation. Perhaps hive-site isn't 
the place for this, but metadata is (Set it as a per database or table 
setting?)  

3. For the most part, we model functionality in hive based on what mysql does - 
Because of Point 1, this is moot, but by default, MySQL is case insensitive. 

4.  Most users do not want to have to heavily test before upgrade, they want 
consistent behaviour between versions. - This is exactly why I think we need an 
option for administrators of Hive to be able to set the case sensitivity on 
database/table level, and the default case sensitivity at a global level.  Most 
users do not test.  From a philosophical (data philosopher?) point of view, 
users will not test, therefore if failure is to happen because of lack of 
testing, please fail with pomp and circumstance.  

I.e. Both lines of users (MySQL/MSSQL Migrating to Hive and Oracle/Postgres 
Migrating to Hive) are going to make assumptions when they write their queries 
that will cause a failure to happen.  If LIKE is Case Sensitive as it is, 
Oracle/Postgres users will assume correctly and all will be well.  MySQL/Users 
will assume wrong, but they won't KNOW they assumed wrong, they will just 
assume no results (given a lack of testing).   The converse is actually the 
preferred model: If LIKE is not case sensitive, then Oracle/Postgres users will 
assume incorrectly, they will run a query, and they will get their expected 
results, but will also get extra results allowing them to understand they 
assumed wrong. MySQL/MSSQL users will assume correctly and get their results.   

Going back to Point 1, the horse has left the building, we can't change default 
because of what it would do to the current user base, thus the next best option 
is to allow administrators to set that so users don't have to deal with it, 
document it so good administrators can handle it out of the gate, and to find a 
way to do it like '%MysQL%' (i.e. similar to collation settings on databases 
rather then a global setting as described previously).  

I do see the challenges with a global setting affecting how a function works; 
no precedence for that, and thus it could introduce risk, but I still hold that 
the current risk, especially on a system that is touted as modeled after MySQL 
(that's how it was explained to me, thus I assumed LIKE was case insensitive), 
to users is high.  As I was typing this novelette, I realized I tossed out an 
idea related to per database settings.  Thoughts on that? That is similar to 
how MySQL handles it, and thus follows the models mentioned while avoiding a 
global setting that affects the behavior of a UDF.  (The setting doesn't change 
the UDF, the "collation equivalent" setting on the databases does, and thus the 
global setting is just the default collation equivalent. 

> Like operator in Hive is case sensitive while in MySQL (and most likely other 
> DBs) it's case insensitive
> 
>
> Key: HIVE-4070
> URL: https://issues.apache.org/jira/browse/HIVE-4070
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Affects Versions: 0.10.0
>Reporter: Mark Grover
>Assignee: Mark Grover
>Priority: Trivial
>
> Hive's like operator seems to be case sensitive.
> See 
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164
> However, MySQL's like operator is case insensitive. I don't have other DB's 
> (like PostgreSQL) installed and handy but I am guessing their LIKE is case 
> insensitive as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA

[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive

2013-05-24 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666844#comment-13666844
 ] 

John Omernik commented on HIVE-4070:


Edward and Mark: Can you please explain risky (from a programming PoV)in the 
situations you are talking about?  I am not a programmer, so I am looking at 
this from both and end-user and from the point of view of an organization that 
is going to implement Hive to assist with business decisions. 

When I see the two risks, i.e. my understanding of the programming risk (Which 
I am admitting I have very little understanding, hence the asking for the 
explanation) vs. the implementation risk i.e. organizations that are migrating 
to Hive, I worry that those decisions, which are based on false negatives 
because of the current choices (case sensitivity), could be very bad from a 
business/implementation risk standpoint. 

Please do not take my persistent questioning as being combative, I am just 
trying to look at this purely from a user/organization point of view. I.e. 
training people to do things different, the results from poor training or 
employee turnover, etc causing these false negatives on data.  There has to be 
an elegant way to give the organization control over this... even documenting 
such a setting for administrators is nice, but it still relies on user 
training/choices to ensure business decisions are not based on false negatives. 
If we had a setting we could change, we could point out the difference, and 
give users/org and option that fits with each org's use case. 

As I said, just trying to understand about the risk on the programming side, I 
see the risk, as a power user, administrator etc as having a high likelihood of 
occurring, and the impact of said risk (false negatives) as being high as well. 
 That said, I defer to you guys on the programming side. 

> Like operator in Hive is case sensitive while in MySQL (and most likely other 
> DBs) it's case insensitive
> 
>
> Key: HIVE-4070
> URL: https://issues.apache.org/jira/browse/HIVE-4070
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Affects Versions: 0.10.0
>Reporter: Mark Grover
>Assignee: Mark Grover
>Priority: Trivial
>
> Hive's like operator seems to be case sensitive.
> See 
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164
> However, MySQL's like operator is case insensitive. I don't have other DB's 
> (like PostgreSQL) installed and handy but I am guessing their LIKE is case 
> insensitive as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive

2013-05-24 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666713#comment-13666713
 ] 

John Omernik commented on HIVE-4070:


They may be a bit scary, but from what perspective? The programming of the 
system or the users/operators of the data warehouse that need consistency in 
results?  The scary thing about the LIKE operator and it not being able to be 
controlled in it's behavior is ensuring that assumptions by users don't lead to 
false negative hits.  I know personally, I came from MySQl/MSSQL and I had that 
happened to me until I figured out the difference. It isn't well documented, I 
can't control it, it bit me. Ok, no problem, how do I fix it now in my 
environment, most of my users are from that same background, so now I have to 
do training to change this? How effective will that be? Will people forget? 
Will I get false negatives? 

A "mlike" may be an option, however, I am not sure; it still requires the user 
to remember there is a change, is mlike better than other work arounds? Not if 
the user forgets to use it and in the case of like, the end result is false 
negatives(user forgets, no results, moves on not realizing they are missing 
something).  

If a global option isn't available, could like be case insensitive and an olike 
function (oracle like) be added? I will say that while false positives are a 
pain, at least you can look at the results and say "wait, why isn't this case 
sensitive?" and you know something is different than what you are used to (i.e. 
Oracle, hence olike).  You see and then understand there is a discrepancy. With 
mlike and case sensitivity being the default, you don't have any indicator, 
because unexpected case sensitivity produces false negatives, unexpected case 
insensitivity produces false positives. In analytics false positives are far 
preferred than false negatives.  

Just brainstorming here, what if we implemented mlike AND olike (my/ms SQL like 
and oracle  like) (or cslike and cilike (case sensitive and case insensitive)) 
Then we have the like function just be a link to mlike or olike.  Since Hive 
started out with case sensitivity, we could have like link to olike, but it can 
be changed to mlike in the hive-site.  The point being, absence of 
specification, default to normal (LIKE = OLIKE Case sensitive) 
hive.like.insensitive = true make it MLIKE.  Simple, doesn't change the default 
hive setting/behavior (out of the box), gives the option of global setting, and 
from a programing standpoint we are doing an if test on what LIKE links to, and 
we'd have olike and mlike available as UDFs when a user wanted to do the 
opposite of what the default setting is. 




> Like operator in Hive is case sensitive while in MySQL (and most likely other 
> DBs) it's case insensitive
> 
>
> Key: HIVE-4070
> URL: https://issues.apache.org/jira/browse/HIVE-4070
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Affects Versions: 0.10.0
>Reporter: Mark Grover
>Assignee: Mark Grover
>Priority: Trivial
>
> Hive's like operator seems to be case sensitive.
> See 
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164
> However, MySQL's like operator is case insensitive. I don't have other DB's 
> (like PostgreSQL) installed and handy but I am guessing their LIKE is case 
> insensitive as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4070) Like operator in Hive is case sensitive while in MySQL (and most likely other DBs) it's case insensitive

2013-05-24 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13666278#comment-13666278
 ] 

John Omernik commented on HIVE-4070:


What about this? Leave it case sensitive as it is, but provide a HIVE Variable 
that allows it to be set globally. That way, an administrator can set the 
behaivior of LIKE based on what their users are migrating in from. I.e. if 
MSSQL or MYSQL  then set it to be case insensitive, else use the default of 
case sensitive. The issue here is one of transitioning and potential false 
negatives because of the assumption. Correct, you can't be consistent with 
every SQL implementation out there, however, I think there is some precedence 
for being LIKE '%mySql%'(see what I did there) which is case insensitive by 
default. 

> Like operator in Hive is case sensitive while in MySQL (and most likely other 
> DBs) it's case insensitive
> 
>
> Key: HIVE-4070
> URL: https://issues.apache.org/jira/browse/HIVE-4070
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Affects Versions: 0.10.0
>Reporter: Mark Grover
>Assignee: Mark Grover
>Priority: Trivial
>
> Hive's like operator seems to be case sensitive.
> See 
> https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFLike.java#L164
> However, MySQL's like operator is case insensitive. I don't have other DB's 
> (like PostgreSQL) installed and handy but I am guessing their LIKE is case 
> insensitive as well.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (HIVE-3979) Provide syntax for unescaped regex on rlike, and other regexp_* functions

2013-02-03 Thread John Omernik (JIRA)
John Omernik created HIVE-3979:
--

 Summary: Provide syntax for unescaped regex on rlike, and other 
regexp_* functions
 Key: HIVE-3979
 URL: https://issues.apache.org/jira/browse/HIVE-3979
 Project: Hive
  Issue Type: New Feature
  Components: UDF
Affects Versions: 0.10.0
 Environment: All Queries, JDBC and cli 
Reporter: John Omernik
 Fix For: 0.11.0


To make hive more intuitive, can we provide a syntax for hive functions that 
when encompassing a regex, that regex doesn't have to the standard hive double 
escape? For example

column rlike '\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}'
could become
column rlike '/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/'

In this case, just using PERL's regex brackets, but could be anything...

The point is to be able to take a standard JAVA/Perl type regex and cut and 
paste it into a query. This would allow for faster query development, more 
mainstream adoption by those with regex experience, and intuitive transition to 
hive query writing. 


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira