[jira] [Created] (HIVE-21955) SearchArgumentImpl generates wrong ExpressionTree in some cases which might result in loss of data

2019-07-04 Thread Zihao Ye (JIRA)
Zihao Ye created HIVE-21955:
---

 Summary: SearchArgumentImpl generates wrong ExpressionTree in some 
cases which might result in loss of data 
 Key: HIVE-21955
 URL: https://issues.apache.org/jira/browse/HIVE-21955
 Project: Hive
  Issue Type: Bug
  Components: Hive, ORC
Reporter: Zihao Ye


ExpressionBuilder applies `pushDownNot`, `foldMaybe`, `flatten`, 
`convertToCNF`, `flatten` and `buildLeafList`, in that order, to convert a 
non-normalized expression into a CNF expression with unique leaves.

After an expression is converted to CNF, the expression tree may contain 
non-leaf nodes that are the very same object referenced from more than one 
parent. When this happens, `buildLeafList` visits those non-leaf nodes more 
than once, and as a result a wrong ExpressionTree is generated.
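
To illustrate how a repeated visit can corrupt a tree, here is a minimal sketch. It is not Hive's code: `Node`, `remapNaive`, `remapOnce` and the remap table are hypothetical, and it assumes the traversal rewrites leaf indices in place. The tree is AND(OR(a, NOT c), OR(b, NOT c)) with a single shared NOT object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an expression tree node; not Hive's ExpressionTree.
final class Node {
    final String op;                          // "AND", "OR", "NOT" or "LEAF"
    int leaf = -1;                            // leaf index, used only when op is "LEAF"
    final List<Node> children = new ArrayList<>();

    Node(String op, Node... kids) {
        this.op = op;
        children.addAll(Arrays.asList(kids));
    }

    static Node leaf(int index) {
        Node n = new Node("LEAF");
        n.leaf = index;
        return n;
    }
}

final class SharedNodeDemo {
    // Naive in-place remap: a node reachable through two parents is rewritten twice.
    static void remapNaive(Node n, int[] remap) {
        if (n.op.equals("LEAF")) {
            n.leaf = remap[n.leaf];
            return;
        }
        for (Node child : n.children) {
            remapNaive(child, remap);
        }
    }

    // Same remap, but an identity-based visited set skips objects already rewritten.
    static void remapOnce(Node n, int[] remap, Set<Node> seen) {
        if (!seen.add(n)) {
            return;
        }
        if (n.op.equals("LEAF")) {
            n.leaf = remap[n.leaf];
            return;
        }
        for (Node child : n.children) {
            remapOnce(child, remap, seen);
        }
    }

    // Builds AND(OR(a, NOT c), OR(b, NOT c)) with one shared NOT object, remaps
    // the leaves, and returns the resulting index of leaf c.
    static int remappedSharedLeaf(boolean visitOnce) {
        Node c = Node.leaf(2);
        Node shared = new Node("NOT", c);
        Node root = new Node("AND",
            new Node("OR", Node.leaf(0), shared),
            new Node("OR", Node.leaf(1), shared));
        int[] remap = {1, 2, 0};              // old leaf index -> new leaf index
        if (visitOnce) {
            Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
            remapOnce(root, remap, seen);
        } else {
            remapNaive(root, remap);
        }
        return c.leaf;
    }

    public static void main(String[] args) {
        System.out.println("naive:      " + remappedSharedLeaf(false));
        System.out.println("visit once: " + remappedSharedLeaf(true));
    }
}
```

In this sketch the naive traversal remaps leaf c twice (2 to 0, then 0 to 1), while the identity-tracked traversal leaves it at the intended index 0. Whether the actual fix should track visited nodes or avoid sharing objects during CNF conversion is left open here.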



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HIVE-21486) FinalSelectOps is empty in lineage index if there is a script operator(transform)

2019-03-21 Thread Zihao Ye (JIRA)
Zihao Ye created HIVE-21486:
---

 Summary: FinalSelectOps is empty in lineage index if there is a 
script operator(transform)
 Key: HIVE-21486
 URL: https://issues.apache.org/jira/browse/HIVE-21486
 Project: Hive
  Issue Type: Bug
  Components: lineage
Affects Versions: 2.3.4, 2.1.1
Reporter: Zihao Ye


SQL pattern:

create table t1 as select transform(c1) using '/bin/python script.py' as (c2) 
from t2;

The lineage dependencies are correct, but the SelectOperator is not added to 
finalSelectOps in the lineage Index, so index.getDependencies(finalSelOp) 
returns null in this case.





[jira] [Created] (HIVE-20912) Output data might be duplicated while speculation is enabled

2018-11-14 Thread Zihao Ye (JIRA)
Zihao Ye created HIVE-20912:
---

 Summary: Output data might be duplicated while speculation is 
enabled
 Key: HIVE-20912
 URL: https://issues.apache.org/jira/browse/HIVE-20912
 Project: Hive
  Issue Type: Bug
  Components: Hive, Operators
Affects Versions: 1.2.1
 Environment: Hive 1.2.1, Hadoop 2.7.3, Tez 0.7.0
Reporter: Zihao Ye
 Attachments: image-2018-11-14-17-48-59-826.png, 
image-2018-11-14-17-53-13-191.png, image-2018-11-14-17-53-50-171.png, 
image-2018-11-14-19-28-18-924.png

The file merge stage had two tasks, which should have created two files, but 
three files were created.

!image-2018-11-14-19-28-18-924.png!

By tracing the log, we found that two task attempts (one of them speculative) 
happened to finish within the same second. Although the later one received a 
kill signal from the AM, its rename operation had already completed by then, 
which caused the data duplication.

The rename is done in _AbstractFileMergeOperator.closeOp()_, where the final 
path name is determined by the task attempt id rather than the task id. In 
this case, the final paths ended with '00_0' and '00_1' rather than '00'. 
IMHO, if the final path name ended with the task id, without the task attempt 
id, one task could generate at most one file, which would solve this issue. 
However, I don't know the side effects of changing the final path name.

This issue also affects other operators related to file renaming like 
JoinOperator and FileSinkOperator.
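
The naming proposal above can be sketched as follows. This is only an illustration under stated assumptions: `FinalPathNaming`, `taskIdOf` and `commit` are hypothetical names, a Set stands in for the output directory, the "_<attempt>" suffix format is inferred from the '00_0' / '00_1' names in this report, and '01_0' is an invented name for the second task. It is not Hive's rename logic.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class FinalPathNaming {
    // Hypothetical helper: drop a trailing "_<attempt>" suffix, e.g. "00_1" -> "00".
    static String taskIdOf(String attemptName) {
        int idx = attemptName.lastIndexOf('_');
        return idx < 0 ? attemptName : attemptName.substring(0, idx);
    }

    // Simulates renaming each attempt's output into the final directory.
    // A Set stands in for the file system: equal names collapse into one file.
    static Set<String> commit(String[] attemptNames, boolean nameByTaskId) {
        Set<String> finalFiles = new LinkedHashSet<>();
        for (String name : attemptNames) {
            finalFiles.add(nameByTaskId ? taskIdOf(name) : name);
        }
        return finalFiles;
    }

    public static void main(String[] args) {
        // Two tasks; the first one has a speculative second attempt.
        String[] attempts = {"00_0", "00_1", "01_0"};
        System.out.println("by attempt id: " + commit(attempts, false));
        System.out.println("by task id:    " + commit(attempts, true));
    }
}
```

Naming by attempt id yields three final files for two tasks, as observed; naming by task id yields two. The sketch does not model which attempt wins when both try to rename onto the same final path, which is one of the side effects that would need checking.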

!image-2018-11-14-17-53-13-191.png!

!image-2018-11-14-17-53-50-171.png!


