> On Nov. 7, 2014, 9:42 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/27719/diff/2/?file=754739#file754739line228>
> >
> >     One thing I'm not clear on is why cloning the operator tree doesn't clone 
> > the missing stats flags. From Utilities.cloneOperatorTree(), it seems it 
> > should.
> 
> Na Yang wrote:
>     Hi Xuefu, the stats flags are set after cloneOperatorTree happens; they are 
> set in the processFileSink step, in GenMapRedUtils.isMergeRequired. Since we did 
> not put the cloned file sinks into the fileSinkSet, the stats flags are only set 
> on the original FileSinkOperator. In the GenMapRedUtils.processFileSink API, I 
> added the following to read the stats flags from the original FileSinkOperator 
> and set them on the cloned ones.
>     
>         // Set stats config for FileSinkOperators which are cloned from the fileSink
>         List<FileSinkOperator> fileSinkList = context.fileSinkMap.get(fileSink);
>         if (fileSinkList != null) {
>           for (FileSinkOperator fsOp : fileSinkList) {
>             fsOp.getConf().setGatherStats(fileSink.getConf().isGatherStats());
>             fsOp.getConf().setStatsReliable(fileSink.getConf().isStatsReliable());
>             fsOp.getConf().setMaxStatsKeyPrefixLength(fileSink.getConf().getMaxStatsKeyPrefixLength());
>           }
>         }
> 
> Xuefu Zhang wrote:
>     Thanks for the explanation. If we do put the cloned FileSinkOperators in 
> fileSinkSet, is it true that we don't have to manually copy the flags over?

Yes, if we put the cloned FileSinkOperators in fileSinkSet, we do not have to 
manually copy the flags over. However, that would make the plan more complicated, 
because each FileSinkOperator in the fileSinkSet may generate a Merge and a Move 
task. To avoid duplicated Merge and Move tasks, we only put the original 
FileSinkOperator into the fileSinkSet. I remember seeing issues (wrong data 
results) before when the cloned FileSinkOperators were put into the fileSinkSet 
(this also happens in Tez, actually), so we removed the duplicated 
FileSinkOperators from the fileSinkSet.


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60385
-----------------------------------------------------------


On Nov. 7, 2014, 9:16 p.m., Na Yang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27719/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2014, 9:16 p.m.)
> 
> 
> Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.
> 
> 
> Bugs: Hive-8756
>     https://issues.apache.org/jira/browse/Hive-8756
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> numRows and rawDataSize are not collected by the Spark stats. That is because 
> the stats config is not set on the FileSinkOperator in the ReduceWork. In 
> GenSparkUtils.removeUnionOperators, the operator tree gets cloned and a new 
> FileSinkOperator is generated and set on the reduce work. However, during 
> processFileSink, the collectStats tag is set in GenMapRedUtils.addStatsTask on 
> the original FileSinkOperator, not on the new FileSinkOperator that is used in 
> the ReduceWork.  
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties 79a0132 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
>   ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out 8d237c5 
>   ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 4946815 
>   ql/src/test/results/clientpositive/spark/semijoin.q.out 9b6802d 
>   ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/27719/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Na Yang
> 
>
