Re: Review Request 24688: parallel order by clause on a string column fails with IOException: Split points are out of order

Szehon Ho Mon, 25 Aug 2014 12:08:54 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24688/#review51426
-----------------------------------------------------------



This looks like an important bug to fix, but I don't know much about this code. 
Can you explain what the bug in the getPartitionKey algorithm is, and what the 
fix is?  For example, why do we need to alter the stepSize as we iterate?  Is 
there a test we can add to illustrate and validate the fix?
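
For context, a minimal sketch of one way the reported error can arise. It assumes split points are picked at a fixed step from the sorted samples; this is not the actual getPartitionKey code or the fix in this patch, and the class name, method names, and sample values are all hypothetical. The one grounded fact is that Hadoop's TotalOrderPartitioner rejects split points that are not strictly increasing, which is the "Split points are out of order" error in the stack trace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the Hive PartitionKeySampler code.
class SplitPointSketch {

    // 40 sorted samples drawn from only 4 distinct placeholder values.
    static String[] sampleData() {
        String[] values = {"A", "B", "C", "D"};
        String[] samples = new String[40];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = values[i % values.length];
        }
        Arrays.sort(samples);
        return samples;
    }

    // Naive approach (assumed): take every step-th sample as a split point.
    static List<String> naiveSplitPoints(String[] sortedSamples, int numReducers) {
        requireEnoughSamples(sortedSamples, numReducers);
        int step = sortedSamples.length / numReducers;
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < numReducers; i++) {
            keys.add(sortedSamples[i * step]);
        }
        return keys;
    }

    // Variant that skips any candidate not greater than the last emitted key.
    // It may return fewer than numReducers - 1 keys.
    static List<String> distinctSplitPoints(String[] sortedSamples, int numReducers) {
        requireEnoughSamples(sortedSamples, numReducers);
        int step = sortedSamples.length / numReducers;
        List<String> keys = new ArrayList<>();
        String last = null;
        for (int i = 1; i < numReducers; i++) {
            String candidate = sortedSamples[i * step];
            if (last == null || candidate.compareTo(last) > 0) {
                keys.add(candidate);
                last = candidate;
            }
        }
        return keys;
    }

    // Same condition TotalOrderPartitioner enforces: each key < the next key.
    static boolean strictlyIncreasing(List<String> keys) {
        for (int i = 0; i + 1 < keys.size(); i++) {
            if (keys.get(i).compareTo(keys.get(i + 1)) >= 0) {
                return false;
            }
        }
        return true;
    }

    private static void requireEnoughSamples(String[] samples, int numReducers) {
        if (numReducers < 1 || samples.length < numReducers) {
            throw new IllegalArgumentException("need at least numReducers samples");
        }
    }

    public static void main(String[] args) {
        String[] samples = sampleData();
        List<String> naive = naiveSplitPoints(samples, 10);
        List<String> distinct = distinctSplitPoints(samples, 10);
        System.out.println(naive + " valid=" + strictlyIncreasing(naive));
        System.out.println(distinct + " valid=" + strictlyIncreasing(distinct));
    }
}
```

With only 4 distinct values, the naive picks contain repeated keys and fail the ordering check, while the de-duplicated variant passes but yields far fewer keys than the 9 requested for 10 reducers.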

I'm also unsure whether the other changes in the patch are related:

1.  Is adding setConf to HiveTotalOrderPartitioner related to the bug?
2.  What is the purpose of the new HiveConf "..min.reducer"?  My guess is that 
you found the algorithm sometimes does not generate enough partition keys; can 
you explain?


common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/24688/#comment89744>

    If this needs to be exposed, it should be worded better.  Something like:
    
    name = "hive.optimize.sampling.orderby.min.reducer.ratio"
    
    "If sampling is enabled, this is the minimum allowed ratio of the number of 
reducers calculated by sampling to the expected number of reducers".
    
    It might be confusing to users, in my opinion, as the user has little 
control over what the expected number of reducers is, right?



ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
<https://reviews.apache.org/r/24688/#comment89742>

    Please add some more context to this debug statement.



ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
<https://reviews.apache.org/r/24688/#comment89743>

    If this needs to be exposed, the message can be "Sampling generated x number of 
reducers, but it was expected to be y"
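
A minimal sketch of how such a ratio check and message might fit together, assuming the check compares the reducer count derived from sampling against the expected count. The class and method names are hypothetical, and the actual patch may implement this differently.

```java
import java.util.Optional;

// Hypothetical illustration only; not the code under review.
class MinReducerRatioSketch {

    // Returns a warning message when sampledReducers / expectedReducers
    // falls below minRatio, and an empty Optional otherwise.
    static Optional<String> check(int sampledReducers, int expectedReducers, double minRatio) {
        if (expectedReducers <= 0) {
            throw new IllegalArgumentException("expectedReducers must be positive");
        }
        double ratio = (double) sampledReducers / expectedReducers;
        if (ratio < minRatio) {
            return Optional.of("Sampling generated " + sampledReducers
                + " reducers, but it was expected to be " + expectedReducers);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // 5 of 10 expected reducers is a ratio of 0.5.
        System.out.println(check(5, 10, 0.6).orElse("ok"));
        System.out.println(check(5, 10, 0.5).orElse("ok"));
    }
}
```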


- Szehon Ho


On Aug. 14, 2014, 2:29 a.m., Navis Ryu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24688/
> -----------------------------------------------------------
> 
> (Updated Aug. 14, 2014, 2:29 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-7669
>     https://issues.apache.org/jira/browse/HIVE-7669
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The source table has 600 million rows and a String column 
> "l_shipinstruct" which has 4 unique values (i.e., these 4 values are repeated 
> across the 600 million rows).
> 
> We are sorting on this string column "l_shipinstruct", as shown in 
> the HiveQL below, with the following parameters. 
> {code:sql}
> set hive.optimize.sampling.orderby=true;
> set hive.optimize.sampling.orderby.number=10000000;
> set hive.optimize.sampling.orderby.percent=0.1f;
> 
> insert overwrite table lineitem_temp_report 
> select 
>   l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, 
>   l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, 
>   l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment
> from 
>   lineitem
> order by l_shipinstruct;
> {code}
> Stack Trace
> Diagnostic Messages for this Task:
> {noformat}
> Error: java.lang.RuntimeException: Error in configuring object
>         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>         at org.apache.hadoop.mapred.MapTask$OldOutputCollector.<init>(MapTask.java:569)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:601)
>         at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
>         ... 10 more
> Caused by: java.lang.IllegalArgumentException: Can't read partitions file
>         at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:116)
>         at org.apache.hadoop.mapred.lib.TotalOrderPartitioner.configure(TotalOrderPartitioner.java:42)
>         at org.apache.hadoop.hive.ql.exec.HiveTotalOrderPartitioner.configure(HiveTotalOrderPartitioner.java:37)
>         ... 15 more
> Caused by: java.io.IOException: Split points are out of order
>         at org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:96)
>         ... 17 more
> {noformat}
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java af9e198 
>   common/src/java/org/apache/hadoop/hive/conf/Validator.java cea9c41 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/HiveTotalOrderPartitioner.java 6c22362 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java 166461a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecDriver.java ef72039 
> 
> Diff: https://reviews.apache.org/r/24688/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Navis Ryu
> 
>
