[ 
https://issues.apache.org/jira/browse/HIVE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895428#action_12895428
 ] 

Ning Zhang commented on HIVE-1510:
----------------------------------

It's fine with me if you feel strongly about it. My concern (besides 
har+CHIF support) is the performance implication of using CHIF to merge a 
large number of small files inside a partition. Siying has a use case where 
pathToPartitionInfo is very large and the number of files in the splits is 
also very large. Determining the partitionDesc for each input path takes a 
long time. In your patch, you have another HashMap for the path part of 
pathToPartitionInfo (which trades memory for speed), but it introduces another 
loop for comparing the parents of paths. It would be nice (better performance) 
if you could avoid this loop by simply appending '/' at the end. But if it 
doesn't hurt performance, or if appending '/' doesn't work, the current patch 
is fine with me too.
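
To make the appending-'/' suggestion concrete, here is a minimal sketch. It is 
not the Hive code or the patch: the class name, method names and warehouse 
paths are made up for illustration. It only shows why a plain prefix test 
confuses hr=00 with hr=001, and why a trailing '/' removes that false match.

```java
public class PrefixMatchSketch {
    // Naive check: the partition directory "matches" if the file path
    // merely starts with it. This is the behaviour the issue complains about.
    static boolean naivePrefixMatch(String filePath, String partitionDir) {
        return filePath.startsWith(partitionDir);
    }

    // Same check, but with a trailing '/' on the directory so that only
    // whole path components can match.
    static boolean slashPrefixMatch(String filePath, String partitionDir) {
        String dir = partitionDir.endsWith("/") ? partitionDir : partitionDir + "/";
        return filePath.startsWith(dir);
    }

    public static void main(String[] args) {
        // Hypothetical locations modelled on the repro in the issue description.
        String hr00 = "/warehouse/t/ds=2010-08-03/hr=00";
        String hr001 = "/warehouse/t/ds=2010-08-03/hr=001";
        String file = hr001 + "/000000_0";

        System.out.println(naivePrefixMatch(file, hr00));  // true  (wrong partition)
        System.out.println(slashPrefixMatch(file, hr00));  // false
        System.out.println(slashPrefixMatch(file, hr001)); // true
    }
}
```

Whether this is enough in practice depends on how the keys of 
pathToPartitionInfo are stored, which the sketch does not model.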

As an aside, we should find out why pathToPartitionInfo in some cases contains 
paths only rather than the full URI. Ideally it should always contain the full 
URI so that we don't rely on heuristics. But this could be another JIRA.
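
As a rough illustration of the path-only lookup discussed above, the sketch 
below reduces both full URIs and bare paths to their path component, then walks 
up the parents of a file until an exact key is found. It uses only 
java.net.URI and a HashMap. The class, the helper names, the "nn:8020" 
authority and the "seq"/"rc" values are assumptions for the example, not 
anything taken from the patch.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PathKeySketch {
    // Reduce a full URI or a bare path to its path component,
    // dropping any trailing '/' so keys compare consistently.
    static String pathOnly(String location) {
        String p = URI.create(location).getPath();
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }

    // Exact-match lookup: try the path itself, then each parent in turn.
    // This is the extra loop over parents, but every step is a hash lookup.
    static String findPartition(Map<String, String> byPath, String file) {
        String p = pathOnly(file);
        while (p != null && !p.isEmpty()) {
            String hit = byPath.get(p);
            if (hit != null) {
                return hit;
            }
            int i = p.lastIndexOf('/');
            if (i <= 0) {
                return null; // reached the root without a match
            }
            p = p.substring(0, i);
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> byPath = new HashMap<>();
        // One key registered as a full URI, the other as a bare path (hypothetical).
        byPath.put(pathOnly("hdfs://nn:8020/warehouse/t/ds=2010-08-03/hr=00"), "seq");
        byPath.put(pathOnly("/warehouse/t/ds=2010-08-03/hr=001"), "rc");

        String file = "hdfs://nn:8020/warehouse/t/ds=2010-08-03/hr=001/000000_0";
        System.out.println(findPartition(byPath, file)); // rc
    }
}
```

Note that dropping the scheme and authority is itself a heuristic: two 
different file systems with the same path would collide, which is one reason 
to prefer full URIs as keys.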

> HiveCombineInputFormat should not use prefix matching to find the 
> partitionDesc for a given path
> ------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1510
>                 URL: https://issues.apache.org/jira/browse/HIVE-1510
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1510.1.patch
>
>
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by 
> (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", 
> hr="00") select * from src;
> alter table combine_3_srcpart_seq_rc set fileformat rcfile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", 
> hr="001") select * from src;
> desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="00");
> desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="001");
> select * from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key;
> drop table combine_3_srcpart_seq_rc;
> will fail.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
