[
https://issues.apache.org/jira/browse/HIVE-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726518#action_12726518
]
Edward Capriolo commented on HIVE-493:
--------------------------------------
I have a USE CASE for something similar and I wanted to get people's opinions
on it. My intake process is a map/reduce job that takes as input a list of
servers. I connect to these servers via FTP and fetch all the new files. We
are collecting 5-minute logs.
I have a map-only job that writes the files to a static HDFS folder. After the
map process is complete I am presented with exactly this problem:
do I assume the partition is created, and copy the files? I decided to let
Hive handle this instead.
{noformat}
// Build a LOAD DATA statement that moves the pulled files into today's partition.
String hql = "load data inpath '"
    + conf.get("fs.default.name") + "/user/ecapriolo/pull/raw_web_log/" + p.getName()
    + "' into table raw_web_data partition (log_date_part='"
    + dateFormat.format(today.getTime()) + "')";
System.out.println("Running " + hql);
// Fork the Hive CLI to execute the statement.
String[] run = new String[] { "/opt/hive/bin/hive", "-e", hql };
LoadThread lt = new LoadThread(run);
Thread t = new Thread(lt);
t.start();
{noformat}
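LoadThread itself is not shown above; a minimal sketch of what such a Runnable might look like (hypothetical, not the actual class: it forks the given command with ProcessBuilder, inherits the parent environment and streams, and records the exit code):

{noformat}
import java.io.IOException;

// Hypothetical sketch of the LoadThread used above: forks the command,
// inherits the parent process environment and stdout/stderr, and waits.
class LoadThread implements Runnable {
    private final String[] command;
    private volatile int exitCode = -1;

    LoadThread(String[] command) {
        this.command = command;
    }

    int getExitCode() {
        return exitCode;
    }

    @Override
    public void run() {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.inheritIO();         // stream the child's stdout/stderr to ours
            Process p = pb.start();
            exitCode = p.waitFor(); // block until the CLI exits
        } catch (IOException e) {
            throw new RuntimeException("failed to fork " + command[0], e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{noformat}

Inheriting the parent environment via ProcessBuilder is one way to avoid fighting with the forked CLI's environment setup.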
Personally, I do not think we should make users reach into Hive's table layout
directly. Users should have tools, whether these are API-based or HQL-based.
I should not have to mix and match between hive -e 'something', map/reduce, and
bash scripting to get a job accomplished (I spent 4 hours trying to get the
environment correct for my forked 'hive -e' query). (I probably should learn
more about the Thrift API.)
But that problem I have already solved. My next problem is also relevant to
this discussion. I now have too many files inside my directory. I am
partitioned by day, but each server is dropping 5-minute log files. What I
really need now is a COMPACT function, to merge all these 5-minute data files
into one. What would be the proper way to handle this? I could take an
all-query-based approach: select all the data into a new table, then drop the
partition and select the data back into the original table. However, I could
short-circuit the operations (and save time) by building the new partition
first, deleting the old data, and then moving the new data back using 'dfs
-mv'.
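The all-query-based variant could be sketched roughly like this (a sketch only: the staging table is assumed to already exist with the same schema, and the date value and column list are placeholders for the real partition and columns):

{noformat}
-- Sketch of query-based compaction; '2009-07-01' and (c1, c2) are
-- placeholders. raw_web_data_compact is a staging table with the
-- same schema as raw_web_data.

-- 1) Rewrite the fragmented partition into the staging table
--    (a single writer produces far fewer files).
INSERT OVERWRITE TABLE raw_web_data_compact PARTITION (log_date_part='2009-07-01')
SELECT c1, c2 FROM raw_web_data WHERE log_date_part='2009-07-01';

-- 2) Drop the fragmented partition, then select the data back.
ALTER TABLE raw_web_data DROP PARTITION (log_date_part='2009-07-01');
INSERT OVERWRITE TABLE raw_web_data PARTITION (log_date_part='2009-07-01')
SELECT c1, c2 FROM raw_web_data_compact WHERE log_date_part='2009-07-01';
{noformat}

The short-circuit version would skip step 2's second INSERT and instead move the compacted files into place with a dfs move.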
Should this be done through HQL ("COMPACT TABLE x PARTITION y")? Or should a
command-line service be provided, e.g. bin/hive --service compact table X
partition Y? Doing it all through HQL is possible now, but not optimized in
some cases, unless I am missing something.
I think we need easier insight into the metastore from HQL, the way MySQL
provides it. SHOW TABLES is a good step, but we need something like a virtual
read-only schema table to query.
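Such a virtual table might be queried like this (purely hypothetical syntax, modeled on MySQL's information_schema; no such table exists in Hive today, and all names are invented):

{noformat}
-- Hypothetical: a read-only virtual view over the metastore.
SELECT partition_name, hdfs_location
FROM metastore.partitions
WHERE table_name = 'raw_web_data';
{noformat}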
Sorry to be all over the place on this post.
> automatically infer existing partitions of table from HDFS files.
> -----------------------------------------------------------------
>
> Key: HIVE-493
> URL: https://issues.apache.org/jira/browse/HIVE-493
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.3.0, 0.3.1, 0.4.0
> Reporter: Prasad Chakka
>
> Initially the partition list for a table was inferred from the HDFS directory
> structure instead of looking into the metastore (partitions are created using
> 'alter table ... add partition'). But this automatic inference was removed in
> favor of the latter approach while checking in the metastore checker feature,
> and also to facilitate external partitions.
> Joydeep and Frederick mentioned that it would be simpler for users to create
> the HDFS directory and let Hive infer the partition rather than explicitly add
> one. But doing that raises the following issues:
> 1) External partitions -- we have to mix both approaches, so the partition
> list is a merged list of inferred partitions and registered partitions, and
> duplicates have to be resolved.
> 2) Partition-level schemas can't be supported. Which schema do we choose for
> an inferred partition? The table schema at the time the inferred partition was
> created, or the latest table schema? And how do we know the table schema at
> the time the inferred partition was created?
> 3) If partitions are registered, a partition can be disabled without actually
> deleting its data. This feature is not yet supported and may not be that
> useful, but nevertheless it can't be supported with inferred partitions.
> 4) Indexes are being added. So if partitions are not registered, then indexes
> for such partitions cannot be maintained automatically.
> I would like to know what the general thinking about this is among users of
> Hive. If inferred partitions are preferred, can we live with the restricted
> functionality that this imposes?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.