Hi,
I have a few questions about Hive and its use cases.

1. hive-on-hadoop-20 accessing/processing data stored on hadoop-18-dfs
    The actual files are on the hadoop-18 DFS, and I would create an external table
in hive-on-hadoop-20 whose location points to hadoop-18-dfs.
    I don't think this is possible, given the Hadoop version incompatibility, but
it never hurts to ask.


2. We download tons of URLs and massage the data. The massaging goes through
various stages, and we would like to monitor these stages,
   so I was thinking of a schema like the following:
        One Table :
        
        url as STRING,
        
        massage_step1 is a STRUCT 
        massage_step2 is a STRUCT
        .
        .
        feature_set  is ARRAY<STRING>

      Each STRUCT can hold arrays of longs, ids, timestamps, a success/failure
flag, and reasons.
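
      To make the schema concrete, here is a rough sketch of the DDL I have in
mind. The table name, the field names and types inside each STRUCT, and the
LOCATION path are all placeholders I made up for illustration; I have not run
this against real data.

```sql
-- Sketch only: url_pipeline, the struct field names, and the path are hypothetical.
-- `ts` stands in for the timestamp field (stored here as epoch seconds).
CREATE EXTERNAL TABLE url_pipeline (
  url            STRING,
  massage_step1  STRUCT<ids: ARRAY<BIGINT>,
                        ts: BIGINT,
                        success: BOOLEAN,
                        reasons: STRING>,
  massage_step2  STRUCT<ids: ARRAY<BIGINT>,
                        ts: BIGINT,
                        success: BOOLEAN,
                        reasons: STRING>,
  feature_set    ARRAY<STRING>
)
-- No ROW FORMAT clause: this assumes the files use the default SerDe delimiters.
LOCATION '/path/to/massaged/urls';
```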


     Assuming that I am on the right track here:
        will I be able to run queries like:
                q1. where massage_step1.reasons like '%Failed on fetching%'
                q2. where feature_set like 'shopping'
                        (feature_set is an array, so I think I would have to
implement a UDFLike for arrays)
                q3. where massage_step2.ids < 10K

                q4. select count(*) as count where timestamps < 'SOME_DATE'
group by massage_step1.success = true
                
        In short, can I query on data inside complex types like Struct, Array,
Map, etc.?
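
        Written out as full statements, I imagine q1 to q4 would look something
like the sketch below. It assumes a table named url_pipeline (a made-up name)
with the columns described above, a timestamp field called ts holding epoch
seconds, and that array_contains and LATERAL VIEW / explode are available in
the Hive build in use, which I have not verified.

```sql
-- Hypothetical table and field names throughout.

-- q1: LIKE on a string field inside a struct
SELECT url
FROM url_pipeline
WHERE massage_step1.reasons LIKE '%Failed on fetching%';

-- q2: membership test on an array column
-- (assumes array_contains exists in this Hive version)
SELECT url
FROM url_pipeline
WHERE array_contains(feature_set, 'shopping');

-- q3: compare each element of an array inside a struct
-- (assumes LATERAL VIEW and explode exist in this Hive version)
SELECT DISTINCT url
FROM url_pipeline
LATERAL VIEW explode(massage_step2.ids) exploded AS id
WHERE id < 10000;

-- q4: count rows before a cutoff, grouped by the success flag
SELECT massage_step1.success, count(*) AS cnt
FROM url_pipeline
WHERE massage_step1.ts < 1262304000   -- placeholder value for 'SOME_DATE'
GROUP BY massage_step1.success;
```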

3. Some of the queries will require data from 2 or more structs and some won't.
    In the above example, I am keeping it all in one table (an external table). The
other option is multiple tables: one for each massage_step.

    In the case of multiple tables, I will have to run JOIN queries, and in the
case of a single table, I will filter data using a where clause.

   Which is more expensive: JOIN queries or filtering data using a where clause?
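
   To make the comparison concrete, the two alternatives would look roughly like
the sketch below. The table names url_pipeline, step1 and step2, the join key
url, and the success field are assumptions for illustration only.

```sql
-- Option A: single table, both structs filtered with a where clause
SELECT url
FROM url_pipeline
WHERE massage_step1.success = false
  AND massage_step2.success = true;

-- Option B: one table per massage step (hypothetical tables step1 and step2),
-- joined on url
SELECT s1.url
FROM step1 s1
JOIN step2 s2 ON (s1.url = s2.url)
WHERE s1.success = false
  AND s2.success = true;
```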


Feedback is greatly appreciated.

Thanks,
Sagar
        
   
        
