Thanks Gerrit, yes, it seems to work in that it loads up the files properly.
However, it fails to understand the schema, and there's no way to specify the underlying schema. Would you have any recommendation for getting the schema right?

J

On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:

> Hi,
>
> Short answer is yes. As long as the partition keys are reflected in the
> folder path itself, AllLoader will pick them up.
>
> Partition keys in Hive are (normally, from my understanding) reflected in
> the file path itself, so that if you have the partitions type and date,
> the table path will actually be:
> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
>
> AllLoader does understand this type of partitioning, so if you point it at
> $HIVE_ROOT/warehouse/mytable it will allow you to use the type and date
> columns to filter (note that you can only specify the filtering in the
> AllLoader() part; see https://issues.apache.org/jira/browse/PIG-1717).
>
> The partitioning is detected by AllLoader (and HiveColumnarLoader) by
> looking at the actual folders in the path and reading all key=value
> patterns in the path name itself, registering these internally as
> partition keys.
>
> -----Original Message-----
> From: Jae Lee [mailto:jae....@forward.co.uk]
> Sent: Wednesday, December 01, 2010 2:03 PM
> To: dev@pig.apache.org
> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
>
> Hi Gerrit,
>
> Yes, the Hive table isn't stored as RCFILE but as TEXTFILE, so our
> table-creation DDL looks like this:
>
> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
>
> Does AllLoader understand the notion of partition keys, as
> HiveColumnarLoader does?
>
> J
>
> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
>
>> Hi,
>>
>> The HiveColumnarLoader can only read files written by Hive or the Hive
>> API(s), and it has its own InputFormat returning the HiveRCRecordReader.
>>
>> Are you trying to read a plain text format?
>> Under the hood, the HiveRCRecordReader uses the Hive-specific RC reader
>> to read the input file and throws an error if the file either is not
>> Hive RC or is a corrupt Hive RC file.
>>
>> If what you want is a loader that loads all types of files, have a look
>> at the AllLoader (latest piggybank trunk). It uses configuration that
>> you set in pig.properties to decide on the fly which loader to use for
>> which files (it does extension, content, and path matching), and it also
>> has the Hive-style path partitioning for dates etc. Using this loader
>> you can point it at a directory with lzo, gz, bz2, hiverc etc. files in
>> it, and if you set up the loaders correctly it will load each file with
>> its preconfigured loader.
>> The javadoc in the class explains how to configure it.
>>
>> Cheers,
>> Gerrit
>>
>> -----Original Message-----
>> From: Jae Lee [mailto:jae....@forward.co.uk]
>> Sent: Wednesday, December 01, 2010 12:33 PM
>> To: dev@pig.apache.org
>> Subject: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>>
>> Hi everyone,
>>
>> I've tried using HiveColumnarLoader and I'm getting:
>> java.io.IOException: hdfs://file_path not a RCFile
>>
>> I've noticed HiveColumnarLoader is expecting a HiveRCRecordReader in its
>> prepareToRead method.
>>
>> Could you give any guidance on how feasible it would be to modify
>> HiveRCRecordReader to support any RecordReader?
>>
>> J
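
Putting the thread together, a minimal Pig Latin sketch of what loading the partitioned TEXTFILE-backed table with AllLoader might look like follows. This is an assumption-laden sketch: the fully qualified class name and the exact partition-filter expression passed to AllLoader() are not confirmed anywhere in the thread (PIG-1717 covers the filter discussion), so check the AllLoader javadoc in piggybank before relying on either.

```pig
-- Sketch only: the class path and the filter-string syntax below are
-- assumptions, not verified against piggybank trunk.
REGISTER piggybank.jar;

-- Point AllLoader at the table root; the type=.../date=... directories
-- under it are detected and registered as partition keys.
-- Per PIG-1717, partition filtering can only be given inside AllLoader().
raw = LOAD '/user/hive/warehouse/mytable'
      USING org.apache.pig.piggybank.storage.allloader.AllLoader(
          'type == "web" and date >= "2010-11-01"');

DUMP raw;
```

The per-extension loader mapping Gerrit mentions (which loader handles .gz, .bz2, hiverc, and so on) is configured in pig.properties; the javadoc in the AllLoader class describes the exact property names.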