Hi,

Thanks for letting me know about the Jira ticket. Yes, it would be necessary to have those partitions as part of the schema to group by them.
J

On 1 Dec 2010, at 16:33, Gerrit Jansen van Vuuren wrote:

> Hi,
>
> You'll have to tell pig in the AS statement what the schema is, e.g.:
>
>   I = LOAD '$INPUT' USING AllLoader() AS ( viewTime:int, userid:long,
>       page_url:chararray, referrer_url:chararray, ip:chararray,
>       country:chararray );
>
> The only problem with the AllLoader currently (until the jira I sent
> earlier is fixed) is that the partition keys won't be in the schema
> itself, but you can still filter by partition using the AllLoader
> constructor, for example AllLoader("date>='2010-11-01'").
>
> Cheers,
> Gerrit
>
>> viewTime INT, userid BIGINT,
>> page_url STRING, referrer_url STRING,
>> ip STRING COMMENT 'IP Address of the User',
>> country STRING COMMENT 'country of origination')
>
> -----Original Message-----
> From: Jae Lee [mailto:jae....@forward.co.uk]
> Sent: Wednesday, December 01, 2010 4:24 PM
> To: dev@pig.apache.org
> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
>
> Thanks Gerrit,
>
> yeah it seems to work, in that it loads up the files properly...
>
> however it fails to understand the schema, and there's no way to specify
> the underlying schema...
>
> Would you have any recommendation to get the schema right?
>
> J
>
> On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:
>
>> Hi,
>>
>> Short answer is yes. As long as the partition keys are reflected in the
>> folder path itself, AllLoader will pick them up.
>>
>> Partition keys in hive are (normally, from my understanding) reflected
>> in the file path itself, so that if you have partitions: type, date
>> the table path will actually be
>> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
>>
>> The AllLoader does understand this type of partitioning.
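[Editor's note: the key=value path convention described above can be sketched outside of Pig. This is a minimal illustration in Python of how partition keys are recovered from a hive-style path, not the actual piggybank Java implementation; the helper name and sample path are made up.]

```python
import re

# Hypothetical helper: recover hive-style partition keys from a path.
# Each "key=value" folder segment becomes a partition column, mirroring
# the layout $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value].
def extract_partition_keys(path):
    keys = {}
    for segment in path.split("/"):
        match = re.fullmatch(r"([^=]+)=(.*)", segment)
        if match:
            keys[match.group(1)] = match.group(2)
    return keys

path = "/user/hive/warehouse/page_view/type=web/date=2010-11-01/part-00000"
print(extract_partition_keys(path))  # {'type': 'web', 'date': '2010-11-01'}
```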
>> So that if you point it to load $HIVE_ROOT/warehouse/mytable
>> it will allow you to use the type and date columns to filter (note that
>> you can only specify the filtering in the AllLoader() part, see:
>> https://issues.apache.org/jira/browse/PIG-1717 )
>>
>> The partitioning is detected by the AllLoader (and HiveColumnarLoader)
>> by looking at the actual folders in the path, and reading all key=value
>> patterns in the path name itself, registering these internally as
>> partition keys.
>>
>> -----Original Message-----
>> From: Jae Lee [mailto:jae....@forward.co.uk]
>> Sent: Wednesday, December 01, 2010 2:03 PM
>> To: dev@pig.apache.org
>> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>>
>> Hi Gerrit,
>>
>> Yeah, the Hive table isn't stored as RCFILE but TEXTFILE,
>> so our table creation DDL looks like the below:
>>
>> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>>   page_url STRING, referrer_url STRING,
>>   ip STRING COMMENT 'IP Address of the User',
>>   country STRING COMMENT 'country of origination')
>> COMMENT 'This is the staging page view table'
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> STORED AS TEXTFILE
>>
>> Does AllLoader understand the notion of partition keys, as
>> HiveColumnarLoader does?
>>
>> J
>>
>> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
>>
>>> Hi,
>>>
>>> The HiveColumnarLoader can only read files written by hive or the hive
>>> API(s), and has its own InputFormat returning the HiveRCRecordReader.
>>>
>>> Are you trying to read a plain text format?
>>> Under the hood the HiveRCRecordReader uses the hive-specific rc reader
>>> to read the input file, and throws an error if the file is either not
>>> hive rc or is a corrupt hiverc.
>>>
>>> If what you want is a Loader that loads all types of files, have a
>>> look at the AllLoader (latest piggybank trunk).
>>> It uses configuration that you set in the pig.properties to decide on
>>> the fly what loader to use for which files (it does extension, content
>>> and path matching), and it also has the hive-style path partitioning
>>> for dates etc. Using this loader you can point it at a directory with
>>> lzo, gz, bz2, hiverc etc. files in it, and if you set up the loaders
>>> correctly it will load each file with its preconfigured loader.
>>> The javadoc in the class explains how to configure it.
>>>
>>> Cheers,
>>> Gerrit
>>>
>>> -----Original Message-----
>>> From: Jae Lee [mailto:jae....@forward.co.uk]
>>> Sent: Wednesday, December 01, 2010 12:33 PM
>>> To: dev@pig.apache.org
>>> Subject: has anyone tried using HiveColumnarLoader over TextFile
>>> fileformat?
>>>
>>> Hi everyone,
>>>
>>> I've tried using HiveColumnarLoader and I'm getting java.io.IOException:
>>> hdfs://file_path not a RCFile
>>>
>>> I've noticed HiveColumnarLoader is expecting a HiveRCRecordReader from
>>> the prepareToRead method...
>>>
>>> Could you give any guidance on how feasible it would be to modify
>>> HiveRCRecordReader to support any RecordReader?
>>>
>>> J
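[Editor's note: the constructor-style partition filter mentioned earlier in the thread, e.g. AllLoader("date>='2010-11-01'"), can be illustrated with a small sketch. This is a hypothetical Python model of partition pruning, not the piggybank implementation; the function names are made up, only three comparison operators are modelled, and it relies on ISO-formatted dates comparing correctly as plain strings.]

```python
import operator

# Hypothetical model of a partition filter such as "date>='2010-11-01'":
# parse the expression, then test it against the key=value pairs pulled
# out of each partition folder, keeping only the matching partitions.
OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}

def parse_filter(expression):
    for symbol in (">=", "<=", "="):  # try multi-character symbols first
        if symbol in expression:
            key, _, raw = expression.partition(symbol)
            return key.strip(), OPS[symbol], raw.strip().strip("'")
    raise ValueError("unsupported filter: " + expression)

def prune(partitions, expression):
    key, op, value = parse_filter(expression)
    # ISO-formatted dates compare correctly as plain strings.
    return [p for p in partitions if key in p and op(p[key], value)]

partitions = [{"date": "2010-10-31"}, {"date": "2010-11-01"}, {"date": "2010-12-01"}]
print(prune(partitions, "date>='2010-11-01'"))
# [{'date': '2010-11-01'}, {'date': '2010-12-01'}]
```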