Hello Ryan: I am using Hive version 1.2.1, as indicated below:
--------------------------------------
$ hive --version
Hive 1.2.1
Subversion git://localhost.localdomain/home/sush/dev/hive.git -r 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
From source with checksum ab480aca41b24a9c3751b8c023338231
$
--------------------------------------

As I understand it, this version of Hive supports the "date" datatype, right? Do you want me to re-test using a higher version of Hive? Please let me know your thoughts.

Thanks,
Ravi



From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
Date: 03/11/2016 06:18 AM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files

What version of Hive are you using? You should make sure date is supported
there.

rb

On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]> wrote:

> Hello Ryan:
>
> Many thanks for the reply. I see that the text attachment containing my
> test program was not sent to the mailing list but got filtered out.
> Hence, I am copying the program code below:
>
> =================================================================
> import java.io.IOException;
> import java.util.*;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.avro.Schema;
> import org.apache.avro.Schema.Type;
> import org.apache.avro.Schema.Field;
> import org.apache.avro.generic.*;
> import org.apache.avro.LogicalTypes;
> import org.apache.avro.LogicalTypes.*;
> import org.apache.hadoop.hive.common.type.HiveDecimal;
> import parquet.avro.*;
>
> public class pqtw {
>
>     public static Schema makeSchema() {
>         List<Field> fields = new ArrayList<Field>();
>         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
>         fields.add(new Field("age", Schema.create(Type.INT), null, null));
>
>         // "date" logical type: an INT holding days since the Unix epoch
>         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
>         fields.add(new Field("doj", date, null, null));
>
>         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
>         schema.setFields(fields);
>
>         return schema;
>     }
>
>     public static GenericData.Record makeRecord(Schema schema, String name,
>             int age, int doj) {
>         GenericData.Record record = new GenericData.Record(schema);
>         record.put("name", name);
>         record.put("age", age);
>         record.put("doj", doj);
>         return record;
>     }
>
>     public static void main(String[] args) throws IOException,
>             InterruptedException, ClassNotFoundException {
>         String pqfile = "/tmp/pqtfile1";
>         try {
>             Configuration conf = new Configuration();
>             FileSystem fs = FileSystem.getLocal(conf);
>
>             Schema schema = makeSchema();
>             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
>             AvroParquetWriter<GenericData.Record> writer =
>                 new AvroParquetWriter<GenericData.Record>(new Path(pqfile), schema);
>             writer.write(rec);
>             writer.close();
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
> =================================================================
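>
> For reference, the integer passed for "doj" above is a day count from the
> Unix epoch (1 January 1970). A minimal sketch of how that value can be
> derived with java.time (assuming Java 8; this helper class is illustrative
> and not part of the test program):
>
> =================================================================
> import java.time.LocalDate;
>
> public class EpochDays {
>     public static void main(String[] args) {
>         // LocalDate.toEpochDay() returns the number of days since
>         // 1970-01-01 as a long; narrow it to the int that the Avro
>         // "date" logical type stores.
>         int doj = (int) LocalDate.of(2011, 1, 26).toEpochDay();
>         System.out.println(doj);                       // 15000
>
>         // And in the other direction, to check a stored value:
>         System.out.println(LocalDate.ofEpochDay(doj)); // 2011-01-26
>     }
> }
> =================================================================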
>
> With the above logic, I could write the data to the parquet-file.
> However, when I load the same into a hive-table and select columns, I
> could select the "name" and "age" columns (i.e., the VARCHAR and INT
> columns) successfully, but selecting the "date" column failed with the
> error given below:
>
> --------------------------------------------------------------------------------
> hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
> OK
> Time taken: 0.369 seconds
> hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> hive> SELECT name,age from PT1;
> OK
> abcd    21
> Time taken: 0.311 seconds, Fetched: 1 row(s)
> hive> SELECT doj from PT1;
> OK
> Failed with exception
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> cast to org.apache.hadoop.hive.serde2.io.DateWritable
> Time taken: 0.167 seconds
> hive>
> --------------------------------------------------------------------------------
>
> Basically, for the "date" datatype, I am passing an integer value (the
> number of days since the Unix epoch, 1 January 1970, so that the date
> falls somewhere around 2011). Is this the correct approach to process
> date data, or is there another approach/API for it? Could you please let
> me know your inputs in this regard?
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> Date: 03/09/2016 10:48 PM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> Hi Ravi,
>
> Not all of the types are fully implemented yet. I think Hive only has
> partial support. If I remember correctly:
> * Decimal is supported if the backing primitive type is fixed-length
>   binary
> * Date and Timestamp are supported, but Time has not been implemented yet
>
> For object models you can build applications on (instead of those
> embedded in SQL), only Avro objects can support those types through its
> LogicalTypes API. That API has been implemented in parquet-avro, but not
> yet committed. I would like for this feature to make it into 1.9.0. If
> you want to test in the meantime, check out the pull request:
>
> https://github.com/apache/parquet-mr/pull/318
>
> rb
>
> On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]>
> wrote:
>
> > Hello,
> >
> > I am Ravi Tatapudi, from IBM India. I am working on a simple test tool
> > that writes data to Parquet-files, which can be imported into
> > hive-tables. Please find attached a sample program, which writes a
> > simple parquet-data-file:
> >
> >
> >
> > Using the above program, I could create "parquet-files" with the
> > data-types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all
> > data-types supported by org.apache.avro.Schema.Type) and load them into
> > "hive" tables successfully.
> >
> > Now, I am trying to figure out how to write "date, timestamp, decimal"
> > data into Parquet-files. In this context, could you please provide the
> > possible options (and/or a sample program, if any) in this regard?
> >
> > Thanks,
> > Ravi
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



--
Ryan Blue
Software Engineer
Netflix
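
A minimal sketch of how the logical types discussed above (date,
timestamp-millis, and a fixed-backed decimal) are declared through Avro's
LogicalTypes API. This shows only the schema construction; whether
parquet-avro round-trips these annotations end-to-end depends on the pull
request referenced above, so the record and field names here are
illustrative assumptions, not code from this thread:

=================================================================
import java.util.Arrays;

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;

public class LogicalTypeSchemas {
    public static void main(String[] args) {
        // date: days since 1970-01-01, annotating an INT
        Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));

        // timestamp-millis: milliseconds since the epoch, annotating a LONG
        Schema tsMillis =
            LogicalTypes.timestampMillis().addToSchema(Schema.create(Type.LONG));

        // decimal(precision, scale) on a fixed-length binary backing type,
        // the variant Ryan notes Hive supports; 16 bytes comfortably holds
        // precision 18 (the size choice is an assumption for this sketch)
        Schema decimal = LogicalTypes.decimal(18, 2)
            .addToSchema(Schema.createFixed("amount_fixed", null, null, 16));

        Schema record = Schema.createRecord("filecc2", null, "parquet", false);
        record.setFields(Arrays.asList(
            new Field("doj", date, null, null),
            new Field("updated_at", tsMillis, null, null),
            new Field("amount", decimal, null, null)));

        // Print the schema to confirm the logicalType annotations are set
        System.out.println(record.toString(true));
    }
}
=================================================================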
