Yes, it is supported in 1.2.1. It went in here:
https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b

Are you using a version of Parquet with that pull request in it? Also, if
you're using CDH this may not work.

rb

On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <[email protected]> wrote:

> Hello Ryan:
>
> I am using hive-version 1.2.1, as indicated below:
>
> --------------------------------------
> $ hive --version
> Hive 1.2.1
> Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> From source with checksum ab480aca41b24a9c3751b8c023338231
> $
> --------------------------------------
>
> As I understand, this version of Hive supports the "date" datatype,
> right? Do you want me to re-test using a higher version of Hive? Please
> let me know your thoughts.
>
> Thanks,
> Ravi
>
>
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 06:18 AM
> Subject: Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> What version of Hive are you using? You should make sure date is supported
> there.
>
> rb
>
> On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]>
> wrote:
>
> > Hello Ryan:
> >
> > Many thanks for the reply. I see that the text attachment containing my
> > test program was not sent to the mailing list, but got filtered out.
> > Hence, I am copying the program code below:
> >
> > =================================================================
> > import java.io.IOException;
> > import java.util.*;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.avro.Schema;
> > import org.apache.avro.Schema.Type;
> > import org.apache.avro.Schema.Field;
> > import org.apache.avro.generic.*;
> > import org.apache.avro.LogicalTypes;
> > import org.apache.avro.LogicalTypes.*;
> > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > import parquet.avro.*;
> >
> > public class pqtw {
> >
> >     public static Schema makeSchema() {
> >         List<Field> fields = new ArrayList<Field>();
> >         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> >         fields.add(new Field("age", Schema.create(Type.INT), null, null));
> >
> >         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> >         fields.add(new Field("doj", date, null, null));
> >
> >         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> >         schema.setFields(fields);
> >
> >         return schema;
> >     }
> >
> >     public static GenericData.Record makeRecord(Schema schema, String name,
> >             int age, int doj) {
> >         GenericData.Record record = new GenericData.Record(schema);
> >         record.put("name", name);
> >         record.put("age", age);
> >         record.put("doj", doj);
> >         return record;
> >     }
> >
> >     public static void main(String[] args) throws IOException,
> >             InterruptedException, ClassNotFoundException {
> >
> >         String pqfile = "/tmp/pqtfile1";
> >
> >         try {
> >             Configuration conf = new Configuration();
> >             FileSystem fs = FileSystem.getLocal(conf);
> >
> >             Schema schema = makeSchema();
> >             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> >             AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile),
> >                     schema);
> >             writer.write(rec);
> >             writer.close();
> >         }
> >         catch (Exception e)
> >         {
> >             e.printStackTrace();
> >         }
> >     }
> > }
> > =================================================================
> >
> > With the above logic, I could write the data to the parquet-file. However,
> > when I load the same into a hive-table & select columns, I could select
> > the "name" and "age" columns (i.e., the VARCHAR, INT columns) successfully,
> > but the select of the "date" column failed with the error given below:
> >
> > --------------------------------------------------------------------------------
> > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS
> > PARQUET;
> > OK
> > Time taken: 0.369 seconds
> > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > hive> SELECT name,age from PT1;
> > OK
> > abcd    21
> > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > hive> SELECT doj from PT1;
> > OK
> > Failed with exception
> > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > Time taken: 0.167 seconds
> > hive>
> > --------------------------------------------------------------------------------
> >
> > Basically, for the "date" datatype, I am trying to pass an integer value
> > (the number of days from the Unix epoch, 1 January 1970, so that the date
> > falls somewhere around 2011, etc.). Is this the correct approach to process
> > date data (or is there another approach / API to do it)? Could you please
> > let me know your inputs in this regard?
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ryan Blue <[email protected]>
> > To: Parquet Dev <[email protected]>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date: 03/09/2016 10:48 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Hi Ravi,
> >
> > Not all of the types are fully implemented yet.
> > I think Hive only has partial support. If I remember correctly:
> > * Decimal is supported if the backing primitive type is fixed-length
> >   binary
> > * Date and Timestamp are supported, but Time has not been implemented yet
> >
> > For object models you can build applications on (instead of those embedded
> > in SQL), only Avro objects can support those types, through its
> > LogicalTypes API. That API has been implemented in parquet-avro, but not
> > yet committed. I would like for this feature to make it into 1.9.0. If you
> > want to test in the meantime, check out the pull request:
> >
> > https://github.com/apache/parquet-mr/pull/318
> >
> > rb
> >
> > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]>
> > wrote:
> >
> > > Hello,
> > >
> > > I am Ravi Tatapudi, from IBM-India. I am working on a simple test tool
> > > that writes data to Parquet-files, which can be imported into
> > > hive-tables. Please find attached a sample program, which writes a
> > > simple parquet data file.
> > >
> > > Using the above program, I could create parquet-files with the
> > > data types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all
> > > data types supported by org.apache.avro.Schema.Type) & load them into
> > > Hive tables successfully.
> > >
> > > Now, I am trying to figure out how to write "date, timestamp, decimal"
> > > data into parquet-files. In this context, I request you to provide the
> > > possible options (and/or a sample program, if any) in this regard.
> > >
> > > Thanks,
> > > Ravi
> > >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
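
[Editor's note: Ravi's integer encoding is in fact what the `date` logical type expects, namely the day count since 1970-01-01; the `ClassCastException` above arises because the writer he used did not attach the logical-type annotation, which is what the pull request Ryan links addresses. A minimal pure-JDK sketch of the day-count round trip (the class name `EpochDays` is illustrative, not a parquet-mr API):]

```java
import java.time.LocalDate;

public class EpochDays {
    // The Avro/Parquet `date` logical type annotates an INT whose value is
    // the number of days since the Unix epoch (1970-01-01). java.time can
    // produce and consume that value directly.
    static int toEpochDays(LocalDate d) {
        return (int) d.toEpochDay();
    }

    public static void main(String[] args) {
        // The sample record above stores 15000, which decodes to a 2011 date:
        System.out.println(LocalDate.ofEpochDay(15000));            // 2011-01-26
        System.out.println(toEpochDays(LocalDate.of(2011, 1, 26))); // 15000
    }
}
```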
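[Editor's note: Ryan's remark that decimal needs a fixed-length binary backing refers to Parquet's DECIMAL layout: the unscaled value stored as a big-endian two's-complement integer, sign-extended to the fixed size. A pure-JDK sketch of that byte layout (class and method names are illustrative, not parquet-mr API):]

```java
import java.math.BigDecimal;
import java.util.Arrays;

public class DecimalBytes {
    // Encode a decimal's unscaled value as big-endian two's-complement,
    // sign-extended to a fixed width -- the layout Parquet uses for DECIMAL
    // backed by FIXED_LEN_BYTE_ARRAY. `size` must be large enough to hold
    // the declared precision.
    static byte[] toFixedBytes(BigDecimal value, int scale, int size) {
        byte[] unscaled = value.setScale(scale).unscaledValue().toByteArray();
        byte[] out = new byte[size];
        // Sign-extend: fill the leading bytes with 0xFF for negative values
        byte pad = (byte) (value.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, size - unscaled.length, pad);
        System.arraycopy(unscaled, 0, out, size - unscaled.length, unscaled.length);
        return out;
    }

    public static void main(String[] args) {
        // decimal(9,2): 123.45 -> unscaled 12345 (0x3039), padded to 4 bytes
        byte[] b = toFixedBytes(new BigDecimal("123.45"), 2, 4);
        System.out.println(Arrays.toString(b)); // [0, 0, 48, 57]
    }
}
```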
