Hello Ryan:

Regarding the support for "date, timestamp, decimal" data types in Parquet files:
In your earlier mail, you mentioned that the pull request
https://github.com/apache/parquet-mr/pull/318 has the necessary support for
these data types, and that it would be released as part of parquet-avro 1.9.0.
I see that this fix is included in build #1247 (and above?). How can I get that
build (or the latest build) that includes the "parquet-avro" JAR file with the
support for "date, timestamp", etc.? Could you please let me know.

Thanks,
Ravi

From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
Date: 03/14/2016 09:56 PM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files

Ravi,

Support for those types in parquet-avro hasn't been committed yet. It's
implemented in the branch I pointed you to. If you want to use released
versions, it should be out in 1.9.0.

rb

On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <[email protected]> wrote:
> Hello Ryan:
>
> Thanks for the inputs.
>
> I am building and running the test application primarily with the
> following JAR files (for the Avro, Parquet-Avro, and Hive APIs):
>
> 1) avro-1.8.0.jar
> 2) parquet-avro-1.6.0.jar (the latest one found in the Maven repository:
>    http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> 3) hive-exec-1.2.1.jar
>
> Am I supposed to build/run the test using a different version of these
> JAR files? Could you please let me know.
>
> Thanks,
> Ravi
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 10:54 PM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> Yes, it is supported in 1.2.1. It went in here:
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> Are you using a version of Parquet with that pull request in it?
> Also, if you're using CDH this may not work.
>
> rb
>
> On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <[email protected]> wrote:
>
> > Hello Ryan:
> >
> > I am using hive-version 1.2.1, as indicated below:
> >
> > --------------------------------------
> > $ hive --version
> > Hive 1.2.1
> > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > From source with checksum ab480aca41b24a9c3751b8c023338231
> > $
> > --------------------------------------
> >
> > As I understand it, this version of Hive supports the "date" data type,
> > right? Do you want me to re-test using a higher version of Hive? Please
> > let me know your thoughts.
> >
> > Thanks,
> > Ravi
> >
> > From: Ryan Blue <[email protected]>
> > To: Parquet Dev <[email protected]>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 06:18 AM
> > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> >
> > What version of Hive are you using? You should make sure date is
> > supported there.
> >
> > rb
> >
> > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]> wrote:
> >
> > > Hello Ryan:
> > >
> > > Many thanks for the reply. I see that the text attachment containing my
> > > test program was not sent to the mail group, but got filtered out.
> > > Hence, copying the program code below:
> > >
> > > =================================================================
> > > import java.io.IOException;
> > > import java.util.*;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.Schema.Type;
> > > import org.apache.avro.Schema.Field;
> > > import org.apache.avro.generic.*;
> > > import org.apache.avro.LogicalTypes;
> > > import org.apache.avro.LogicalTypes.*;
> > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > import parquet.avro.*;
> > >
> > > public class pqtw {
> > >
> > >     public static Schema makeSchema() {
> > >         List<Field> fields = new ArrayList<Field>();
> > >         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > >         fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > >
> > >         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> > >         fields.add(new Field("doj", date, null, null));
> > >
> > >         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > >         schema.setFields(fields);
> > >
> > >         return schema;
> > >     }
> > >
> > >     public static GenericData.Record makeRecord(Schema schema, String name,
> > >             int age, int doj) {
> > >         GenericData.Record record = new GenericData.Record(schema);
> > >         record.put("name", name);
> > >         record.put("age", age);
> > >         record.put("doj", doj);
> > >         return record;
> > >     }
> > >
> > >     public static void main(String[] args) throws IOException,
> > >             InterruptedException, ClassNotFoundException {
> > >
> > >         String pqfile = "/tmp/pqtfile1";
> > >
> > >         try {
> > >             Configuration conf = new Configuration();
> > >             FileSystem fs = FileSystem.getLocal(conf);
> > >
> > >             Schema schema = makeSchema();
> > >             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> > >             AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > >             writer.write(rec);
> > >             writer.close();
> > >         }
> > >         catch (Exception e) {
> > >             e.printStackTrace();
> > >         }
> > >     }
> > > }
> > > =================================================================
> > >
> > > With the above logic, I could write the data to the parquet file. However,
> > > when I load it into a hive table and select columns, the "name" and "age"
> > > columns (i.e., VARCHAR, INT) are selected successfully, but selecting the
> > > "date" column fails with the error given below:
> > >
> > > --------------------------------------------------------------------------------
> > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
> > > OK
> > > Time taken: 0.369 seconds
> > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > hive> SELECT name,age from PT1;
> > > OK
> > > abcd 21
> > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > hive> SELECT doj from PT1;
> > > OK
> > > Failed with exception
> > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > Time taken: 0.167 seconds
> > > hive>
> > > --------------------------------------------------------------------------------
> > >
> > > Basically, for the "date" data type, I am passing an integer value (the
> > > number of days since the Unix epoch, 1 January 1970, chosen so that the
> > > date falls somewhere around 2011). Is this the correct approach to
> > > process date data, or is there another approach/API for it? Could you
> > > please let me know your inputs in this regard?
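The days-since-epoch arithmetic used above can be checked with plain java.time, with no Parquet or Hive dependency. This is only an illustrative sketch of the encoding (the class and method names below are mine, not from the test program):

```java
import java.time.LocalDate;

public class EpochDayDemo {
    // The Avro/Parquet "date" logical type stores an int counting whole
    // days since the Unix epoch (1970-01-01).
    static int toEpochDays(LocalDate date) {
        return (int) date.toEpochDay();
    }

    // Convert the stored int back to a calendar date.
    static LocalDate fromEpochDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    public static void main(String[] args) {
        // 15000, the value written by the test program, is 2011-01-26.
        System.out.println(fromEpochDays(15000));                   // 2011-01-26
        System.out.println(toEpochDays(LocalDate.of(2011, 1, 26))); // 15000
    }
}
```

So the integer itself matches what the date logical type expects; the ClassCastException in the thread is consistent with the file having been written by a parquet-avro version that does not yet annotate the column as a date, leaving it a plain int32 for Hive to read.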
> > > Thanks,
> > > Ravi
> > >
> > > From: Ryan Blue <[email protected]>
> > > To: Parquet Dev <[email protected]>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > > Date: 03/09/2016 10:48 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> > >
> > > Hi Ravi,
> > >
> > > Not all of the types are fully implemented yet. I think Hive only has
> > > partial support. If I remember correctly:
> > > * Decimal is supported if the backing primitive type is fixed-length binary
> > > * Date and Timestamp are supported, but Time has not been implemented yet
> > >
> > > For object models you can build applications on (instead of those embedded
> > > in SQL), only Avro objects can support those types through its LogicalTypes
> > > API. That API has been implemented in parquet-avro, but not yet committed.
> > > I would like for this feature to make it into 1.9.0. If you want to test
> > > in the mean time, check out the pull request:
> > >
> > > https://github.com/apache/parquet-mr/pull/318
> > >
> > > rb
> > >
> > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple test tool
> > > > that writes data to Parquet files, which can be imported into hive
> > > > tables. Pl. find attached a sample program, which writes a simple
> > > > parquet data file:
> > > >
> > > > Using the above program, I could create parquet files with the data
> > > > types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all data types
> > > > supported by org.apache.avro.Schema.Type) and load them into hive
> > > > tables successfully.
> > > >
> > > > Now, I am trying to figure out how to write "date, timestamp, decimal"
> > > > data into parquet-files.
> > > > In this context, I request you to provide the possible options (and/or
> > > > a sample program, if any) in this regard.
> > > >
> > > > Thanks,
> > > > Ravi
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
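For reference, the other two logical types discussed in this thread also have simple underlying representations in the Avro specification: timestamp-millis is a long counting milliseconds since the epoch, and decimal is an unscaled integer whose scale is fixed in the schema rather than stored with the data. A stdlib-only sketch of those encodings (names here are illustrative, with no Parquet or Avro dependency):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;

public class LogicalTypeEncodings {
    // timestamp-millis: milliseconds since 1970-01-01T00:00:00Z, as a long.
    static long toTimestampMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    // decimal(precision, scale): only the unscaled integer is stored in the
    // data; the scale lives in the schema. E.g. 12.34 at scale 2 -> 1234.
    static BigInteger toUnscaled(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue();
    }

    public static void main(String[] args) {
        System.out.println(toTimestampMillis(Instant.parse("1970-01-01T00:00:01Z"))); // 1000
        System.out.println(toUnscaled(new BigDecimal("12.34"), 2));                   // 1234
    }
}
```

As the thread notes, writing these through parquet-avro still requires the LogicalTypes support in pull request 318 (targeted at 1.9.0); the sketch only shows the raw values those types carry.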
