Hello Ryan:
Many thanks for the reply. I see that the text attachment containing my
test program was not sent to the mailing list but got filtered out, so I
am copying the program code below:
=================================================================
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;
import org.apache.avro.generic.*;

import parquet.avro.*;

public class pqtw {

    // Builds a record schema with a string ("name"), an int ("age"), and
    // an int annotated with the Avro "date" logical type ("doj"), i.e.
    // days since the Unix epoch.
    public static Schema makeSchema() {
        List<Field> fields = new ArrayList<Field>();
        fields.add(new Field("name", Schema.create(Type.STRING), null, null));
        fields.add(new Field("age", Schema.create(Type.INT), null, null));
        Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
        fields.add(new Field("doj", date, null, null));
        Schema schema = Schema.createRecord("filecc", null, "parquet", false);
        schema.setFields(fields);
        return schema;
    }

    // Populates one generic record; "doj" is passed as an int day count.
    public static GenericData.Record makeRecord(Schema schema, String name,
                                                int age, int doj) {
        GenericData.Record record = new GenericData.Record(schema);
        record.put("name", name);
        record.put("age", age);
        record.put("doj", doj);
        return record;
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        String pqfile = "/tmp/pqtfile1";
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);
            Schema schema = makeSchema();
            GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
            AvroParquetWriter<GenericData.Record> writer =
                new AvroParquetWriter<GenericData.Record>(new Path(pqfile),
                                                          schema);
            writer.write(rec);
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
=================================================================
With the above logic, I could write the data to a Parquet file. However,
when I load the file into a Hive table and select columns, the "name" and
"age" columns (i.e., the VARCHAR and INT columns) are returned
successfully, but selecting the "doj" (DATE) column fails with the error
given below:
--------------------------------------------------------------------------------
hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
OK
Time taken: 0.369 seconds
hive> load data local inpath '/tmp/pqtfile1' into table PT1;
hive> SELECT name,age from PT1;
OK
abcd 21
Time taken: 0.311 seconds, Fetched: 1 row(s)
hive> SELECT doj from PT1;
OK
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.DateWritable
Time taken: 0.167 seconds
hive>
--------------------------------------------------------------------------------
Basically, for "date datatype", I am trying to pass an integer-value (for
the # of days from Unix epoch, 1 January 1970, so that the date falls
somewhere around 2011..etc). Is this the correct approach to process date
data (or is there any other approach / API to do it) ? Could you please
let me know your inputs, in this regard ?
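For reference, here is a minimal sketch of how that day count can be
derived with java.time (assuming Java 8 is available; the class name
EpochDays and the sample date are only illustrations):
=================================================================
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // toEpochDay() returns the number of days since 1970-01-01;
        // 2011-01-26 corresponds to the value 15000 hard-coded in the
        // test program above.
        int doj = (int) LocalDate.of(2011, 1, 26).toEpochDay();
        System.out.println(doj); // prints 15000
    }
}
=================================================================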
Thanks,
Ravi
From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
Date: 03/09/2016 10:48 PM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
Hi Ravi,

Not all of the types are fully implemented yet. I think Hive only has
partial support. If I remember correctly:

* Decimal is supported if the backing primitive type is fixed-length
binary
* Date and Timestamp are supported, but Time has not been implemented yet

For object models you can build applications on (instead of those embedded
in SQL), only Avro objects can support those types, through Avro's
LogicalTypes API. That API has been implemented in parquet-avro, but not
yet committed.
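To illustrate, a minimal sketch of how such schemas can be declared
through the LogicalTypes API (the class/field names, the 8-byte fixed
size, and the precision/scale of 9,2 are placeholder choices, not
requirements):
=================================================================
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

public class LogicalTypeSchemas {
    public static void main(String[] args) {
        // DATE: an int annotated as days since the Unix epoch.
        Schema date =
            LogicalTypes.date().addToSchema(Schema.create(Type.INT));

        // TIMESTAMP (millis): a long annotated as milliseconds since
        // the epoch.
        Schema ts = LogicalTypes.timestampMillis()
            .addToSchema(Schema.create(Type.LONG));

        // DECIMAL: backed by a fixed-length binary, per the note above;
        // an 8-byte fixed easily holds precision 9.
        Schema dec = LogicalTypes.decimal(9, 2)
            .addToSchema(Schema.createFixed("dec_9_2", null, null, 8));

        System.out.println(date);
        System.out.println(ts);
        System.out.println(dec);
    }
}
=================================================================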
I would like for this feature to make it into 1.9.0. If you want to test
it in the meantime, check out the pull request:
https://github.com/apache/parquet-mr/pull/318
rb
On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]>
wrote:
> Hello,
>
> I am Ravi Tatapudi, from IBM India. I am working on a simple test tool
> that writes data to Parquet files, which can be imported into Hive
> tables. Please find attached a sample program, which writes a simple
> Parquet data file:
>
>
>
> Using the above program, I could create Parquet files with the data
> types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all data types
> supported by "org.apache.avro.Schema.Type") and load them into Hive
> tables successfully.
>
> Now, I am trying to figure out how to write date, timestamp, and decimal
> data into Parquet files. In this context, could you please suggest the
> possible options (and/or a sample program, if any) in this regard?
>
> Thanks,
> Ravi
>
--
Ryan Blue
Software Engineer
Netflix