Hello Ryan:
Many thanks for the reply. I see that the text attachment containing my
test program was not sent to the mailing list but got filtered out, so I
am copying the program code below:
=================================================================
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;
import org.apache.avro.Schema.Type;
import org.apache.avro.generic.*;

import parquet.avro.*;

public class pqtw {

    // Builds a record schema with a string ("name"), an int ("age"), and
    // an int annotated with the Avro "date" logical type ("doj"), i.e.
    // days since the Unix epoch.
    public static Schema makeSchema() {
        List<Field> fields = new ArrayList<Field>();
        fields.add(new Field("name", Schema.create(Type.STRING), null, null));
        fields.add(new Field("age", Schema.create(Type.INT), null, null));
        Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
        fields.add(new Field("doj", date, null, null));
        Schema schema = Schema.createRecord("filecc", null, "parquet", false);
        schema.setFields(fields);
        return schema;
    }

    // Populates one generic record; "doj" is passed as an int day count.
    public static GenericData.Record makeRecord(Schema schema, String name,
                                                int age, int doj) {
        GenericData.Record record = new GenericData.Record(schema);
        record.put("name", name);
        record.put("age", age);
        record.put("doj", doj);
        return record;
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        String pqfile = "/tmp/pqtfile1";
        try {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);
            Schema schema = makeSchema();
            GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
            AvroParquetWriter<GenericData.Record> writer =
                new AvroParquetWriter<GenericData.Record>(new Path(pqfile),
                                                          schema);
            writer.write(rec);
            writer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
=================================================================
With the above logic, I could write the data to a Parquet file. However,
when I load the file into a Hive table and select columns, the "name" and
"age" columns (i.e., the VARCHAR and INT columns) are returned
successfully, but selecting the "doj" (DATE) column fails with the error
given below:
--------------------------------------------------------------------------------
hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
OK
Time taken: 0.369 seconds
hive> load data local inpath '/tmp/pqtfile1' into table PT1;
hive> SELECT name,age from PT1;
OK
abcd 21
Time taken: 0.311 seconds, Fetched: 1 row(s)
hive> SELECT doj from PT1;
OK
Failed with exception
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
cast to org.apache.hadoop.hive.serde2.io.DateWritable
Time taken: 0.167 seconds
hive>
--------------------------------------------------------------------------------
Basically, for "date datatype", I am trying to pass an integer-value (for
the # of days from Unix epoch, 1 January 1970, so that the date falls
somewhere around 2011..etc). Is this the correct approach to process date
data (or is there any other approach / API to do it) ? Could you please
let me know your inputs, in this regard ?
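For reference, here is a minimal sketch of how that day count can be
derived with java.time (assuming Java 8 is available; the class name
EpochDays and the sample date are only illustrations):
=================================================================
import java.time.LocalDate;

public class EpochDays {
    public static void main(String[] args) {
        // toEpochDay() returns the number of days since 1970-01-01;
        // 2011-01-26 corresponds to the value 15000 hard-coded in the
        // test program above.
        int doj = (int) LocalDate.of(2011, 1, 26).toEpochDay();
        System.out.println(doj); // prints 15000
    }
}
=================================================================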
Thanks,
Ravi
From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
Date: 03/09/2016 10:48 PM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
Hi Ravi,

Not all of the types are fully implemented yet. I think Hive only has
partial support. If I remember correctly:

* Decimal is supported if the backing primitive type is fixed-length
binary
* Date and Timestamp are supported, but Time has not been implemented yet

For object models you can build applications on (instead of those embedded
in SQL), only Avro objects can support those types, through Avro's
LogicalTypes API. That API has been implemented in parquet-avro, but not
yet committed.
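To illustrate, a minimal sketch of how such schemas can be declared
through the LogicalTypes API (the class/field names, the 8-byte fixed
size, and the precision/scale of 9,2 are placeholder choices, not
requirements):
=================================================================
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

public class LogicalTypeSchemas {
    public static void main(String[] args) {
        // DATE: an int annotated as days since the Unix epoch.
        Schema date =
            LogicalTypes.date().addToSchema(Schema.create(Type.INT));

        // TIMESTAMP (millis): a long annotated as milliseconds since
        // the epoch.
        Schema ts = LogicalTypes.timestampMillis()
            .addToSchema(Schema.create(Type.LONG));

        // DECIMAL: backed by a fixed-length binary, per the note above;
        // an 8-byte fixed easily holds precision 9.
        Schema dec = LogicalTypes.decimal(9, 2)
            .addToSchema(Schema.createFixed("dec_9_2", null, null, 8));

        System.out.println(date);
        System.out.println(ts);
        System.out.println(dec);
    }
}
=================================================================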
I would like for this feature to make it into 1.9.0. If you want to test
it in the meantime, check out the pull request:
https://github.com/apache/parquet-mr/pull/318
rb
On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]>
wrote:
> Hello,
>
> I am Ravi Tatapudi, from IBM India. I am working on a simple test tool
> that writes data to Parquet files, which can be imported into Hive
> tables. Please find attached a sample program, which writes a simple
> Parquet data file:
>
>
>
> Using the above program, I could create Parquet files with the data
> types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all data types
> supported by "org.apache.avro.Schema.Type") and load them into Hive
> tables successfully.
>
> Now, I am trying to figure out how to write date, timestamp, and decimal
> data into Parquet files. In this context, could you please suggest the
> possible options (and/or a sample program, if any) in this regard?
>
> Thanks,
> Ravi
>
--
Ryan Blue
Software Engineer
Netflix