Hello Ryan:

Regarding the support for "date, timestamp, decimal" data types in Parquet files:
In your earlier mail, you mentioned that the pull request
https://github.com/apache/parquet-mr/pull/318 has the necessary support for
these data types, and that it would be released as part of parquet-avro 1.9.0.
I see that this fix is included in build #1247 (and above?). How can I get that
build (or the latest build) that includes the "parquet-avro" JAR file with the
support for "date, timestamp", etc.? Could you please let me know.

Thanks,
Ravi

From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
Date: 03/14/2016 09:56 PM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files

Ravi,

Support for those types in parquet-avro hasn't been committed yet. It's
implemented in the branch I pointed you to. If you want to use released
versions, it should be out in 1.9.0.

rb

On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <[email protected]> wrote:
> Hello Ryan:
>
> Thanks for the inputs.
>
> I am building and running the test application primarily with the
> following JAR files (for the Avro, Parquet-Avro, and Hive APIs):
>
> 1) avro-1.8.0.jar
> 2) parquet-avro-1.6.0.jar (the latest one found in the Maven repository:
>    http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> 3) hive-exec-1.2.1.jar
>
> Am I supposed to build/run the test using a different version of these
> JAR files? Could you please let me know.
>
> Thanks,
> Ravi
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> Date: 03/11/2016 10:54 PM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> Yes, it is supported in 1.2.1. It went in here:
>
> https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
>
> Are you using a version of Parquet with that pull request in it?
> Also, if you're using CDH this may not work.
>
> rb
>
> On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <[email protected]> wrote:
>
> > Hello Ryan:
> >
> > I am using hive-version 1.2.1, as indicated below:
> >
> > --------------------------------------
> > $ hive --version
> > Hive 1.2.1
> > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > From source with checksum ab480aca41b24a9c3751b8c023338231
> > $
> > --------------------------------------
> >
> > As I understand it, this version of Hive supports the "date" data type,
> > right? Do you want me to re-test using a higher version of Hive? Please
> > let me know your thoughts.
> >
> > Thanks,
> > Ravi
> >
> > From: Ryan Blue <[email protected]>
> > To: Parquet Dev <[email protected]>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > Date: 03/11/2016 06:18 AM
> > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> >
> > What version of Hive are you using? You should make sure date is
> > supported there.
> >
> > rb
> >
> > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]> wrote:
> >
> > > Hello Ryan:
> > >
> > > Many thanks for the reply. I see that the text attachment containing my
> > > test program was not sent to the mail group, but got filtered out.
> > > Hence, copying the program code below:
> > >
> > > =================================================================
> > > import java.io.IOException;
> > > import java.util.*;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.fs.FileSystem;
> > > import org.apache.hadoop.fs.Path;
> > > import org.apache.avro.Schema;
> > > import org.apache.avro.Schema.Type;
> > > import org.apache.avro.Schema.Field;
> > > import org.apache.avro.generic.*;
> > > import org.apache.avro.LogicalTypes;
> > > import org.apache.avro.LogicalTypes.*;
> > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > import parquet.avro.*;
> > >
> > > public class pqtw {
> > >
> > >     public static Schema makeSchema() {
> > >         List<Field> fields = new ArrayList<Field>();
> > >         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > >         fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > >
> > >         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> > >         fields.add(new Field("doj", date, null, null));
> > >
> > >         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > >         schema.setFields(fields);
> > >
> > >         return schema;
> > >     }
> > >
> > >     public static GenericData.Record makeRecord(Schema schema, String name,
> > >             int age, int doj) {
> > >         GenericData.Record record = new GenericData.Record(schema);
> > >         record.put("name", name);
> > >         record.put("age", age);
> > >         record.put("doj", doj);
> > >         return record;
> > >     }
> > >
> > >     public static void main(String[] args) throws IOException,
> > >             InterruptedException, ClassNotFoundException {
> > >
> > >         String pqfile = "/tmp/pqtfile1";
> > >
> > >         try {
> > >             Configuration conf = new Configuration();
> > >             FileSystem fs = FileSystem.getLocal(conf);
> > >
> > >             Schema schema = makeSchema();
> > >             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> > >             AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > >             writer.write(rec);
> > >             writer.close();
> > >         }
> > >         catch (Exception e) {
> > >             e.printStackTrace();
> > >         }
> > >     }
> > > }
> > > =================================================================
> > >
> > > With the above logic, I could write the data to the parquet file. However,
> > > when I load it into a hive table and select columns, the "name" and "age"
> > > columns (i.e., VARCHAR, INT) are selected successfully, but selecting the
> > > "date" column fails with the error given below:
> > >
> > > --------------------------------------------------------------------------------
> > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
> > > OK
> > > Time taken: 0.369 seconds
> > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > hive> SELECT name,age from PT1;
> > > OK
> > > abcd 21
> > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > hive> SELECT doj from PT1;
> > > OK
> > > Failed with exception
> > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be
> > > cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > Time taken: 0.167 seconds
> > > hive>
> > > --------------------------------------------------------------------------------
> > >
> > > Basically, for the "date" data type, I am passing an integer value (the
> > > number of days since the Unix epoch, 1 January 1970, chosen so that the
> > > date falls somewhere around 2011). Is this the correct approach to
> > > process date data, or is there another approach/API for it? Could you
> > > please let me know your inputs in this regard?
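The days-since-epoch arithmetic used above can be checked with plain java.time, with no Parquet or Hive dependency. This is only an illustrative sketch of the encoding (the class and method names below are mine, not from the test program):

```java
import java.time.LocalDate;

public class EpochDayDemo {
    // The Avro/Parquet "date" logical type stores an int counting whole
    // days since the Unix epoch (1970-01-01).
    static int toEpochDays(LocalDate date) {
        return (int) date.toEpochDay();
    }

    // Convert the stored int back to a calendar date.
    static LocalDate fromEpochDays(int days) {
        return LocalDate.ofEpochDay(days);
    }

    public static void main(String[] args) {
        // 15000, the value written by the test program, is 2011-01-26.
        System.out.println(fromEpochDays(15000));                   // 2011-01-26
        System.out.println(toEpochDays(LocalDate.of(2011, 1, 26))); // 15000
    }
}
```

So the integer itself matches what the date logical type expects; the ClassCastException in the thread is consistent with the file having been written by a parquet-avro version that does not yet annotate the column as a date, leaving it a plain int32 for Hive to read.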
> > > Thanks,
> > > Ravi
> > >
> > > From: Ryan Blue <[email protected]>
> > > To: Parquet Dev <[email protected]>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > > Date: 03/09/2016 10:48 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> > >
> > > Hi Ravi,
> > >
> > > Not all of the types are fully implemented yet. I think Hive only has
> > > partial support. If I remember correctly:
> > > * Decimal is supported if the backing primitive type is fixed-length binary
> > > * Date and Timestamp are supported, but Time has not been implemented yet
> > >
> > > For object models you can build applications on (instead of those embedded
> > > in SQL), only Avro objects can support those types through its LogicalTypes
> > > API. That API has been implemented in parquet-avro, but not yet committed.
> > > I would like for this feature to make it into 1.9.0. If you want to test
> > > in the mean time, check out the pull request:
> > >
> > > https://github.com/apache/parquet-mr/pull/318
> > >
> > > rb
> > >
> > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am Ravi Tatapudi, from IBM-India. I am working on a simple test tool
> > > > that writes data to Parquet files, which can be imported into hive
> > > > tables. Pl. find attached a sample program, which writes a simple
> > > > parquet data file:
> > > >
> > > > Using the above program, I could create parquet files with the data
> > > > types INT, LONG, STRING, BOOLEAN, etc. (i.e., basically all data types
> > > > supported by org.apache.avro.Schema.Type) and load them into hive
> > > > tables successfully.
> > > >
> > > > Now, I am trying to figure out how to write "date, timestamp, decimal"
> > > > data into parquet-files.
> > > > In this context, I request you to provide the possible options (and/or
> > > > a sample program, if any) in this regard.
> > > >
> > > > Thanks,
> > > > Ravi
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
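For reference, the other two logical types discussed in this thread also have simple underlying representations in the Avro specification: timestamp-millis is a long counting milliseconds since the epoch, and decimal is an unscaled integer whose scale is fixed in the schema rather than stored with the data. A stdlib-only sketch of those encodings (names here are illustrative, with no Parquet or Avro dependency):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.Instant;

public class LogicalTypeEncodings {
    // timestamp-millis: milliseconds since 1970-01-01T00:00:00Z, as a long.
    static long toTimestampMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    // decimal(precision, scale): only the unscaled integer is stored in the
    // data; the scale lives in the schema. E.g. 12.34 at scale 2 -> 1234.
    static BigInteger toUnscaled(BigDecimal value, int scale) {
        return value.setScale(scale).unscaledValue();
    }

    public static void main(String[] args) {
        System.out.println(toTimestampMillis(Instant.parse("1970-01-01T00:00:01Z"))); // 1000
        System.out.println(toUnscaled(new BigDecimal("12.34"), 2));                   // 1234
    }
}
```

As the thread notes, writing these through parquet-avro still requires the LogicalTypes support in pull request 318 (targeted at 1.9.0); the sketch only shows the raw values those types carry.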
