Thanks, Ryan, for the info.

Regards,
Ravi
From: Ryan Blue <[email protected]>
To: Parquet Dev <[email protected]>
Date: 04/05/2016 09:07 PM
Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files

Ravi,

The only breaking API changes were the renamed packages between 1.6.0 and
1.7.0. Other changes are binary compatible, and we have no plans to
deprecate the API you're using. As for the release date, I don't know yet;
we haven't closed out all of the 1.9.0 issues.

rb

On Tue, Apr 5, 2016 at 5:35 AM, Ravi Tatapudi <[email protected]> wrote:
> Hello Ryan:
>
> Regarding my question on compatibility between versions 1.6.0 and 1.8.2:
>
> My apologies for the confusion caused. After investigating further, I
> realized that the functionality is now split across different JARs. With
> version 1.6.0, I only included "parquet-avro-1.6.0.jar" when building and
> running the programs.
>
> Now I see that I should include parquet-avro-1.8.2.jar and
> parquet-hadoop-1.8.2.jar at build time, and additionally
> parquet-format-2.3.1.jar, parquet-column-1.8.2.jar,
> parquet-common-1.8.2.jar, and parquet-encoding-1.8.2.jar for running the
> programs. After doing that, I could build my old applications
> successfully (of course, I had to change some import statements from
> "import parquet.avro" to "import org.apache.parquet.avro", etc.) and run
> the tests successfully.
>
> So, my outstanding queries are:
>
> 1) I believe all my tests are now using the deprecated API for
> AvroParquetWriter. If you have a sample program using the latest
> approach, I request you to point me to it.
> 2) If you are aware of any approximate date (or month) when
> parquet-avro-1.9.0 (or any other parquet-avro-1.8.x that would include
> this fix) would be officially released (for example, by June 2016,
> December 2016, or later), please let me know. It would be very helpful
> for my planning.
>
> Many thanks for your support and help in this regard.
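The non-deprecated write path asked about above is the builder API that came with the `org.apache.parquet` package rename. A minimal sketch against parquet-avro 1.8.x might look like the following; it needs the Parquet, Avro, and Hadoop jars on the classpath, and the file path and record contents are illustrative:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BuilderExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"filecc\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("name", "Person A");
        rec.put("age", 21);

        // Builder-based API from the org.apache.parquet packages; the
        // 1.6.0-style constructors still work but are marked deprecated.
        // write() and close() are unchanged on the resulting ParquetWriter.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/pqtfile1"))
                     .withSchema(schema)
                     .withConf(new Configuration())
                     .build()) {
            writer.write(rec);
        }
    }
}
```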
> Thanks,
> Ravi
>
>
> From: Ravi Tatapudi/India/IBM
> To: [email protected]
> Date: 04/05/2016 04:29 PM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> Hello Ryan:
>
> I have downloaded the source via the pull-request URL
> https://github.com/apache/parquet-mr/pull/318 (did a fork and downloaded
> the source ZIP file) and built it using Maven. The build completed
> successfully and produced "parquet-avro-1.8.2-SNAPSHOT.jar". When I tried
> to verify the "date" data type using this JAR, I realized that my
> existing test programs fail to build against it.
>
> So far, I have built (and run) my test programs using
> "parquet-avro-1.6.0.jar". Now, when I try to rebuild them using
> "parquet-avro-1.8.2-SNAPSHOT.jar", the builds fail. After going through
> the source code, I realized there are many API changes between 1.6.0 and
> 1.8.2, because of which the sample programs that built with 1.6.0 no
> longer build. (It looks like AvroParquetWriter no longer has the
> "write" and "close" methods, etc., but uses some other approach. Do you
> know why these methods were removed completely and made incompatible with
> parquet-avro-1.6.0?)
>
> Please find below a sample Parquet-write program, which now fails to
> build with "parquet-avro-1.8.2-SNAPSHOT.jar". Do you have any sample
> Parquet-write program that works with "parquet-avro-1.8.2.jar" (to write
> primitive data types such as "int", "char", etc. to a Parquet file, as
> shown in the example below)? If yes, could you please point me to it?
>
> =================================================================
> public static Schema makeSchema() {
>     List<Field> fields = new ArrayList<Field>();
>     fields.add(new Field("name", Schema.create(Type.STRING), null, null));
>     fields.add(new Field("age", Schema.create(Type.INT), null, null));
>     fields.add(new Field("dept", Schema.create(Type.STRING), null, null));
>
>     Schema schema = Schema.createRecord("filecc", null, "parquet", false);
>     schema.setFields(fields);
>     return(schema);
> }
>
> public static GenericData.Record makeRecord(Schema schema, String name,
>         int age, String dept) {
>     GenericData.Record record = new GenericData.Record(schema);
>     record.put("name", name);
>     record.put("age", age);
>     record.put("dept", dept);
>     return(record);
> }
>
> public static void main(String[] args) throws IOException,
>         InterruptedException, ClassNotFoundException {
>
>     String pqfile = "/tmp/pqtfile1";
>     try {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.getLocal(conf);
>
>         Schema schema = makeSchema();
>         GenericData.Record rec = makeRecord(schema, "Person A", 21, "ED2");
>         AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
>         writer.write(rec);
>         writer.close();
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
> =================================================================
>
> Thanks,
> Ravi
>
>
> From: Ravi Tatapudi/India/IBM
> To: [email protected]
> Date: 04/05/2016 10:53 AM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> Hello Ryan:
>
> Many thanks for the inputs. I will try to build it today and see how it
> goes.
>
> Could you please let me know any approximate date (or month) when
> parquet-avro-1.9.0 (or any other parquet-avro-1.8.x that would include
> this fix) would be officially released (for example, by June 2016,
> December 2016, or later)? It would be very helpful for my planning.
> Thanks,
> Ravi
>
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Date: 04/04/2016 10:05 PM
> Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
>
> I don't think you can get the artifacts produced by our CI builds, but you
> can check out the branch and build it using instructions in the
> repository.
>
> On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <[email protected]> wrote:
>
> > Hello Ryan:
> >
> > Regarding the support for the "date, timestamp, decimal" data types in
> > Parquet files:
> >
> > In your earlier mail, you mentioned that the pull request
> > https://github.com/apache/parquet-mr/pull/318 has the necessary support
> > for these data types (and that it would be released as part of
> > parquet-avro release 1.9.0).
> >
> > I see that this fix is included in build #1247 (and above?). How can I
> > get this build (or the latest build) that includes the "parquet-avro"
> > JAR with support for "date", "timestamp", etc.? Could you please let me
> > know.
> >
> > Thanks,
> > Ravi
> >
> >
> > From: Ryan Blue <[email protected]>
> > To: Parquet Dev <[email protected]>
> > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > Date: 03/14/2016 09:56 PM
> > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> >
> > Ravi,
> >
> > Support for those types in parquet-avro hasn't been committed yet. It's
> > implemented in the branch I pointed you to. If you want to use released
> > versions, it should be out in 1.9.0.
> >
> > rb
> >
> > On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <[email protected]> wrote:
> >
> > > Hello Ryan:
> > >
> > > Thanks for the inputs.
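Ryan's "check out the branch and build it" step might look like the following sketch; the local branch name is illustrative, and these are not the repository's official build instructions:

```shell
# Sketch: fetch and build the logical-types pull request (#318) locally.
git clone https://github.com/apache/parquet-mr.git
cd parquet-mr
git fetch origin pull/318/head:avro-logical-types
git checkout avro-logical-types

# Build and install the SNAPSHOT jars into the local Maven repository.
mvn clean install -DskipTests
```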
> > >
> > > I am building and running the test application primarily using the
> > > following JAR files (for the Avro, Parquet-Avro, and Hive APIs):
> > >
> > > 1) avro-1.8.0.jar
> > > 2) parquet-avro-1.6.0.jar (the latest one found in the Maven
> > >    repository: http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > > 3) hive-exec-1.2.1.jar
> > >
> > > Am I supposed to build/run the test using different versions of these
> > > JAR files? Could you please let me know.
> > >
> > > Thanks,
> > > Ravi
> > >
> > >
> > > From: Ryan Blue <[email protected]>
> > > To: Parquet Dev <[email protected]>
> > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > > Date: 03/11/2016 10:54 PM
> > > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> > >
> > > Yes, it is supported in 1.2.1. It went in here:
> > >
> > > https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
> > >
> > > Are you using a version of Parquet with that pull request in it? Also,
> > > if you're using CDH this may not work.
> > >
> > > rb
> > >
> > > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <[email protected]> wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > I am using Hive version 1.2.1, as indicated below:
> > > >
> > > > --------------------------------------
> > > > $ hive --version
> > > > Hive 1.2.1
> > > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > > $
> > > > --------------------------------------
> > > >
> > > > As I understand it, this version of Hive supports the "date" data
> > > > type, right? Do you want me to re-test using any higher version of
> > > > Hive? Please
> > > > let me know your thoughts.
> > > >
> > > > Thanks,
> > > > Ravi
> > > >
> > > >
> > > > From: Ryan Blue <[email protected]>
> > > > To: Parquet Dev <[email protected]>
> > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > > > Date: 03/11/2016 06:18 AM
> > > > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> > > >
> > > > What version of Hive are you using? You should make sure date is
> > > > supported there.
> > > >
> > > > rb
> > > >
> > > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]> wrote:
> > > >
> > > > > Hello Ryan:
> > > > >
> > > > > Many thanks for the reply. I see that the text attachment containing
> > > > > my test program was not sent to the mail group but was filtered out.
> > > > > Hence, I am copying the program code below:
> > > > >
> > > > > =================================================================
> > > > > import java.io.IOException;
> > > > > import java.util.*;
> > > > > import org.apache.hadoop.conf.Configuration;
> > > > > import org.apache.hadoop.fs.FileSystem;
> > > > > import org.apache.hadoop.fs.Path;
> > > > > import org.apache.avro.Schema;
> > > > > import org.apache.avro.Schema.Type;
> > > > > import org.apache.avro.Schema.Field;
> > > > > import org.apache.avro.generic.*;
> > > > > import org.apache.avro.LogicalTypes;
> > > > > import org.apache.avro.LogicalTypes.*;
> > > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > > import parquet.avro.*;
> > > > >
> > > > > public class pqtw {
> > > > >
> > > > >     public static Schema makeSchema() {
> > > > >         List<Field> fields = new ArrayList<Field>();
> > > > >         fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > > > >         fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > > > >
> > > > >         Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> > > > >         fields.add(new Field("doj", date, null, null));
> > > > >
> > > > >         Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > > > >         schema.setFields(fields);
> > > > >         return(schema);
> > > > >     }
> > > > >
> > > > >     public static GenericData.Record makeRecord(Schema schema, String name,
> > > > >             int age, int doj) {
> > > > >         GenericData.Record record = new GenericData.Record(schema);
> > > > >         record.put("name", name);
> > > > >         record.put("age", age);
> > > > >         record.put("doj", doj);
> > > > >         return(record);
> > > > >     }
> > > > >
> > > > >     public static void main(String[] args) throws IOException,
> > > > >             InterruptedException, ClassNotFoundException {
> > > > >
> > > > >         String pqfile = "/tmp/pqtfile1";
> > > > >
> > > > >         try {
> > > > >             Configuration conf = new Configuration();
> > > > >             FileSystem fs = FileSystem.getLocal(conf);
> > > > >
> > > > >             Schema schema = makeSchema();
> > > > >             GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> > > > >             AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > > > >             writer.write(rec);
> > > > >             writer.close();
> > > > >         }
> > > > >         catch (Exception e)
> > > > >         {
> > > > >             e.printStackTrace();
> > > > >         }
> > > > >     }
> > > > > }
> > > > > =================================================================
> > > > >
> > > > > With the above logic, I could write the data to the parquet file.
> > > > > However, when I load the same into a Hive table and select
> > > > > columns, I could select the "name" and "age" columns (i.e., the
> > > > > VARCHAR and INT columns) successfully, but selecting the "date"
> > > > > column failed with the error given below:
> > > > >
> > > > > --------------------------------------------------------------------------------
> > > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
> > > > > OK
> > > > > Time taken: 0.369 seconds
> > > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > > hive> SELECT name,age from PT1;
> > > > > OK
> > > > > abcd        21
> > > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > > hive> SELECT doj from PT1;
> > > > > OK
> > > > > Failed with exception
> > > > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot
> > > > > be cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > > Time taken: 0.167 seconds
> > > > > hive>
> > > > > --------------------------------------------------------------------------------
> > > > >
> > > > > Basically, for the "date" data type, I am trying to pass an integer
> > > > > value (the number of days since the Unix epoch, 1 January 1970, so
> > > > > that the date falls somewhere around 2011). Is this the correct
> > > > > approach to process date data (or is there another approach/API to
> > > > > do it)? Could you please let me know your inputs in this regard?
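Days since the Unix epoch is indeed how Avro's `date` logical type encodes an INT. As a sanity check on the value 15000 used in the program above, a small self-contained sketch using `java.time` (the class name `EpochDays` is illustrative):

```java
import java.time.LocalDate;

public class EpochDays {
    // Avro's "date" logical type annotates an INT holding the number of
    // days since the Unix epoch (1970-01-01).
    public static int daysSinceEpoch(LocalDate date) {
        return (int) date.toEpochDay();
    }

    public static void main(String[] args) {
        // 15000 days after 1970-01-01 is 2011-01-26.
        System.out.println(LocalDate.ofEpochDay(15000));               // 2011-01-26
        System.out.println(daysSinceEpoch(LocalDate.of(2011, 1, 26))); // 15000
    }
}
```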
> > > > >
> > > > > Thanks,
> > > > > Ravi
> > > > >
> > > > >
> > > > > From: Ryan Blue <[email protected]>
> > > > > To: Parquet Dev <[email protected]>
> > > > > Cc: Nagesh R Charka/India/IBM@IBMIN, Srinivas Mudigonda/India/IBM@IBMIN
> > > > > Date: 03/09/2016 10:48 PM
> > > > > Subject: Re: How to write "date, timestamp, decimal" data to Parquet-files
> > > > >
> > > > > Hi Ravi,
> > > > >
> > > > > Not all of the types are fully implemented yet. I think Hive only
> > > > > has partial support. If I remember correctly:
> > > > > * Decimal is supported if the backing primitive type is
> > > > >   fixed-length binary
> > > > > * Date and Timestamp are supported, but Time has not been
> > > > >   implemented yet
> > > > >
> > > > > For object models you can build applications on (instead of those
> > > > > embedded in SQL), only Avro objects can support those types through
> > > > > its LogicalTypes API. That API has been implemented in parquet-avro
> > > > > but not yet committed. I would like this feature to make it into
> > > > > 1.9.0. If you want to test in the meantime, check out the pull
> > > > > request:
> > > > >
> > > > > https://github.com/apache/parquet-mr/pull/318
> > > > >
> > > > > rb
> > > > >
> > > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I am Ravi Tatapudi, from IBM India. I am working on a simple test
> > > > > > tool that writes data to Parquet files, which can be imported
> > > > > > into Hive tables. Please
> > > > > > find attached a sample program, which writes a simple Parquet
> > > > > > data file.
> > > > > >
> > > > > > Using the above program, I could create Parquet files with the
> > > > > > data types INT, LONG, STRING, BOOLEAN, etc. (basically all data
> > > > > > types supported by org.apache.avro.Schema.Type) and load them
> > > > > > into Hive tables successfully.
> > > > > >
> > > > > > Now, I am trying to figure out how to write "date, timestamp,
> > > > > > decimal" data into Parquet files. In this context, I request you
> > > > > > to provide the possible options (and/or a sample program, if
> > > > > > any) in this regard.
> > > > > >
> > > > > > Thanks,
> > > > > > Ravi
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix

--
Ryan Blue
Software Engineer
Netflix
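For reference, the logical types Ryan describes above are expressed as annotations on an Avro schema. A hedged sketch of what such a schema might look like once the LogicalTypes support lands (field names, fixed size, precision, and scale are all illustrative):

```json
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "doj",
     "type": {"type": "int", "logicalType": "date"}},
    {"name": "updated_at",
     "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "salary",
     "type": {"type": "fixed", "name": "salary_dec", "size": 8,
              "logicalType": "decimal", "precision": 18, "scale": 2}}
  ]
}
```

The decimal annotation is placed on a `fixed` type here because, per the thread, Hive's decimal support expects a fixed-length binary backing type.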
