Hello Ryan:

Regarding my question on compatibility between versions 1.6.0 and 1.8.2:

My apologies for the confusion caused. After investigating further, I 
realized that the functionality is now split across different JARs. With 
version 1.6.0, I included only "parquet-avro-1.6.0.jar" when building 
and running the programs.

Now I see that I should include parquet-avro-1.8.2.jar and 
parquet-hadoop-1.8.2.jar at build time, plus parquet-format-2.3.1.jar, 
parquet-column-1.8.2.jar, parquet-common-1.8.2.jar, and 
parquet-encoding-1.8.2.jar for running the programs. After doing that, I 
could build my old applications successfully (of course, I had to change 
some of the import statements from "import parquet.avro" to "import 
org.apache.parquet.avro", etc.) and run the tests successfully.

So, my outstanding queries are:

1) I believe all my tests are now using the deprecated API for 
AvroParquetWriter. If you have a sample program using the latest 
approach, I request you to point me to it.
2) If you are aware of an approximate date (or month) when 
parquet-avro-1.9.0 (or any other parquet-avro-1.8.x release that would 
include this fix) would be officially released (for example, by June 
2016, December 2016, or later), please let me know. It would be very 
helpful for my planning.
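
(From skimming the 1.8.2 source, my best guess at the non-deprecated 
approach is the builder sketch below. This is unverified on my side, and 
the class/method names are my assumption from reading the code, so please 
correct me if it is wrong:)

```java
// Unverified sketch of the builder-based API in parquet-avro 1.8.x,
// pieced together from reading the source; requires the parquet and
// hadoop JARs on the classpath.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BuilderSketch {
    public static void writeOne(Schema schema, GenericRecord rec) throws Exception {
        // The builder appears to replace the deprecated public constructors.
        ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/pqtfile2"))
                .withSchema(schema)
                .build();
        try {
            writer.write(rec);
        } finally {
            writer.close();  // close() flushes the Parquet footer
        }
    }
}
```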

Many thanks for your support and help in this regard.

Thanks,
 Ravi



From:   Ravi Tatapudi/India/IBM
To:     [email protected]
Date:   04/05/2016 04:29 PM
Subject:        Re: How to write "date, timestamp, decimal" data to 
Parquet-files


Hello Ryan:

I downloaded the source via the pull request 
https://github.com/apache/parquet-mr/pull/318 (did a "fork" and 
downloaded the source ZIP file) and built it using Maven. The build 
completed successfully and produced "parquet-avro-1.8.2-SNAPSHOT.jar". 
However, when I tried to verify the "date" data type using this JAR, I 
found that my existing test programs fail to build against it.

So far, my test programs have been built (and run) using 
"parquet-avro-1.6.0.jar". When I try to rebuild them using 
"parquet-avro-1.8.2-SNAPSHOT.jar", the builds fail. After going through 
the source code, I realized that there are many API changes between 
1.6.0 and 1.8.2, which is why the sample programs that built with 1.6.0 
no longer compile. (It looks like AvroParquetWriter no longer exposes 
the "write", "close", etc. methods in the same way, but uses some other 
approach. Do you know why these methods were removed and made 
incompatible with parquet-avro-1.6.0?)

Please find below a sample Parquet-write program, which now fails to 
build with "parquet-avro-1.8.2-SNAPSHOT.jar". Do you have a sample 
Parquet-write program that works with "parquet-avro-1.8.2.jar" (to write 
primitive data types such as "int", "string", etc. to a Parquet file, as 
shown in the example below)? If yes, could you please point me to it?

=================================================================================================
public static Schema makeSchema() {
     List<Field> fields = new ArrayList<Field>();
     fields.add(new Field("name", Schema.create(Type.STRING), null, null));
     fields.add(new Field("age", Schema.create(Type.INT), null, null));
     fields.add(new Field("dept", Schema.create(Type.STRING), null, null));

     Schema schema = Schema.createRecord("filecc", null, "parquet", false);
     schema.setFields(fields);
     return(schema);
}

public static GenericData.Record makeRecord(Schema schema, String name, int age, String dept) {
     GenericData.Record record = new GenericData.Record(schema);
     record.put("name", name);
     record.put("age", age);
     record.put("dept", dept);
     return(record);
}

public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        String pqfile = "/tmp/pqtfile1";
        try {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.getLocal(conf);

                Schema schema = makeSchema();
                GenericData.Record rec = makeRecord(schema, "Person A", 21, "ED2");
                AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
                writer.write(rec);
                writer.close();
        } catch (Exception e) {
                e.printStackTrace();
        }
}
=================================================================================================

Thanks,
 Ravi




From:   Ravi Tatapudi/India/IBM
To:     [email protected]
Date:   04/05/2016 10:53 AM
Subject:        Re: How to write "date, timestamp, decimal" data to 
Parquet-files


Hello Ryan:

Many thanks for the inputs. I will try to build it today and see how it 
goes.

Could you please let me know an approximate date (or month) when 
parquet-avro-1.9.0 (or any other parquet-avro-1.8.x release that would 
include this fix) would be officially released (for example, by June 
2016, December 2016, or later)? It would be very helpful for my 
planning.

Thanks,
 Ravi




From:   Ryan Blue <[email protected]>
To:     Parquet Dev <[email protected]>
Date:   04/04/2016 10:05 PM
Subject:        Re: How to write "date, timestamp, decimal" data to 
Parquet-files



I don't think you can get the artifacts produced by our CI builds, but you
can check out the branch and build it using instructions in the 
repository.

On Mon, Apr 4, 2016 at 5:39 AM, Ravi Tatapudi <[email protected]>
wrote:

> Hello Ryan:
>
> Regarding the support for "date, timestamp, decimal" data types for
> Parquet-files:
>
> In your earlier mail, you mentioned that the pull request
> https://github.com/apache/parquet-mr/pull/318 has the necessary support
> for these data types (and that it would be released as part of
> parquet-avro 1.9.0).
>
> I see that this fix is included in build #1247 (and above?). How can I
> get that build (or the latest build), which includes the "parquet-avro"
> JAR with support for "date", "timestamp", etc.? Could you please let me
> know.
>
> Thanks,
>  Ravi
>
>
>
> From:   Ryan Blue <[email protected]>
> To:     Parquet Dev <[email protected]>
> Cc:     Nagesh R Charka/India/IBM@IBMIN, Srinivas
> Mudigonda/India/IBM@IBMIN
> Date:   03/14/2016 09:56 PM
> Subject:        Re: How to write "date, timestamp, decimal" data to
> Parquet-files
>
>
>
> Ravi,
>
> Support for those types in parquet-avro hasn't been committed yet. It's
> implemented in the branch I pointed you to. If you want to use released
> versions, it should be out in 1.9.0.
>
> rb
>
> On Sun, Mar 13, 2016 at 9:52 PM, Ravi Tatapudi <[email protected]>
> wrote:
>
> > Hello Ryan:
> >
> > Thanks for the inputs.
> >
> > I am building and running the test application primarily using the
> > following JAR files (for the Avro, Parquet-Avro, and Hive APIs):
> >
> > 1) avro-1.8.0.jar
> > 2) parquet-avro-1.6.0.jar (This is the latest one found in the
> > Maven repository:
> > http://mvnrepository.com/artifact/com.twitter/parquet-avro/1.6.0)
> > 3) hive-exec-1.2.1.jar
> >
> > Am I supposed to build/run the test using a different version of the
> > JAR files? Could you please let me know.
> >
> > Thanks,
> >  Ravi
> >
> >
> >
> >
> > From:   Ryan Blue <[email protected]>
> > To:     Parquet Dev <[email protected]>
> > Cc:     Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > Mudigonda/India/IBM@IBMIN
> > Date:   03/11/2016 10:54 PM
> > Subject:        Re: How to write "date, timestamp, decimal" data to
> > Parquet-files
> >
> >
> >
> > Yes, it is supported in 1.2.1. It went in here:
> >
> > https://github.com/apache/hive/commit/912b4897ed457cfc447995b124ae84078287530b
> >
> >
> > Are you using a version of Parquet with that pull request in it? Also,
> > if you're using CDH this may not work.
> >
> > rb
> >
> > On Fri, Mar 11, 2016 at 12:40 AM, Ravi Tatapudi <[email protected]>
> > wrote:
> >
> > > Hello Ryan:
> > >
> > > I am using hive-version: 1.2.1, as indicated below:
> > >
> > > --------------------------------------
> > > $ hive --version
> > > Hive 1.2.1
> > > Subversion git://localhost.localdomain/home/sush/dev/hive.git -r
> > > 243e7c1ac39cb7ac8b65c5bc6988f5cc3162f558
> > > Compiled by sush on Fri Jun 19 02:03:48 PDT 2015
> > > From source with checksum ab480aca41b24a9c3751b8c023338231
> > > $
> > > --------------------------------------
> > >
> > > As I understand, this version of Hive supports the "date" datatype,
> > > right? Do you want me to re-test using a higher version of Hive?
> > > Please let me know your thoughts.
> > >
> > > Thanks,
> > >  Ravi
> > >
> > >
> > >
> > > From:   Ryan Blue <[email protected]>
> > > To:     Parquet Dev <[email protected]>
> > > Cc:     Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > Mudigonda/India/IBM@IBMIN
> > > Date:   03/11/2016 06:18 AM
> > > Subject:        Re: How to write "date, timestamp, decimal" data to
> > > Parquet-files
> > >
> > >
> > >
> > > What version of Hive are you using? You should make sure date is
> > supported
> > > there.
> > >
> > > rb
> > >
> > > On Thu, Mar 10, 2016 at 3:11 AM, Ravi Tatapudi <[email protected]>
> > > wrote:
> > >
> > > > Hello Ryan:
> > > >
> > > > Many thanks for the reply. I see that the text attachment containing
> > > > my test program was not sent to the mailing list but got filtered
> > > > out. Hence, I am copying the program code below:
> > > >
> > > > =================================================================
> > > > import java.io.IOException;
> > > > import java.util.*;
> > > > import org.apache.hadoop.conf.Configuration;
> > > > import org.apache.hadoop.fs.FileSystem;
> > > > import org.apache.hadoop.fs.Path;
> > > > import org.apache.avro.Schema;
> > > > import org.apache.avro.Schema.Type;
> > > > import org.apache.avro.Schema.Field;
> > > > import org.apache.avro.generic.* ;
> > > > import org.apache.avro.LogicalTypes;
> > > > import org.apache.avro.LogicalTypes.*;
> > > > import org.apache.hadoop.hive.common.type.HiveDecimal;
> > > > import parquet.avro.*;
> > > >
> > > > public class pqtw {
> > > >
> > > > public static Schema makeSchema() {
> > > >      List<Field> fields = new ArrayList<Field>();
> > > >      fields.add(new Field("name", Schema.create(Type.STRING), null, null));
> > > >      fields.add(new Field("age", Schema.create(Type.INT), null, null));
> > > >
> > > >      Schema date = LogicalTypes.date().addToSchema(Schema.create(Type.INT));
> > > >      fields.add(new Field("doj", date, null, null));
> > > >
> > > >      Schema schema = Schema.createRecord("filecc", null, "parquet", false);
> > > >      schema.setFields(fields);
> > > >
> > > >      return(schema);
> > > > }
> > > >
> > > > public static GenericData.Record makeRecord(Schema schema, String name, int age, int doj) {
> > > >      GenericData.Record record = new GenericData.Record(schema);
> > > >      record.put("name", name);
> > > >      record.put("age", age);
> > > >      record.put("doj", doj);
> > > >      return(record);
> > > > }
> > > >
> > > > public static void main(String[] args) throws IOException,
> > > >         InterruptedException, ClassNotFoundException {
> > > >
> > > >         String pqfile = "/tmp/pqtfile1";
> > > >
> > > >         try {
> > > >                 Configuration conf = new Configuration();
> > > >                 FileSystem fs = FileSystem.getLocal(conf);
> > > >
> > > >                 Schema schema = makeSchema();
> > > >                 GenericData.Record rec = makeRecord(schema, "abcd", 21, 15000);
> > > >                 AvroParquetWriter writer = new AvroParquetWriter(new Path(pqfile), schema);
> > > >                 writer.write(rec);
> > > >                 writer.close();
> > > >         } catch (Exception e) {
> > > >                 e.printStackTrace();
> > > >         }
> > > >     }
> > > > }
> > > > =================================================================
> > > >
> > > > With the above logic, I could write the data to a Parquet file.
> > > > However, when I load it into a Hive table and select columns, the
> > > > "name" and "age" columns (i.e., VARCHAR and INT) are returned
> > > > successfully, but selecting the "date" column fails with the error
> > > > given below:
> > > >
> > > >
> > > > --------------------------------------------------------------------------------
> > > > hive> CREATE TABLE PT1 (name varchar(10), age int, doj date) STORED AS PARQUET;
> > > > OK
> > > > Time taken: 0.369 seconds
> > > > hive> load data local inpath '/tmp/pqtfile1' into table PT1;
> > > > hive> SELECT name,age from PT1;
> > > > OK
> > > > abcd    21
> > > > Time taken: 0.311 seconds, Fetched: 1 row(s)
> > > > hive> SELECT doj from PT1;
> > > > OK
> > > > Failed with exception
> > > > java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
> > > > java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot
> > > > be cast to org.apache.hadoop.hive.serde2.io.DateWritable
> > > > Time taken: 0.167 seconds
> > > > hive>
> > > >
> > > >
> > > > --------------------------------------------------------------------------------
> > > >
> > > > Basically, for the "date" datatype, I am trying to pass an integer
> > > > value (the number of days since the Unix epoch, 1 January 1970, so
> > > > that the date falls somewhere around 2011, etc.). Is this the correct
> > > > approach for date data (or is there another approach / API for it)?
> > > > Could you please let me know your inputs in this regard?
> > > >
> > > > Thanks,
> > > >  Ravi
> > > >
> > > >
> > > >
> > > > From:   Ryan Blue <[email protected]>
> > > > To:     Parquet Dev <[email protected]>
> > > > Cc:     Nagesh R Charka/India/IBM@IBMIN, Srinivas
> > > > Mudigonda/India/IBM@IBMIN
> > > > Date:   03/09/2016 10:48 PM
> > > > Subject:        Re: How to write "date, timestamp, decimal" data to
> > > > Parquet-files
> > > >
> > > >
> > > >
> > > > Hi Ravi,
> > > >
> > > > Not all of the types are fully implemented yet. I think Hive only
> > > > has partial support. If I remember correctly:
> > > > * Decimal is supported if the backing primitive type is fixed-length binary
> > > > * Date and Timestamp are supported, but Time has not been implemented yet
> > > >
> > > > For object models you can build applications on (instead of those
> > > > embedded in SQL), only Avro objects can support those types through
> > > > its LogicalTypes API. That API has been implemented in parquet-avro,
> > > > but not yet committed. I would like for this feature to make it into
> > > > 1.9.0. If you want to test in the meantime, check out the pull
> > > > request:
> > > >
> > > >   https://github.com/apache/parquet-mr/pull/318
> > > >
> > > > rb
> > > >
> > > > On Wed, Mar 9, 2016 at 5:09 AM, Ravi Tatapudi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am Ravi Tatapudi, from IBM India. I am working on a simple test
> > > > > tool that writes data to Parquet files, which can be imported into
> > > > > Hive tables. Please find attached a sample program, which writes a
> > > > > simple Parquet data file:
> > > > >
> > > > >
> > > > >
> > > > > Using the above program, I could create Parquet files with data
> > > > > types INT, LONG, STRING, BOOLEAN, etc. (basically all data types
> > > > > supported by org.apache.avro.Schema.Type) and load them into Hive
> > > > > tables successfully.
> > > > >
> > > > > Now, I am trying to figure out how to write "date, timestamp,
> > > > > decimal" data into Parquet files. In this context, I request you to
> > > > > provide the possible options (and/or a sample program, if any) in
> > > > > this regard.
> > > > >
> > > > > Thanks,
> > > > >  Ravi
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


