I think the only support right now is in Hive and Spark.
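(For reference, the Hive/Spark support decodes the 12-byte INT96 value directly. Below is a minimal plain-Java sketch of that decoding, assuming the commonly used Impala convention: 8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number. The class and method names are illustrative, and the byte[] would come from a Binary value obtained with a lower-level Parquet reader.)

================================================================
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Timestamps {

    // 1970-01-01 expressed as a Julian day number.
    private static final long JULIAN_EPOCH_DAY = 2440588L;
    private static final long MILLIS_PER_DAY = 86400000L;

    /** Decodes a 12-byte Impala-style INT96 timestamp into millis since the Unix epoch. */
    public static long toEpochMillis(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes: nanoseconds within the day
        long julianDay = buf.getInt();   // last 4 bytes: Julian day number
        return (julianDay - JULIAN_EPOCH_DAY) * MILLIS_PER_DAY + nanosOfDay / 1000000L;
    }
}
================================================================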
On Fri, Jul 29, 2016 at 7:15 AM, Ravi Tatapudi <[email protected]> wrote:

> Hello Ryan:
>
> Did you get a chance to see my queries in the mail below? Basically, I am
> trying to understand which API we should use to read "timestamp" data
> (even after truncating the nano/micro/milli-seconds part) from Parquet
> files created by Hive or any other application, which essentially boils
> down to the queries below. Is it the parquet-avro API, or some other API?
>
> ------------------------------------
> 1) Is it possible to read "timestamp" data from a Parquet file (generated
> by Hive, as part of a table stored as Parquet with timestamp rows
> inserted) using a standalone Java application with the parquet-avro API
> 1.9.0?
>
> 2) Would "timestamp" data written to a Parquet file by a standalone Java
> application (using the parquet-avro API 1.9.0) be read by Hive
> successfully?
>
> 3) Using the Parquet 1.9.0 API, when we try to read/write data from Hive,
> does it successfully read (or write) the data after truncating the
> nanoseconds part, or will it fail with "incompatible object" errors?
> ------------------------------------
>
> Could you please let me know your thoughts...
>
> Thanks,
> Ravi
>
>
>
> From: Ravi Tatapudi/India/IBM
> To: [email protected]
> Cc: Srinivas Mudigonda/India/IBM@IBMIN
> Date: 07/25/2016 11:35 AM
> Subject: Re: To read/write "timestamp" data from/to Parquet-formatted
> files on HDFS.
>
>
> Hello Ryan:
>
> Many thanks for the reply.
>
> Our requirement is to read "timestamp" data from Parquet files on HDFS
> (created as part of Hive tables stored as Parquet). At this point, we are
> not really looking for the milli/micro/nano-seconds part of the
> timestamp, but are trying to read the timestamp data in "YYYY-MM-DD
> hh:mm:ss" format.
>
> In this context, could you please provide your inputs on the following
> queries, so that we can plan accordingly:
>
> ================================================================
> 1) Is it possible to read "timestamp" data from a Parquet file (generated
> by Hive, as part of a table stored as Parquet with timestamp rows
> inserted) using a standalone Java application with the parquet-avro API
> 1.9.0?
>
> 2) Would "timestamp" data written to a Parquet file by a standalone Java
> application (using the parquet-avro API 1.9.0) be read by Hive
> successfully?
>
> 3) Using the Parquet 1.9.0 API, when we try to read/write data from Hive,
> does it successfully read (or write) the data after truncating the
> nanoseconds part, or will it fail with "incompatible object" errors?
> ================================================================
>
> Thanks,
> Ravi
>
>
>
>
>
> From: Ryan Blue <[email protected]>
> To: Parquet Dev <[email protected]>
> Cc: Srinivas Mudigonda/India/IBM@IBMIN
> Date: 07/22/2016 11:27 PM
> Subject: Re: To read/write "timestamp" data from/to Parquet-formatted
> files on HDFS.
>
>
>
> Hi Ravi,
>
> Hive's int96 timestamp is based on a format originally used by the Impala
> project. It isn't well-defined, assumes that all int96 values are
> timestamps, and implements nanosecond precision. It's not a good idea to
> use it, so I don't think we will be implementing support for it in the
> Avro API. There is, however, support for timestamp-millis and
> timestamp-micros types in 1.9.0.
>
> rb
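(As a concrete illustration of the timestamp-millis support mentioned above, a standalone application could write and read such a column roughly as follows. This is an untested sketch, assuming parquet-avro 1.9.0 with Avro 1.8's logical types on the classpath; the file path and the record/field names are made up for illustration.)

================================================================
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class TimestampMillisExample {
    public static void main(String[] args) throws Exception {
        // An Avro long carrying the timestamp-millis logical type; parquet-avro
        // maps it to a Parquet INT64 annotated as TIMESTAMP_MILLIS.
        Schema tsMillis =
            LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
        Schema schema = SchemaBuilder.record("Event").fields()
            .name("ts").type(tsMillis).noDefault()
            .endRecord();

        Path file = new Path("/tmp/events.parquet"); // illustrative path

        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(file).withSchema(schema).build()) {
            GenericRecord r = new GenericData.Record(schema);
            r.put("ts", System.currentTimeMillis()); // millis since epoch, as a plain long
            writer.write(r);
        }

        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(file).build()) {
            GenericRecord r;
            while ((r = reader.read()) != null) {
                long millis = (Long) r.get("ts"); // read back as a plain long
                System.out.println(new java.sql.Timestamp(millis));
            }
        }
    }
}
================================================================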
> On Wed, Jul 6, 2016 at 3:17 AM, Ravi Tatapudi <[email protected]>
> wrote:
>
> > Hello,
> >
> > I tried reading timestamp data from a Parquet file (created as part of
> > a Hive table stored in Parquet format) with a Java sample program using
> > parquet-avro API version 1.8.1, and I got the exception below:
> >
> > ================================================
> > java.lang.IllegalArgumentException: INT96 not yet implemented.
> >     at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:252)
> >     at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:237)
> >     at org.apache.parquet.schema.PrimitiveType$PrimitiveTypeName$7.convert(PrimitiveType.java:223)
> >     at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:236)
> >     at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:216)
> >     at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:210)
> >     at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:124)
> >     at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:171)
> >     at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:149)
> >     at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:125)
> >     at pqtr.main(pqtr.java:63)
> > ================================================
> >
> > I have looked at the Parquet code and see the following in
> > org/apache/parquet/avro/AvroSchemaConverter.java:
> >
> >     public Schema convertINT96(PrimitiveTypeName primitiveTypeName) {
> >         throw new IllegalArgumentException("INT96 not yet implemented.");
> >
> > However, in other parts of the code (in
> > org/apache/parquet/encodings/FileEncodingsIT.java and
> > org/apache/parquet/statistics/TestStatistics.java), I see that
> > convertINT96 is implemented to return Binary values.
> >
> > In this context, I am trying to figure out why the Parquet-Avro API
> > throws an error instead of returning "Binary" (or
> > "fixed_len_byte_array") values.
> >
> > Will this be supported in the next Parquet release (1.9.0?)? If it is
> > already fixed and can be obtained via a pull request, I request you to
> > point me to the same.
> >
> > Thanks,
> > Ravi
> >
> >
> >
> > From: Ravi Tatapudi/India/IBM
> > To: [email protected]
> > Date: 07/04/2016 12:28 PM
> > Subject: To read/write "timestamp" data from/to Parquet-formatted
> > files on HDFS.
> >
> >
> > Hello,
> >
> > I am trying to write/read "timestamp" data to/from Parquet-formatted
> > files.
> >
> > As I understand, the latest parquet-avro API version 1.8.1 doesn't
> > support "timestamp". In this context, what other options/APIs are
> > available to read/write "timestamp" data from/to Parquet files?
> >
> > Please let me know (and if there are any examples, could you please
> > point me to the same).
> >
> > Thanks,
> > Ravi
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



--
Ryan Blue
Software Engineer
Netflix
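(On the "YYYY-MM-DD hh:mm:ss" requirement discussed in the thread: once a timestamp is in hand as milliseconds since the epoch, from either of the sketches above, truncating the sub-second part and formatting it is plain Java. A small sketch; the UTC time zone here is an assumption and should be chosen to match how the data was written.)

================================================================
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SecondPrecision {

    /** Formats epoch millis as "YYYY-MM-DD hh:mm:ss", dropping sub-second digits. */
    public static String format(long epochMillis) {
        long truncated = (epochMillis / 1000L) * 1000L; // drop the milliseconds part
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: data written as UTC
        return fmt.format(new Date(truncated));
    }
}
================================================================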
