Thanks Ryan!!

I am trying to write in Parquet format using hive-exec 1.2.1; this jar supports 
writing to the ORC/Parquet file formats that Hive supports.
My sample code looks like this:

        // must be final so the anonymous WriteSupport below can capture it
        final MessageType schema = MessageTypeParser.parseMessageType(
            "message basket { required int64 time (TIMESTAMP_MILLIS); }");
        Configuration config = new Configuration();
        config.set("fs.default.name", "hdfs://<ip>:9000");
        Path outDirPath = new Path("hdfs://<ip>:9000/user/hdfs/test5");

        ParquetWriter writer = new ParquetWriter(outDirPath, new DataWritableWriteSupport()
        {
          @Override
          public WriteContext init(Configuration configuration)
          {
            // DataWritableWriteSupport reads the Hive schema from the configuration
            if (configuration.get(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA) == null)
            {
              configuration.set(DataWritableWriteSupport.PARQUET_HIVE_SCHEMA, schema.toString());
            }
            return super.init(configuration);
          }
        }, CompressionCodecName.SNAPPY, 256 * 1024 * 1024, 100 * 1024, 100 * 1024,
        true, false, WriterVersion.PARQUET_2_0, config);

        List<ObjectInspector> list = new ArrayList<ObjectInspector>();
        ObjectInspector tins = ObjectInspectorFactory.getReflectionObjectInspector(
            Timestamp.class, ObjectInspectorOptions.JAVA);
        list.add(tins);

        List<String> columnnames = new ArrayList<String>();
        columnnames.add("time");

        StructObjectInspector inspec =
            ObjectInspectorFactory.getStandardStructObjectInspector(columnnames, list);

        List<Object> obj = new ArrayList<Object>();
        obj.add(new Timestamp(new Date().getTime()));
        writer.write(new ParquetHiveRecord(obj, inspec));
        writer.close();

====================================================================

Since the timestamp is declared in the Parquet schema as an annotated "INT64" 
(which designates a long), a LongStatistics object is instantiated and used 
after values are written.
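For context, TIMESTAMP_MILLIS only annotates a physical int64, so what a writer should hand such a column (and what LongStatistics expects) is a plain epoch-millis long. A minimal JDK-only sketch (the class and method names are mine, just for illustration):

```java
import java.sql.Timestamp;

public class TimestampMillis {
    // A TIMESTAMP_MILLIS column physically stores epoch milliseconds as int64,
    // so this long is the value a writer for such a column should emit.
    static long millisFor(Timestamp ts) {
        return ts.getTime();
    }

    public static void main(String[] args) {
        System.out.println(millisFor(new Timestamp(0L))); // epoch -> 0
    }
}
```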

Using code like this (parquet.column.statistics.Statistics.java):

public static Statistics getStatsBasedOnType(PrimitiveType.PrimitiveTypeName type)
{
  switch (type) {
  case INT32:
    return new IntStatistics();
  case INT64:
    return new LongStatistics();
  case FLOAT:
    return new FloatStatistics();
  case DOUBLE:
    return new DoubleStatistics();
  case BOOLEAN:
    return new BooleanStatistics();
  case BINARY:
    return new BinaryStatistics();
  case INT96:
    return new BinaryStatistics();
  case FIXED_LEN_BYTE_ARRAY:
    return new BinaryStatistics();
  }
  throw new UnknownColumnTypeException(type);
}


But when a Timestamp is written, the following code executes and writes the 
value in binary form 
(org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.java):

Line 295:  case TIMESTAMP:
        Timestamp ts = ((TimestampObjectInspector) inspector).getPrimitiveJavaObject(value);
        recordConsumer.addBinary(NanoTimeUtils.getNanoTime(ts, false).toBinary());
        break;

Which leads to parquet.column.impl.ColumnWriterV2.java:

public void write(Binary value, int repetitionLevel, int definitionLevel)
{
  if (DEBUG) log(value, repetitionLevel, definitionLevel);
  repetitionLevel(repetitionLevel);
  definitionLevel(definitionLevel);
  this.dataColumn.writeBytes(value);
  this.statistics.updateStats(value);   // line 158: FAILS here
  this.valueCount += 1;
}

Since the statistics object instantiated is LongStatistics, which only defines 
updateStats for a long value, the call made with a Binary argument falls 
through to the base class Statistics.java, which throws an unsupported 
exception.
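In other words, DataWritableWriter always emits TIMESTAMP as INT96 binary regardless of the declared schema, so pairing it with an int64 (TIMESTAMP_MILLIS) column cannot work as-is. One possible workaround (a sketch I have not run against hive-exec 1.2.1: declare the column as a plain int64/bigint and supply the value as a long, via a long ObjectInspector instead of the Timestamp one) hinges on this conversion:

```java
import java.sql.Timestamp;

public class TimestampAsLong {
    // Convert the Timestamp to epoch millis before it reaches the writer.
    // With the column declared as int64 and the value supplied as a long,
    // the write path calls addLong(...), so LongStatistics.updateStats(long)
    // matches instead of falling into the unsupported Binary overload.
    static long toEpochMillis(Timestamp ts) {
        return ts.getTime();
    }

    public static void main(String[] args) {
        // this long is what obj.add(...) in the sample above would receive
        System.out.println(toEpochMillis(new Timestamp(1446545940000L)));
    }
}
```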

Complete StackTrace:

Exception in thread "main" java.lang.RuntimeException: Parquet record is 
malformed: null
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
        at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
        at parquet.hadoop.ParquetWriter.write(ParquetWriter.java:258)
        at ParquetTestWriter$1.run(ParquetTestWriter.java:126)
        at ParquetTestWriter$1.run(ParquetTestWriter.java:1)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at ParquetTestWriter.main(ParquetTestWriter.java:37)
Caused by: java.lang.UnsupportedOperationException
        at parquet.column.statistics.Statistics.updateStats(Statistics.java:115)
        at parquet.column.impl.ColumnWriterV2.write(ColumnWriterV2.java:158)
        at 
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:346)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writePrimitive(DataWritableWriter.java:297)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:106)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
        at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
        ... 10 more

Appreciate your help!

-Manisha

-----Original Message-----
From: Ryan Blue [mailto:[email protected]]
Sent: Tuesday, November 03, 2015 10:19 PM
To: [email protected]
Subject: Re: Timestamp and time not being written Unsupported

On 11/03/2015 12:35 AM, Manisha Sethi wrote:
> Hi
>
> I am trying to write timestamp using int64 (TIMESTAMP_MILLIS) via 
> ParquetWriter using jar hive-exec 1.2.1... But getting unsupported exception...
> Issue is: when call reaches "add binary"
>
>          break;
>        case BINARY:
>          byte[] vBinary = ((BinaryObjectInspector) 
> inspector).getPrimitiveJavaObject(value);
>          recordConsumer.addBinary(Binary.fromByteArray(vBinary));
>          break;
>        case TIMESTAMP:
>          Timestamp ts = ((TimestampObjectInspector) 
> inspector).getPrimitiveJavaObject(value);
>          recordConsumer.addBinary(NanoTimeUtils.getNanoTime(ts, 
> false).toBinary());
>          break;
>        case DECIMAL:
>          HiveDecimal vDecimal = 
> ((HiveDecimal)inspector.getPrimitiveJavaObject(value));
>          DecimalTypeInfo decTypeInfo = 
> (DecimalTypeInfo)inspector.getTypeInfo();
>          recordConsumer.addBinary(decimalToBinary(vDecimal, decTypeInfo));
>          break;
>
> Then in Columnwriter it fails at updatestatistics, since call is made
> using longstatistic(corrs to its int64 data type but value is binary
> which is not defined)
>
> this.repetitionLevelColumn.writeInteger(repetitionLevel);
> /* 203 */     this.definitionLevelColumn.writeInteger(definitionLevel);
> /* 204 */     this.dataColumn.writeBytes(value);
> /* 205 */     updateStatistics(value);
>
> this.statistics.updateStats(value);====>>>> Method is not defined for
> LongStatistics, hence throws unsupported exception
>
> ________________________________
>
>

Manisha,

Thanks for taking the time to e-mail about this. How are you writing this 
timestamp? Are you using Hive, or are you trying to use Hive's object model in 
your own code?

Could you also send the stack trace that you're seeing? I'm confused about why 
the method would be undefined, since it should be defined for the types 
correctly.

Thanks,

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
