FelixKJose opened a new issue #1895:
URL: https://github.com/apache/hudi/issues/1895
I am getting an `Unknown converted type TIMESTAMP_MICROS` exception while
querying a Hudi dataset backed by the Hive metastore using Presto.
**My DF Schema:**
```
schema = StructType().add("_id", StringType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())
```
**My script with Hudi options:**
```
import uuid
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType


def spark():
    """
    This function is invoked to create the Spark session.

    :return: the Spark session
    """
    spark_session = (SparkSession
                     .builder
                     .appName("Data_Experimentation_Framework")
                     .getOrCreate())
    spark_session.conf.set("spark.sql.parquet.outputTimestampType",
                           "TIMESTAMP_MILLIS")
    return spark_session


class NullNamespace:
    bytes = b''


schema = StructType().add("_id", StringType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())

employees = [{'_id': str(uuid.uuid3(NullNamespace, "Felix")),
              'employer': 'Philips',
              'created_at': datetime.now(),
              'name': 'Felix'},
             {'_id': str(uuid.uuid3(NullNamespace, "Steve")),
              'employer': 'Apple',
              'created_at': datetime.now(),
              'name': 'Steve'}]

df = spark().createDataFrame(employees, schema=schema)
df.printSchema()

hudi_options = {
    # --------------- DATA SOURCE WRITE CONFIGS --------------- #
    'hoodie.table.name': 'employees',
    'hoodie.datasource.write.recordkey.field': '_id',
    'hoodie.datasource.write.precombine.field': 'created_at',
    'hoodie.datasource.write.partitionpath.field': 'employer',
    'hoodie.datasource.write.hive.style.partitioning': 'true',
    # --------------- HIVE SPECIFIC CONFIGS ------------------- #
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'employees',
    'hoodie.datasource.hive_sync.partition_fields': 'employer',
    'hoodie.datasource.hive_sync.jdbcurl':
        'jdbc:hive2://*********.ec2.internal:10000',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    # --------------- WRITE CLIENT CONFIGS -------------------- #
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.consistency.check.enabled': True,
    # --------------- INDEX CONFIGS --------------------------- #
    'hoodie.index.type': 'BLOOM',
    'hoodie.index.bloom.num_entries': 60000,
    'hoodie.index.bloom.fpp': 0.000000001,
    # --------------- STORAGE CONFIGS ------------------------- #
    'hoodie.cleaner.commits.retained': 2,
}

df.write \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://spark-hudi-poc/employees")
```
**As you can see, I set `spark.sql.parquet.outputTimestampType` to
TIMESTAMP_MILLIS, but when writing through Hudi this setting is overridden
and TIMESTAMP_MICROS is used instead.**
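For context on what Presto is rejecting: both annotations mark an INT64 epoch value in the Parquet footer and differ only in the unit, so a reader whose `MetadataReader` does not recognize the TIMESTAMP_MICROS annotation fails while parsing the footer, before reading any row data. A minimal plain-Python illustration of the unit difference (not Hudi or Spark API):

```python
from datetime import datetime, timezone

def epoch_micros(ts: datetime) -> int:
    # INT64 value a TIMESTAMP_MICROS Parquet column would store.
    return int(ts.timestamp() * 1_000_000)

def epoch_millis(ts: datetime) -> int:
    # INT64 value a TIMESTAMP_MILLIS Parquet column would store;
    # sub-millisecond precision is truncated.
    return int(ts.timestamp() * 1_000)

ts = datetime(2020, 7, 30, 16, 7, 1, tzinfo=timezone.utc)
print(epoch_micros(ts))  # 1000x the unit count of the millis encoding
print(epoch_millis(ts))
```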
Spark-submit conf:
`--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.sql.hive.convertMetastoreParquet=false --jars
/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar`
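A possible workaround, assuming downstream consumers can accept an epoch-millis long instead of a native timestamp (a hypothetical sketch, not verified against this Hudi build): convert `created_at` to epoch milliseconds in Python before building the DataFrame, so the Parquet column is a plain INT64 with no timestamp annotation for Presto to reject.

```python
from datetime import datetime, timezone

def with_epoch_millis(record: dict, field: str = "created_at") -> dict:
    # Return a copy of the record with the datetime field replaced
    # by epoch milliseconds, suitable for a LongType schema column.
    out = dict(record)
    out[field] = int(out[field].timestamp() * 1000)
    return out

employee = {"_id": "abc", "employer": "Philips", "name": "Felix",
            "created_at": datetime(2020, 7, 30, 16, 7, 1, tzinfo=timezone.utc)}
converted = with_epoch_millis(employee)
```

With this, `created_at` would be declared as `LongType()` in the schema instead of `TimestampType()`, at the cost of losing native timestamp semantics in Hive/Presto.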
**Environment Description**
* AWS EMR: 6.0.0
* Hudi version: custom version with performance fixes
* Spark version: 2.4.4
* Hive version: 3.1.0
* Hadoop version: 3.2.1
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): no
**Stacktrace**
```
An error occurred while calling o134.next.
java.sql.SQLException: Query failed (#20200730_181450_00003_yhjz2): Error opening Hive split s3://spark-kafka-poc/employees/Apple/ca4d0c17-5db0-4869-8fec-728ce62286f4-0_0-20-86_20200730160701.parquet (offset=0, length=435469): Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java:1840)
	at com.facebook.presto.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:1820)
	at com.facebook.presto.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:1759)
	at com.facebook.presto.jdbc.internal.guava.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.facebook.presto.jdbc.internal.guava.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.facebook.presto.jdbc.internal.guava.collect.TransformedIterator.hasNext(TransformedIterator.java:42)
	at com.facebook.presto.jdbc.internal.guava.collect.Iterators$ConcatenatedIterator.getTopMetaIterator(Iterators.java:1311)
	at com.facebook.presto.jdbc.internal.guava.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1327)
	at com.facebook.presto.jdbc.PrestoResultSet.next(PrestoResultSet.java:144)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.facebook.presto.spi.PrestoException: Error opening Hive split s3://spark-kafka-poc/employees/Apple/ca4d0c17-5db0-4869-8fec-728ce62286f4-0_0-20-86_20200730160701.parquet (offset=0, length=435469): Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:258)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:143)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:300)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:123)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:51)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:58)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:248)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
	at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
	at com.facebook.presto.$gen.Presto_0_230____20200724_204412_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.IllegalArgumentException: Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.parquet.reader.MetadataReader.getOriginalType(MetadataReader.java:294)
	at com.facebook.presto.parquet.reader.MetadataReader.readTypeSchema(MetadataReader.java:196)
	at com.facebook.presto.parquet.reader.MetadataReader.readParquetSchema(MetadataReader.java:168)
	at com.facebook.presto.parquet.reader.MetadataReader.readFooter(MetadataReader.java:110)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:186)
	... 17 more
```
Could you let me know how I can ensure the timestamp is stored as
TIMESTAMP_MILLIS?