FelixKJose opened a new issue #1895:
URL: https://github.com/apache/hudi/issues/1895
I am getting an `Unknown converted type TIMESTAMP_MICROS` exception while
querying a Hudi dataset backed by the Hive metastore using Presto.
**My DF Schema:**
```
schema = StructType().add("_id", StringType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())
```
**My script with Hudi options:**
```
import uuid
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType


def spark():
    """
    This function is invoked to create the Spark session.

    :return: the Spark session
    """
    spark_session = (SparkSession
                     .builder
                     .appName("Data_Experimentation_Framework")
                     .getOrCreate())
    spark_session.conf.set("spark.sql.parquet.outputTimestampType",
                           "TIMESTAMP_MILLIS")
    return spark_session


class NullNamespace:
    bytes = b''


schema = StructType().add("_id", StringType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())

employees = [{'_id': str(uuid.uuid3(NullNamespace, "Felix")),
              'employer': 'Philips',
              'created_at': datetime.now(),
              'name': 'Felix'},
             {'_id': str(uuid.uuid3(NullNamespace, "Steve")),
              'employer': 'Apple',
              'created_at': datetime.now(),
              'name': 'Steve'}]

df = spark().createDataFrame(employees, schema=schema)
df.printSchema()

hudi_options = {
    # --------------- DATA SOURCE WRITE CONFIGS --------------- #
    'hoodie.table.name': 'employees',
    'hoodie.datasource.write.recordkey.field': '_id',
    'hoodie.datasource.write.precombine.field': 'created_at',
    'hoodie.datasource.write.partitionpath.field': 'employer',
    'hoodie.datasource.write.hive.style.partitioning': 'true',
    # --------------- HIVE SPECIFIC CONFIGS ------------------- #
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'employees',
    'hoodie.datasource.hive_sync.partition_fields': 'employer',
    'hoodie.datasource.hive_sync.jdbcurl':
        'jdbc:hive2://*********.ec2.internal:10000',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    # --------------- WRITE CLIENT CONFIGS -------------------- #
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.consistency.check.enabled': True,
    # --------------- INDEX CONFIGS --------------------------- #
    'hoodie.index.type': 'BLOOM',
    'hoodie.index.bloom.num_entries': 60000,
    'hoodie.index.bloom.fpp': 0.000000001,
    # --------------- STORAGE CONFIGS ------------------------- #
    'hoodie.cleaner.commits.retained': 2,
}

df.write \
    .format("org.apache.hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://spark-hudi-poc/employees")
```
**As you can see, I set `spark.sql.parquet.outputTimestampType` to
TIMESTAMP_MILLIS, but when writing through Hudi this setting is overridden
and TIMESTAMP_MICROS is used instead.**
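For context on what Presto is rejecting: both annotations mark an INT64 epoch value in the Parquet footer and differ only in the unit, so a reader whose `MetadataReader` does not recognize the TIMESTAMP_MICROS annotation fails while parsing the footer, before reading any row data. A minimal plain-Python illustration of the unit difference (not Hudi or Spark API):

```python
from datetime import datetime, timezone

def epoch_micros(ts: datetime) -> int:
    # INT64 value a TIMESTAMP_MICROS Parquet column would store.
    return int(ts.timestamp() * 1_000_000)

def epoch_millis(ts: datetime) -> int:
    # INT64 value a TIMESTAMP_MILLIS Parquet column would store;
    # sub-millisecond precision is truncated.
    return int(ts.timestamp() * 1_000)

ts = datetime(2020, 7, 30, 16, 7, 1, tzinfo=timezone.utc)
print(epoch_micros(ts))  # 1000x the unit count of the millis encoding
print(epoch_millis(ts))
```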
Spark-submit conf:
`--conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf
spark.sql.hive.convertMetastoreParquet=false --jars
/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar`
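A possible workaround, assuming downstream consumers can accept an epoch-millis long instead of a native timestamp (a hypothetical sketch, not verified against this Hudi build): convert `created_at` to epoch milliseconds in Python before building the DataFrame, so the Parquet column is a plain INT64 with no timestamp annotation for Presto to reject.

```python
from datetime import datetime, timezone

def with_epoch_millis(record: dict, field: str = "created_at") -> dict:
    # Return a copy of the record with the datetime field replaced
    # by epoch milliseconds, suitable for a LongType schema column.
    out = dict(record)
    out[field] = int(out[field].timestamp() * 1000)
    return out

employee = {"_id": "abc", "employer": "Philips", "name": "Felix",
            "created_at": datetime(2020, 7, 30, 16, 7, 1, tzinfo=timezone.utc)}
converted = with_epoch_millis(employee)
```

With this, `created_at` would be declared as `LongType()` in the schema instead of `TimestampType()`, at the cost of losing native timestamp semantics in Hive/Presto.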
**Environment Description**
* AWS EMR: 6.0.0
* Hudi version: custom version with performance fixes
* Spark version: 2.4.4
* Hive version: 3.1.0
* Hadoop version: 3.2.1
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): no
**Stacktrace**
```
An error occurred while calling o134.next.
java.sql.SQLException: Query failed (#20200730_181450_00003_yhjz2): Error opening Hive split s3://spark-kafka-poc/employees/Apple/ca4d0c17-5db0-4869-8fec-728ce62286f4-0_0-20-86_20200730160701.parquet (offset=0, length=435469): Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.jdbc.PrestoResultSet.resultsException(PrestoResultSet.java:1840)
	at com.facebook.presto.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:1820)
	at com.facebook.presto.jdbc.PrestoResultSet$ResultsPageIterator.computeNext(PrestoResultSet.java:1759)
	at com.facebook.presto.jdbc.internal.guava.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.facebook.presto.jdbc.internal.guava.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.facebook.presto.jdbc.internal.guava.collect.TransformedIterator.hasNext(TransformedIterator.java:42)
	at com.facebook.presto.jdbc.internal.guava.collect.Iterators$ConcatenatedIterator.getTopMetaIterator(Iterators.java:1311)
	at com.facebook.presto.jdbc.internal.guava.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1327)
	at com.facebook.presto.jdbc.PrestoResultSet.next(PrestoResultSet.java:144)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.facebook.presto.spi.PrestoException: Error opening Hive split s3://spark-kafka-poc/employees/Apple/ca4d0c17-5db0-4869-8fec-728ce62286f4-0_0-20-86_20200730160701.parquet (offset=0, length=435469): Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:258)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:143)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:300)
	at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:123)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:51)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:58)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:248)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:379)
	at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:283)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:675)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:483)
	at com.facebook.presto.$gen.Presto_0_230____20200724_204412_1.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.lang.IllegalArgumentException: Unknown converted type TIMESTAMP_MICROS
	at com.facebook.presto.parquet.reader.MetadataReader.getOriginalType(MetadataReader.java:294)
	at com.facebook.presto.parquet.reader.MetadataReader.readTypeSchema(MetadataReader.java:196)
	at com.facebook.presto.parquet.reader.MetadataReader.readParquetSchema(MetadataReader.java:168)
	at com.facebook.presto.parquet.reader.MetadataReader.readFooter(MetadataReader.java:110)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:186)
	... 17 more
```
Could you let me know how I can ensure the timestamp is stored as
TIMESTAMP_MILLIS?