rubenssoto edited a comment on issue #1829:
URL: https://github.com/apache/hudi/issues/1829#issuecomment-766821182
@vinothchandar
Thank you so much for your answer.
When do you plan to release this version? I will try some workarounds until
then.
Is this configuration correct?
```
{
  "conf": {
    "spark.jars.packages": "org.apache.spark:spark-avro_2.12:2.4.4",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.jars": "s3://dl/lib/hudi-spark-bundle_2.12-0.8.0-SNAPSHOT.jar",
    "spark.sql.hive.convertMetastoreParquet": "false",
    "spark.hadoop.hoodie.metadata.enable": "true"
  }
}
```
I ran these two queries:
```
spark.read.format('hudi').load('s3://ze-data-lake/temp/order_test').count()
```
```
%%sql
select count(*) from raw_courier_api.order_test
```
For the PySpark query, Spark creates a job with 143 tasks and, after about 10
seconds of file listing, the count is fast. For the Spark SQL query, Spark
creates a job with 2000 tasks and it is very slow. Is this a Hudi or a Spark
issue?
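A possible workaround I am considering (just a sketch, not verified against this Hudi snapshot build): reading the table through the Hudi datasource and registering it as a temporary view, so the SQL count goes through the same Hudi relation and file listing as the PySpark query instead of the Hive metastore parquet path. The path comes from the query above; `spark` is assumed to be an active SparkSession, and the view name `order_test_hudi` is my own choice.

```python
# Sketch: route the SQL query through the Hudi datasource instead of
# the Hive metastore table. Assumes an active SparkSession `spark`
# and the same table path used in the queries above.
df = spark.read.format("hudi").load("s3://ze-data-lake/temp/order_test")

# Register a temporary view so Spark SQL uses the same Hudi relation
# (and the same listing behavior) as the DataFrame read.
df.createOrReplaceTempView("order_test_hudi")

# This count should then match df.count() from the PySpark query.
spark.sql("select count(*) from order_test_hudi").show()
```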
SPARK SQL
<img width="1680" alt="Captura de Tela 2021-01-25 às 10 45 16"
src="https://user-images.githubusercontent.com/36298331/105713972-83bd7a80-5efa-11eb-91e0-b17ca1a3a394.png">
PYSPARK
<img width="1680" alt="Captura de Tela 2021-01-25 às 10 47 13"
src="https://user-images.githubusercontent.com/36298331/105714171-ca12d980-5efa-11eb-8a68-97dc880b2671.png">
Another problem I ran into: my table has 36 million rows, but with this
config the count shows only 4 million.
Thank you so much!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]