popart opened a new issue #1329: [SUPPORT] Presto cannot query non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329

**Describe the problem you faced**

I made a non-partitioned Hudi table using Spark. I was able to query it with Spark and Hive, but when I tried querying it with Presto, I received the error `Could not find partitionDepth in partition metafile`.

**To Reproduce**

Steps to reproduce the behavior:

1. Use an emr-5.28.0 cluster
2. Run spark-shell:
```
spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --deploy-mode client
```
3. Run the following Spark code:
```
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.hive._
import org.apache.hudi.keygen.NonpartitionedKeyGenerator

val inputPath = "s3://path/to/a/parquet/file"
val tableName = "my_test_table"
val basePath = "s3://test-bucket/my_test_table"

val inputDf = spark.read.parquet(inputPath)

val hudiOptions = Map[String,String](
  RECORDKEY_FIELD_OPT_KEY -> "dim_advertiser_id",
  PRECOMBINE_FIELD_OPT_KEY -> "update_time",
  TABLE_NAME -> tableName,
  KEYGENERATOR_CLASS_OPT_KEY -> classOf[NonpartitionedKeyGenerator].getCanonicalName, // needed for non-partitioned table
  HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[NonPartitionedExtractor].getCanonicalName, // needed for non-partitioned table
  OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
  HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  HIVE_TABLE_OPT_KEY -> tableName,
  TABLE_TYPE_OPT_KEY -> COW_TABLE_TYPE_OPT_VAL,
  "hoodie.bulkinsert.shuffle.parallelism" -> "10")

inputDf.write.format("org.apache.hudi").
  options(hudiOptions).
  mode(Overwrite).
  save(basePath)
```
4. Querying the table in Spark or Hive works.
5.
Querying the table in Presto fails:
```
[hadoop@ip-172-31-128-118 ~]$ presto-cli --catalog hive --schema default
presto:default> select count(*) from my_test_table;

Query 20200211_185123_00018_pruwt, FAILED, 1 node
Splits: 17 total, 0 done (0.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

Query 20200211_185123_00018_pruwt failed: Could not find partitionDepth in partition metafile
com.facebook.presto.spi.PrestoException: Could not find partitionDepth in partition metafile
	at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:200)
	at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
	at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
	at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: Could not find partitionDepth in partition metafile
	at org.apache.hudi.common.model.HoodiePartitionMetadata.getPartitionDepth(HoodiePartitionMetadata.java:75)
	at org.apache.hudi.hadoop.HoodieParquetInputFormat.getTableMetaClient(HoodieParquetInputFormat.java:209)
	at org.apache.hudi.hadoop.HoodieParquetInputFormat.groupFileStatus(HoodieParquetInputFormat.java:158)
	at org.apache.hudi.hadoop.HoodieParquetInputFormat.listStatus(HoodieParquetInputFormat.java:69)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:288)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadPartition(BackgroundHiveSplitLoader.java:371)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:264)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$300(BackgroundHiveSplitLoader.java:96)
	at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:193)
	... 7 more
```

**Expected behavior**

Presto should return a count of all the rows, and other Presto queries should succeed.

**Environment Description**

* EMR version: emr-5.28.0
* Hudi version: 0.5.1-incubating
* Spark version: 2.4.4
* Hive version: 2.3.6
* Hadoop version: 2.8.5
* Presto version: 0.277
* Storage (HDFS/S3/GCS..): S3
* Running on Docker? (yes/no): no

**Stacktrace**

Included in "To Reproduce" above.
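The exception comes from reading the `.hoodie_partition_metadata` file Hudi writes in each partition directory (for a non-partitioned table, the base path itself). As a quick diagnostic, the sketch below recreates such a metafile locally and checks for the key the error complains about; the sample contents, timestamp, and local path are illustrative assumptions, not values taken from the actual table:

```shell
# Illustrative only: a .hoodie_partition_metadata file is a Java-properties
# file; the failing code path expects a partitionDepth entry in it.
# On the real cluster you would inspect the file at the table base path, e.g.:
#   hadoop fs -cat s3://test-bucket/my_test_table/.hoodie_partition_metadata
mkdir -p /tmp/my_test_table
cat > /tmp/my_test_table/.hoodie_partition_metadata <<'EOF'
#partition metadata
commitTime=20200211185100
partitionDepth=0
EOF
# For a non-partitioned table the partition dir is the base path, so depth is 0.
grep partitionDepth /tmp/my_test_table/.hoodie_partition_metadata
# prints: partitionDepth=0
```

If the grep finds no `partitionDepth` line (or the file is missing entirely from the base path), that would match the failure Presto reports.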
