[ https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253531#comment-16253531 ]
Maciej BryĆski edited comment on SPARK-16996 at 11/15/17 9:11 PM: ------------------------------------------------------------------ In Spark 2.2 even major compaction doesn't help if we're modifying data after compaction. Any delta files creates exception: {code} scala> spark.sql("select * from hello_acid").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:314) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837) at org.apache.spark.sql.Dataset.head(Dataset.scala:2154) at org.apache.spark.sql.Dataset.take(Dataset.scala:2367) at org.apache.spark.sql.Dataset.showString(Dataset.scala:245) at org.apache.spark.sql.Dataset.show(Dataset.scala:641) at org.apache.spark.sql.Dataset.show(Dataset.scala:600) at org.apache.spark.sql.Dataset.show(Dataset.scala:609) ... 48 elided Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0000016_0000" at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998) ... 88 more Caused by: java.lang.NumberFormatException: For input string: "0000016_0000" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310) at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} [~srowen] I think at least reading from such a tables should be supported. ACID is rather old Hive feature. This exception was created on HDP 2.6 using standard components. was (Author: maver1ck): In Spark 2.2 even major compaction doesn't help if we're modifying data after compaction. Any delta files creates exception: {code} scala> spark.sql("select * from hello_acid").show() java.lang.RuntimeException: serious problem at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.rdd.RDD.partitions(RDD.scala:250) at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:314) at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2854) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154) at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2838) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2837) at org.apache.spark.sql.Dataset.head(Dataset.scala:2154) at org.apache.spark.sql.Dataset.take(Dataset.scala:2367) at org.apache.spark.sql.Dataset.showString(Dataset.scala:245) at org.apache.spark.sql.Dataset.show(Dataset.scala:641) at org.apache.spark.sql.Dataset.show(Dataset.scala:600) at org.apache.spark.sql.Dataset.show(Dataset.scala:609) ... 48 elided Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0000016_0000" at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998) ... 88 more Caused by: java.lang.NumberFormatException: For input string: "0000016_0000" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:589) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310) at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} [~srowen] I think at least reading from such a tables should be supported. ACID is rather old Hive feature. > Hive ACID delta files not seen > ------------------------------ > > Key: SPARK-16996 > URL: https://issues.apache.org/jira/browse/SPARK-16996 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0 > Environment: Hive 1.2.1, Spark 1.5.2 > Reporter: Benjamin BONNET > Priority: Critical > > spark-sql seems not to see data stored as delta files in an ACID Hive table. > Actually I encountered the same problem as describe here : > http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp > For example, create an ACID table with HiveCLI and insert a row : > {code} > set hive.support.concurrency=true; > set hive.enforce.bucketing=true; > set hive.exec.dynamic.partition.mode=nonstrict; > set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; > set hive.compactor.initiator.on=true; > set hive.compactor.worker.threads=1; > CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 > BUCKETS > ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > TBLPROPERTIES ('transactional'='true'); > INSERT INTO deltas VALUES("a","a"); > {code} > Then make a query with spark-sql CLI : > {code} > SELECT * FROM deltas; > {code} > That query gets no result and there are no errors in logs. > If you go to HDFS to inspect table files, you find only deltas > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxr-x--- - me hdfs 0 2016-08-10 14:03 > /apps/hive/warehouse/deltas/delta_0020943_0020943 > {code} > Then if you run compaction on that table (in HiveCLI) : > {code} > ALTER TABLE deltas COMPACT 'MAJOR'; > {code} > As a result, the delta will be compute into a base file : > {code} > ~>hdfs dfs -ls /apps/hive/warehouse/deltas > Found 1 items > drwxrwxrwx - me hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > {code} > Go back to spark-sql and the same query gets a result : > {code} > SELECT * FROM deltas; > a a > Time taken: 0.477 seconds, Fetched 1 row(s) > {code} > But next time you make an insert into Hive table : > {code} > INSERT INTO deltas VALUES("b","b"); > {code} > spark-sql will immediately see changes : > {code} > SELECT * FROM deltas; > a a > b b > Time taken: 0.122 seconds, Fetched 2 row(s) > {code} > Yet there was no other compaction, but spark-sql "sees" the base AND the > delta file : > {code} > ~> hdfs dfs -ls /apps/hive/warehouse/deltas > Found 2 items > drwxrwxrwx - valdata hdfs 0 2016-08-10 15:25 > /apps/hive/warehouse/deltas/base_0020943 > drwxr-x--- - valdata hdfs 0 2016-08-10 15:31 > /apps/hive/warehouse/deltas/delta_0020956_0020956 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org