[GitHub] [iceberg] rlcyf opened a new issue #2289: data is not updated in spark-shell

GitBox Tue, 02 Mar 2021 09:58:03 -0800


rlcyf opened a new issue #2289:
URL: https://github.com/apache/iceberg/issues/2289



   
   spark 3.0.1
   iceberg 0.11
   
   ```
   # push one record to Kafka
   bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
   > {"user_id":1}
   ```
   
   ```
   // consume the topic with Structured Streaming; the write to Iceberg succeeds
   import java.util.concurrent.TimeUnit
   import org.apache.spark.sql.streaming.Trigger

   val tableIdentifier: String = ...
   data.writeStream
       .format("iceberg")
       .outputMode("append")
       .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
       .option("path", tableIdentifier)
       .option("checkpointLocation", checkpointPath)
       .start()
   ```
   
   When I run a query in spark-shell:
   ```
   bin/spark-shell \
     --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
     --conf spark.sql.catalog.prod=org.apache.iceberg.spark.SparkCatalog \
     --conf spark.sql.catalog.prod.type=hive \
     --conf spark.sql.catalog.prod.warehouse=hdfs://localhost:9000/prod \
     --conf spark.sql.warehouse.dir=hdfs://localhost:9000/prod
   
   spark.sql("select * from prod.db.sample").count
   res0: Long = 1
   
   # count on trino
   trino:db> select count(1) from prod.db.sample;
     1
    (1 rows)
   ```
   
   ```
   # push one more record
   bin/kafka-console-producer.sh --topic test --bootstrap-server localhost:9092
   > {"user_id":1}
   ```
   ```
   # query again in the spark-shell session that is still open
   spark.sql("select * from prod.db.sample").count
   res0: Long = 1
   
   # count on trino
   trino:db> select count(1) from prod.db.sample;
     2
    (1 rows)
   ```
   In Trino, the correct result is visible immediately, but spark-shell still returns the old count.
   When I close spark-shell and restart it:
   ```
   spark.sql("select * from prod.db.sample").count
   res0: Long = 2
   ```
   Now the result is correct.
   
   There is another case: if some time passes after inserting the data (I don't know exactly how long) and I query again, the result is also correct.
   
   Is a merge or compaction happening in the background?
   How can I configure spark-shell so that queries see the latest data in real time?
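   
   One thing that may be worth trying is sketched below. It assumes the stale count comes from table metadata cached inside the running Spark session, which is not confirmed here. The helper name `freshCount` is hypothetical; it issues Spark's `REFRESH TABLE` statement before counting, so the table is reloaded instead of being served from the session's cache.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   // Hypothetical helper: refresh the table's cached metadata, then count its rows.
   // Assumption: the stale result is caused by session-level caching of the table.
   def freshCount(spark: SparkSession, table: String): Long = {
     spark.sql(s"REFRESH TABLE $table")
     spark.table(table).count()
   }

   // In spark-shell, `spark` is already defined:
   // freshCount(spark, "prod.db.sample")
   ```
   
   An alternative under the same assumption is to start spark-shell with `--conf spark.sql.catalog.prod.cache-enabled=false`, so that the catalog does not cache tables at all; whether that is acceptable depends on how often the table is read.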
   
   

