Rap70r opened a new issue #2586:
URL: https://github.com/apache/hudi/issues/2586
Hello,
We have a setup where we process data incrementally against large Hudi
tables in S3, using Hudi and Spark. When reading a large table from a separate
Spark process, or when running time-consuming queries against its DataFrames,
the reading process crashes if another process updates that table
incrementally in the meantime. I assume this is because the underlying Parquet
files are modified while the DataFrame is still being queried.
How can we isolate the table when reading and querying that DataFrame
in Spark, without being affected by the writers?
* Sample Code
```scala
import org.apache.spark.sql.SparkSession
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Snapshot query against the Hudi table on S3.
val df = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
          DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://path/to/hudi/table/*")

df.createOrReplaceTempView("hudi_table")
```
While running queries against 'hudi_table', if any process updates the
table under that S3 path, the query crashes.
How can we guarantee snapshot isolation for readers, so they are not
affected by concurrent writers?
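For context, one workaround we have been experimenting with is materializing the snapshot eagerly so that later queries hit cached blocks instead of re-reading the S3 files. This is a minimal sketch using standard Spark caching, not a Hudi-specific isolation guarantee; it assumes the snapshot fits in the cluster's memory/disk cache, and a task that has to recompute a lost cached partition would still go back to S3:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.hudi.DataSourceReadOptions

val ss = SparkSession.builder().getOrCreate()

// Snapshot read, same as above.
val df = ss.read
  .format("org.apache.hudi")
  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
          DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .load("s3://path/to/hudi/table/*")

// Pin the snapshot: persist, then force a full scan now,
// before any writer commits new files under the table path.
val pinned = df.persist(StorageLevel.MEMORY_AND_DISK)
pinned.count()

pinned.createOrReplaceTempView("hudi_table")
```

This reduces, but does not eliminate, the window in which a concurrent write can break the read, which is why we are asking whether Hudi offers a proper snapshot-isolation mechanism for this case.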
**Environment Description**
* Hudi version: 0.7.0
* Spark version: 3.0.1
* Hadoop version: 3.2.1
* Storage: S3
* Running on Docker: No
Thank you