Rap70r opened a new issue #2586:
URL: https://github.com/apache/hudi/issues/2586


   Hello,
   
We have a setup where we process data incrementally against large Hudi 
tables in S3, using Hudi and Spark. When reading a large table from a separate 
Spark process, or when running time-consuming queries against the resulting 
DataFrames, the reading process crashes if another process updates that table 
incrementally in the meantime. I assume this is because the underlying Parquet 
files are modified while the DataFrame is still being queried.
   How can we isolate the table when reading and querying that DataFrame in 
Spark, without being affected by the writers?
   
   * Sample Code
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.hudi._
   
   val ss = SparkSession.builder().getOrCreate()
   
   // Snapshot query: reads the latest committed state of the table
   val df = ss.read
        .format("org.apache.hudi")
        .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY,
          DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
        .load("s3://path/to/hudi/table/*")
   
   df.createOrReplaceTempView("hudi_table")
   ```
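   One workaround we are experimenting with (not sure it is the intended API) is to pin the read window to a fixed end instant via an incremental query, so that commits made after that instant are ignored by the reader. This is only a sketch: the option keys are the Hudi 0.7.x config strings as we understand them, and `latestCommit` is a placeholder for the most recent completed instant on the table's timeline, which we would have to look up ourselves.
   
   ```scala
   // Sketch: pin the read to a fixed end instant so later commits by writers
   // are not picked up. "latestCommit" is a hypothetical placeholder for the
   // most recent completed commit time on the table's timeline.
   val latestCommit = "20210218093000" // hypothetical instant time
   
   val pinnedReadOpts = Map(
     "hoodie.datasource.query.type" -> "incremental",
     "hoodie.datasource.read.begin.instanttime" -> "000", // from the beginning
     "hoodie.datasource.read.end.instanttime" -> latestCommit
   )
   
   // val pinnedDf = ss.read
   //   .format("org.apache.hudi")
   //   .options(pinnedReadOpts)
   //   .load("s3://path/to/hudi/table")
   ```
   
   Would this give the reader a stable view, or is there a better-supported way to get snapshot isolation?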
   
   While running queries against 'hudi_table', if any process updates the 
table under that S3 path, the query crashes.
   How can we guarantee snapshot isolation for the reader, so it is not 
affected by writers?
   
   **Environment Description**
   * Hudi version: 0.7.0
   * Spark version: 3.0.1
   * Hadoop version: 3.2.1
   * Storage: S3
   * Running on Docker: No
   
   Thank you
