[GitHub] [hudi] bvaradar commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

GitBox Fri, 31 Jul 2020 02:17:13 -0700


bvaradar commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-667023700



   For a monotonically increasing id, you can use bulk-insert instead of insert 
for the first-time load. Bulk-insert orders records by the id, which makes 
range pruning during index lookup efficient. The parallelism configuration 
https://hudi.apache.org/docs/configurations.html#withBulkInsertParallelism 
controls the number of files generated. 
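
   As a rough illustration of the bulk-insert setup above, here is a minimal 
sketch of the writer options for an initial load. The table name, field names 
and parallelism value are hypothetical placeholders, and the commented-out 
write call assumes a Spark DataFrame `df` and a `base_path` that are not part 
of this thread.

```python
def bulk_insert_options(table_name, record_key, precombine_field, parallelism):
    """Build Hudi datasource options for a first-time bulk-insert load."""
    return {
        "hoodie.table.name": table_name,
        # Use bulk_insert instead of insert for the initial load.
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Bulk-insert parallelism influences how many files get written.
        "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
    }


# Hypothetical values, for illustration only.
opts = bulk_insert_options("orders", "id", "updated_at", 200)

# With a SparkSession available, the options would be passed roughly like:
# df.write.format("org.apache.hudi").options(**opts).mode("overwrite").save(base_path)
```

   Lowering the parallelism should generally produce fewer, larger files; the 
right value depends on the size of the initial load.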
   
   `I will use AWS Athena to query all my tables, and this specific order table 
may be delayed up to 15 minutes. I saw that Athena only queries Read Optimized 
MoR; how could MoR help me in this case?`
     ===> I will let @umehrot2 answer this question. But if your use case 
allows, you can schedule compaction for the MOR table at a frequency that 
aligns with the SLA you want to maintain. This way you can still query that 
data using the RO (read-optimized) view.
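
   One way to reason about the compaction frequency is to derive the number 
of delta commits between compactions from the SLA and the commit interval. 
The sketch below assumes inline compaction and a fixed commit interval; the 
helper names and the 15-minute / 5-minute figures are illustrative 
assumptions, and whether inline or async compaction is the better fit depends 
on the writer and the Hudi version. It also ignores the time compaction 
itself takes, so treat the result as an upper bound.

```python
import math


def max_delta_commits(sla_minutes, commit_interval_minutes):
    """Largest number of delta commits that still fits inside the SLA window."""
    return max(1, math.floor(sla_minutes / commit_interval_minutes))


def compaction_options(sla_minutes, commit_interval_minutes):
    """Build MOR writer options that compact roughly once per SLA window."""
    n = max_delta_commits(sla_minutes, commit_interval_minutes)
    return {
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Run compaction as part of the write, every n delta commits.
        "hoodie.compact.inline": "true",
        "hoodie.compact.inline.max.delta.commits": str(n),
    }


# Assumed figures: a 15-minute freshness SLA with a commit every 5 minutes.
compaction_opts = compaction_options(15, 5)
```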
   
   For insert operations, the same config as for upsert controls file 
sizing: 'hoodie.parquet.max.file.size'.
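
   A small sketch of setting that file-size option for an insert write. The 
value is given in bytes; the 120 MB target and the helper name are 
illustrative assumptions, not recommendations from this thread.

```python
MB = 1024 * 1024


def file_sizing_options(max_file_mb):
    """Express the target parquet file size in bytes, as a string option."""
    return {"hoodie.parquet.max.file.size": str(max_file_mb * MB)}


# The same sizing option applies whether the operation is insert or upsert.
insert_opts = {
    "hoodie.datasource.write.operation": "insert",
    **file_sizing_options(120),
}
```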
   
   


