vinothchandar edited a comment on issue #1498: Migrating parquet table to hudi 
issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-612252471
 
 
   @vontman @ahmed-elfar First of all, thanks for all the detailed information! 
   
   Answers to the good questions you raised: 
   
   > Is that the normal time for initial loading for Hudi tables, or are we doing something wrong?
   
   It's hard to say what a "normal" time is, since it depends on the schema, the machine, and many other things. But we shouldn't be this far off. I have tried to explain a few things below. 
   
   > Do we need a better cluster/resources to be able to load the data for the first time? It is mentioned on the Hudi confluence page that COW bulk_insert should match vanilla parquet writing + sort only.
   
   If you are ultimately trying to migrate a table (using bulk_insert once) and then do updates/deletes, I suggest testing upserts/deletes rather than bulk_insert. If you primarily want to do bulk_insert alone to get the other benefits of Hudi, I'm happy to work with you more and resolve this. Perf is a major push for the next release, so we can definitely collaborate here. A rough sketch of both write paths is below.
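   To make that comparison concrete, here is a minimal sketch of a one-time bulk_insert followed by steady-state upserts through the Spark datasource. The table name, paths, and field names (`record_key`, `partition_path`, `ts`), as well as `inputDf`/`updatesDf`, are placeholders standing in for your own data, not values from this issue:

```scala
import org.apache.spark.sql.SaveMode

// Placeholder table/field names -- substitute your own schema.
val hudiOptions = Map(
  "hoodie.table.name"                           -> "my_table",
  "hoodie.datasource.write.recordkey.field"     -> "record_key",
  "hoodie.datasource.write.partitionpath.field" -> "partition_path",
  "hoodie.datasource.write.precombine.field"    -> "ts"
)

// One-time migration of the existing parquet data.
inputDf.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .mode(SaveMode.Overwrite)
  .save("/path/to/hudi_table")

// Steady-state writes after migration -- this is the path worth benchmarking
// if updates/deletes are the long-term workload.
updatesDf.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/path/to/hudi_table")
```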
   
   
   > Does partitioning improve the upsert and/or compaction time for Hudi tables, or does it just improve analytical queries (partition pruning)?
   
   Partitioning would obviously benefit query performance. But for writing itself, I would say the data size matters more. 
   
   > We have noticed that most of the time is spent in the data indexing (the bulk-insert logic itself) and not in the sorting stages/operations before the indexing, so how can we improve that? Should we provide our own indexing logic?
   
   Nope, you don't have to supply your own indexing or anything. Bulk insert does not do any indexing; it does a global sort (so we can pack records belonging to the same partition into the same files as much as possible) and then writes out the files. 
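   Conceptually (this is only an illustration with placeholder column names, not Hudi's internal code), that heavy stage behaves like a plain Spark global sort keyed by partition path and record key before the files are written:

```scala
// Illustrative only -- a global sort that clusters records of the same
// partition together so they land in the same output files.
val sortedDf = inputDf.sort("partition_path", "record_key")
```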
   
   
   **A few observations:** 
   
   - The 47 min job is GC-ing quite a bit, and that can hurt throughput a lot. Have you tried tuning the JVM? (A sketch of possible settings follows after this list.)
   - I also see a fair bit of skew here from the sorting, which may be affecting overall run times. #1149 is also trying to provide a non-sorted mode, which trades off file sizing for potentially faster writing.
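   As one example of where those JVM options would go, here is a minimal sketch assuming executor GC is the bottleneck; the memory size and G1 flags are illustrative starting points, not tuned recommendations for this workload:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only -- tune against your own heap sizes and GC logs.
val spark = SparkSession.builder()
  .appName("hudi-bulk-insert-test")
  .config("spark.executor.memory", "8g")
  // Executor JVM options take effect only if set before executors launch.
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails -XX:+PrintGCDateStamps")
  .getOrCreate()
```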
   
   On what could create a difference between bulk_insert and plain spark/parquet writes:
   
   - I would also set `"hoodie.parquet.compression.codec" -> "SNAPPY"`, since Hudi uses gzip compression by default, whereas spark.write.parquet uses SNAPPY (see the sketch after this list). 
   - Hudi currently does an extra `df.rdd` conversion that could affect bulk_insert/insert (upsert/delete workloads are bound by merge costs, so it matters less there). I don't see that in your UI though. 
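   For instance, a hedged sketch of making the comparison apples-to-apples on compression (paths are placeholders, and `hudiOptions` refers to the record key/partition/precombine options from the earlier sketch):

```scala
// Baseline: vanilla Spark parquet write, which compresses with snappy by default.
inputDf.write.mode("overwrite").parquet("/path/to/plain_parquet")

// Hudi bulk_insert with the codec overridden to match the baseline.
inputDf.write.format("org.apache.hudi")
  .options(hudiOptions)  // record key / partition path / precombine options as above
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.parquet.compression.codec", "SNAPPY")
  .mode("overwrite")
  .save("/path/to/hudi_table")
```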
   
   
   
   
