Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/7336#issuecomment-120209897
  
    I'm thinking maybe we should make the change in 
`InsertIntoHiveTable.sideEffectResult`, where the writer container gets 
created, so that you don't need to do a pattern match later.
    
    Another high-level comment: although this change does work for your 
workload, the following statement in the PR description isn't correct:
    
    > This patch we shuffle data by the partition columns firstly so that each 
partition will have ony one partition file and this also reduce the gc overhead.
    
    By repartitioning the dataset by dynamic partition columns, you potentially 
reduce the number of dynamic partitions handled per task (which is why GC 
overhead drops), but that number can't be guaranteed to be reduced to 1.
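    To make the point concrete, here is a minimal, Spark-free sketch of why 
hash-based repartitioning can't guarantee one dynamic partition per task: with 
fewer reducers than distinct partition values, the pigeonhole principle forces 
some reducer to receive more than one partition (all names below are 
illustrative, not actual Spark APIs):

    ```scala
    object HashPartitionDemo {
      // Mimics hash-based shuffle partitioning: a dynamic partition value is
      // routed to a reducer by its hashCode modulo the number of reducers.
      def reducerFor(key: String, numReducers: Int): Int =
        ((key.hashCode % numReducers) + numReducers) % numReducers

      def main(args: Array[String]): Unit = {
        // 4 distinct dynamic partition values, but only 2 reducers.
        val partitionValues =
          Seq("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
        val numReducers = 2
        val perReducer = partitionValues.groupBy(reducerFor(_, numReducers))
        perReducer.foreach { case (r, vs) => println(s"reducer $r -> $vs") }
        // Some reducer necessarily handles more than one dynamic partition,
        // so it must still produce more than one partition file.
        println(perReducer.values.exists(_.size > 1))
      }
    }
    ```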
    
    Actually, we are also considering improving dynamic partition insertion 
via local sorting (sorting by partition columns with the spillable 
`ExternalSorter`), because when writing sorted data a task only needs to keep 
a single writer open at a time, and local sorting doesn't require a shuffle.
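    The single-writer property of the sorted approach can be sketched like 
this (a simulation under assumed names, not Spark's actual writer code): once 
rows are sorted by the partition column, rows of the same dynamic partition 
are contiguous, so a new writer is opened only when the partition value 
changes:

    ```scala
    object SortedWriteDemo {
      // Simulated rows: (dynamicPartitionValue, payload).
      // Returns how many writers the task had to open in total; at most one
      // of them is ever open at the same time.
      def writeSorted(rows: Seq[(String, String)]): Int = {
        var writersOpened = 0
        var current: Option[String] = None
        rows.sortBy(_._1).foreach { case (part, _) =>
          if (!current.contains(part)) {
            // Close the previous writer (if any) and open one for `part`.
            writersOpened += 1
            current = Some(part)
          }
          // Write the payload with the currently open writer.
        }
        writersOpened
      }

      def main(args: Array[String]): Unit = {
        val rows = Seq(("p2", "a"), ("p1", "b"), ("p2", "c"), ("p1", "d"))
        // One writer per distinct dynamic partition, opened sequentially.
        println(SortedWriteDemo.writeSorted(rows))
      }
    }
    ```

    Without the sort, the same task would need one writer open per distinct 
partition value it encounters, which is where the GC and file-handle pressure 
comes from.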

