jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-834795030


   Here's another example: converting a data.frame of strings to an Arrow 
Table shows no parallelization, but convert those strings to factors and 
boom, we get parallelization:
   
   ```
   > library(arrow)
   
   Attaching package: ‘arrow’
   
   The following object is masked from ‘package:utils’:
   
       timestamp
   
   > 
   > # this sample is located at https://ursa-qa.s3.amazonaws.com/single_types/type_strings.parquet
   > # it is 1M rows, 5 columns. The first column has no missing, the second has 10% missing,
   > # the third 25% missing, the fourth 50% missing, and the 5th 90% missing.
   > strings_df <- read_parquet("~/repos/ab_store/data/type_strings.parquet")
   > 
   > # embiggen so that the transform differences are easier to see (and so we have more columns than cores)
   > strings_df <- dplyr::bind_cols(strings_df, strings_df, strings_df)
   New names:
   * jane -> jane...1
   * austen -> austen...2
   * sense -> sense...3
   * and -> and...4
   * sensibility -> sensibility...5
   * ...
   > strings_df <- dplyr::bind_rows(strings_df, strings_df, strings_df, strings_df, strings_df)
   > 
   > summary(strings_df)
      jane...1          austen...2         sense...3           and...4         
    Length:5000000     Length:5000000     Length:5000000     Length:5000000    
    Class :character   Class :character   Class :character   Class :character  
    Mode  :character   Mode  :character   Mode  :character   Mode  :character  
    sensibility...5      jane...6          austen...7         sense...8        
    Length:5000000     Length:5000000     Length:5000000     Length:5000000    
    Class :character   Class :character   Class :character   Class :character  
    Mode  :character   Mode  :character   Mode  :character   Mode  :character  
      and...9          sensibility...10    jane...11         austen...12       
    Length:5000000     Length:5000000     Length:5000000     Length:5000000    
    Class :character   Class :character   Class :character   Class :character  
    Mode  :character   Mode  :character   Mode  :character   Mode  :character  
     sense...13          and...14         sensibility...15  
    Length:5000000     Length:5000000     Length:5000000    
    Class :character   Class :character   Class :character  
    Mode  :character   Mode  :character   Mode  :character  
   > 
   > # when this runs, my cpu usage is always at 100% or slightly below (and note
   > # that user <= elapsed below)
   > system.time(tab <- Table$create(strings_df))
      user  system elapsed 
    31.855   0.842  32.806 
   > 
   > 
   > # naively turn the strings into factors:
   > strings_as_factors_df <- dplyr::mutate(strings_df, dplyr::across(.fns = as.factor))
   > 
   > summary(strings_as_factors_df)
       jane...1         austen...2        sense...3          and...4       
    Elino  :   1210   Elinor :   1140          :    915   Elino  :    625  
    Elinor :   1120   Elino  :   1125   Elinor :    895   Elinor :    625  
           :   1115          :    895   Elino  :    890          :    545  
    Mariann:    890   Maria  :    745   Marian :    705   Marian :    480  
    Elinor :    880   Elinor :    725   Elinor :    670   Elinor :    455  
    Marian :    865   (Other):4491725   (Other):3746365   (Other):2497925  
    (Other):4993920   NA's   : 503645   NA's   :1249560   NA's   :2499345  
    sensibility...5      jane...6         austen...7        sense...8      
    Elino  :     65   Elino  :   1210   Elinor :   1140          :    915  
    Marian :     65   Elinor :   1120   Elino  :   1125   Elinor :    895  
    Maria  :     60          :   1115          :    895   Elino  :    890  
    Elinor :     55   Mariann:    890   Maria  :    745   Marian :    705  
    Elinor :     55   Elinor :    880   Elinor :    725   Elinor :    670  
    (Other): 249925   Marian :    865   (Other):4491725   (Other):3746365  
    NA's   :4749775   (Other):4993920   NA's   : 503645   NA's   :1249560  
       and...9        sensibility...10    jane...11        austen...12     
    Elino  :    625   Elino  :     65   Elino  :   1210   Elinor :   1140  
    Elinor :    625   Marian :     65   Elinor :   1120   Elino  :   1125  
           :    545   Maria  :     60          :   1115          :    895  
    Marian :    480   Elinor :     55   Mariann:    890   Maria  :    745  
    Elinor :    455   Elinor :     55   Elinor :    880   Elinor :    725  
    (Other):2497925   (Other): 249925   Marian :    865   (Other):4491725  
    NA's   :2499345   NA's   :4749775   (Other):4993920   NA's   : 503645  
      sense...13         and...14       sensibility...15 
           :    915   Elino  :    625   Elino  :     65  
    Elinor :    895   Elinor :    625   Marian :     65  
    Elino  :    890          :    545   Maria  :     60  
    Marian :    705   Marian :    480   Elinor :     55  
    Elinor :    670   Elinor :    455   Elinor :     55  
    (Other):3746365   (Other):2497925   (Other): 249925  
    NA's   :1249560   NA's   :2499345   NA's   :4749775  
   > 
   > 
   > # when this runs, my cpu usage goes up to 400% (and user >> elapsed below)
   > system.time(tab <- Table$create(strings_as_factors_df))
      user  system elapsed 
    31.166   0.794   6.184 
   ```
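
One rough way to read the two `system.time()` results above: CPU time (user + system) divided by elapsed time approximates how many cores were busy on average during the conversion. This is only a sketch of that arithmetic using the numbers copied from the transcript; the data frame layout and the `cores_busy` column name are made up for illustration, and the ratio is a coarse indicator rather than a precise measurement.

```r
# Timings copied from the system.time() output in the transcript above.
timings <- data.frame(
  input   = c("strings", "factors"),
  user    = c(31.855, 31.166),
  system  = c(0.842, 0.794),
  elapsed = c(32.806, 6.184)
)

# Approximate average number of busy cores: CPU time / wall-clock time.
# `cores_busy` is a hypothetical name used only in this sketch.
timings$cores_busy <- (timings$user + timings$system) / timings$elapsed

print(timings)
```

A ratio near 1 is what a single-threaded conversion would look like, while a ratio well above 1 suggests the work was spread across several cores, which matches the CPU usage noted in the comments for each run.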



