jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-834440235


   Oh, knowing about missing values is helpful, lemme dig more into that and 
see if I can replicate performance differences on those.
   
   Here's `summary()` of the fanniemae dataset; there are a decent chunk of 
NAs in it (and it shows the types too):
   
   ```
   > summary(df_fannie)
          f0                    f1                 f2                  f3       
    Min.   :100001420754   Length:22180168    Length:22180168    Min.   :1.750  
    1st Qu.:326084086722   Class :character   Class :character   1st Qu.:3.375  
    Median :550659611473   Mode  :character   Mode  :character   Median :3.500  
    Mean   :550440259075                                         Mean   :3.561  
    3rd Qu.:775451076920                                         3rd Qu.:3.750  
    Max.   :999999800242                                         Max.   :5.900  
                                                                                
          f4                f5              f6              f7       
    Min.   :      0   Min.   :-1.00   Min.   : 27.0   Min.   :  0.0  
    1st Qu.: 136660   1st Qu.: 8.00   1st Qu.:220.0   1st Qu.:214.0  
    Median : 206558   Median :16.00   Median :336.0   Median :333.0  
    Mean   : 226014   Mean   :16.65   Mean   :292.9   Mean   :287.5  
    3rd Qu.: 299858   3rd Qu.:25.00   3rd Qu.:348.0   3rd Qu.:347.0  
    Max.   :1203000   Max.   :58.00   Max.   :482.0   Max.   :480.0  
    NA's   :4058158                                   NA's   :28840  
         f8                  f9            f10                f11           
    Length:22180168    Min.   :    0   Length:22180168    Length:22180168   
    Class :character   1st Qu.:17460   Class :character   Class :character  
    Mode  :character   Median :31080   Mode  :character   Mode  :character  
                       Mean   :28225                                        
                       3rd Qu.:39580                                        
                       Max.   :49740                                        
                                                                            
         f12               f13                f14                f15           
    Min.   : 1         Length:22180168    Length:22180168    Length:22180168   
    1st Qu.: 1         Class :character   Class :character   Class :character  
    Median : 1         Mode  :character   Mode  :character   Mode  :character  
    Mean   : 1                                                                 
    3rd Qu.: 1                                                                 
    Max.   :16                                                                 
    NA's   :22061889                                                           
        f16                 f17                f18                f19          
    Length:22180168    Min.   :    3      Min.   :  187      Min.   :  65      
    Class :character   1st Qu.: 2945      1st Qu.:  730      1st Qu.:1319      
    Mode  :character   Median : 4658      Median : 3679      Median :2500      
                       Mean   : 5143      Mean   : 6808      Mean   :2561      
                       3rd Qu.: 7026      3rd Qu.: 7542      3rd Qu.:2605      
                       Max.   :23055      Max.   :55625      Max.   :9900      
                       NA's   :22180014   NA's   :22180087   NA's   :22180124  
         f20                f21                f22                f23          
    Min.   :-3561      Min.   :   87      Min.   :  4911     Min.   :   284    
    1st Qu.:  -34      1st Qu.: 1089      1st Qu.: 73622     1st Qu.: 12374    
    Median :  869      Median : 2214      Median :127763     Median : 20665    
    Mean   : 1345      Mean   : 3399      Mean   :147056     Mean   : 36262    
    3rd Qu.: 1877      3rd Qu.: 3980      3rd Qu.:198586     3rd Qu.: 40433    
    Max.   :36497      Max.   :24840      Max.   :465825     Max.   :539401    
    NA's   :22180041   NA's   :22180053   NA's   :22180023   NA's   :22180081  
         f24                f25                f26             f27          
    Min.   :126773     Min.   :     0     Min.   :     0     Mode:logical   
    1st Qu.:126773     1st Qu.:   110     1st Qu.:     0     NA's:22180168  
    Median :126773     Median :   500     Median :     0                    
    Mean   :126773     Mean   : 14636     Mean   :  2807                    
    3rd Qu.:126773     3rd Qu.:  2846     3rd Qu.:     0                    
    Max.   :126773     Max.   :328871     Max.   :129946                    
    NA's   :22180167   NA's   :22180095   NA's   :22151328                  
        f28              f29               f30           
    Length:22180168    Mode:logical    Length:22180168   
    Class :character   NA's:22180168   Class :character  
    Mode  :character                   Mode  :character  
   ```
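
   For a more compact view of where the NAs are than the full `summary()`, something like the following might help. It is only a sketch: the small `df` is a made-up stand-in (not the real `df_fannie`), and `na_frac` is a name chosen just for this example.

   ```r
   # Hypothetical stand-in for df_fannie; values are illustrative only
   df <- data.frame(
     f4  = c(136660, NA, 206558, NA),
     f7  = c(214, 333, NA, 347),
     f27 = c(NA, NA, NA, NA)
   )

   # is.na() on a data frame gives a logical matrix, so colMeans()
   # yields the fraction of missing values per column
   na_frac <- colMeans(is.na(df))

   # Show the sparsest columns first
   print(sort(na_frac, decreasing = TRUE))
   ```

   Running the same two lines against the real data frame would give the per-column NA fraction directly.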
   
   I also have been digging into differences across types. Factors seem to 
parallelize really well, so I tried to convert the chitraffic data frame, which 
is a mix of strings + numerics + 2 factor columns. When I do that (with 12 
CPU cores available), the most I’m seeing the CPU get to is ~140%, and even that 
is only briefly; most of the time the process is at 100%.
   
   ```
   > system.time(tab_chi_traffic <- arrow::Table$create(df_chi_traffic))
      user  system elapsed 
    29.093   0.797  28.002 
   ```
   
   I then created a silly version of this dataset where I converted each of the 
columns into a factor (totally naively with `as.factor()`), and converting that 
takes about half the elapsed time. The CPU usage peaks at ~300%, though it drops 
down to 100% and then bumps back up a few times.
   
   ```
   > system.time(tab_chi_traffic <- arrow::Table$create(df_chi_traffic_factors))
      user  system elapsed 
    31.073   1.194  15.857 
   ```
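
   For reference, the naive all-factors conversion could look roughly like this. It is a sketch under assumptions: `df_small` is a tiny made-up stand-in for `df_chi_traffic`, and the timing part only runs if the arrow package is installed. Turning numerics into factors also changes the resulting Arrow types, so this is not a like-for-like comparison.

   ```r
   # Hypothetical stand-in for df_chi_traffic: strings + numerics + a factor
   df_small <- data.frame(
     id     = 1:6,
     street = c("A", "B", "A", "C", "B", "A"),
     speed  = c(21.5, 30.0, 18.2, 25.1, 30.0, 19.9),
     region = factor(c("n", "s", "n", "n", "s", "s")),
     stringsAsFactors = FALSE
   )

   # Naively turn every column into a factor with as.factor()
   df_small_factors <- as.data.frame(lapply(df_small, as.factor))
   str(df_small_factors)

   # Time both conversions, but only when arrow is available
   if (requireNamespace("arrow", quietly = TRUE)) {
     print(system.time(tab <- arrow::Table$create(df_small)))
     print(system.time(tab_factors <- arrow::Table$create(df_small_factors)))
   }
   ```

   On a data frame this small the timings are meaningless; the point is only the shape of the conversion.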
   

