[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


cyb70289 commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647914768


   > I'm refactoring to nix util::optional. I'm too tired to finish it tonight 
so I'll work on it tomorrow morning. If the perf regression isn't gone I'll 
rewrite the sort kernels.
   
   Very likely I'm wrong. I remember util::optional is added due to CI failure 
https://github.com/apache/arrow/pull/6495#issuecomment-593732821
   
   I think this patch is okay.
   Sorting regression can be fixed (maybe improved). I'm okay to do the follow 
up changes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #6402: ARROW-7831: [Java] do not allocate a new offset buffer if the slice starts at 0 since the relative offset pointer would be unchanged

2020-06-22 Thread GitBox


emkornfield commented on pull request #6402:
URL: https://github.com/apache/arrow/pull/6402#issuecomment-647907608


   @jacques-n are you happy with the changes?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on a change in pull request #7321: ARROW-8985: [Format][DONOTMERGE] RFC Proposed Decimal::byteWidth field for forward compatibility

2020-06-22 Thread GitBox


emkornfield commented on a change in pull request #7321:
URL: https://github.com/apache/arrow/pull/7321#discussion_r443958283



##
File path: format/Schema.fbs
##
@@ -134,11 +134,20 @@ table FixedSizeBinary {
 table Bool {
 }
 
+/// Exact decimal value represented as an integer value in two's
+/// complement. Currently only 128-bit (16-byte) integers are used but this may

Review comment:
   endianness?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] emkornfield commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-22 Thread GitBox


emkornfield commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-647903916


   It would be great if @jacques-n could confirm he is OK with these changes.  
I believe on a previous PR the extra indirection was a concern.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647888437


   FWIW the performance issue seems to be more pronounced on gcc than clang, 
here is the benchmark comparison on my machine with clang-8
   
   ```
 benchmark baseline 
   contender  change %   counters
   1 SortToIndicesInt64Count/32768/1/min_time:1.0001.560 GiB/sec
2.000 GiB/sec28.163{'iterations': 70030, 'null_percent': 0.01}
   15  SortToIndicesInt64Compare/32768/1/min_time:1.000  145.735 MiB/sec  
158.918 MiB/sec 9.046 {'iterations': 6654, 'null_percent': 0.01}
   5 SortToIndicesInt64Compare/32768/100/min_time:1.000  149.117 MiB/sec  
159.609 MiB/sec 7.036  {'iterations': 6545, 'null_percent': 1.0}
   7   SortToIndicesInt64Compare/32768/0/min_time:1.000  153.027 MiB/sec  
162.227 MiB/sec 6.012  {'iterations': 6862, 'null_percent': 0.0}
   4  SortToIndicesInt64Compare/32768/10/min_time:1.000  160.419 MiB/sec  
167.725 MiB/sec 4.554 {'iterations': 6934, 'null_percent': 10.0}
   2   SortToIndicesInt64Compare/32768/2/min_time:1.000  255.024 MiB/sec  
260.284 MiB/sec 2.063{'iterations': 11390, 'null_percent': 50.0}
   9   SortToIndicesInt64Count/32768/100/min_time:1.0001.486 GiB/sec
1.458 GiB/sec-1.912 {'iterations': 66757, 'null_percent': 1.0}
   10SortToIndicesInt64Count/32768/0/min_time:1.0002.143 GiB/sec
2.067 GiB/sec-3.568 {'iterations': 98191, 'null_percent': 0.0}
   13  SortToIndicesInt64Count/8388608/1/min_time:1.0004.215 GiB/sec
3.813 GiB/sec-9.531 {'iterations': 762, 'null_percent': 100.0}
   11SortToIndicesInt64Count/32768/2/min_time:1.000  679.023 MiB/sec  
609.379 MiB/sec   -10.256{'iterations': 29602, 'null_percent': 50.0}
   0   SortToIndicesInt64Count/1048576/1/min_time:1.0004.487 GiB/sec
4.021 GiB/sec   -10.400{'iterations': 6550, 'null_percent': 100.0}
   12SortToIndicesInt64Compare/8388608/1/min_time:1.0004.250 GiB/sec
3.762 GiB/sec   -11.476 {'iterations': 766, 'null_percent': 100.0}
   6 SortToIndicesInt64Count/32768/1/min_time:1.0004.758 GiB/sec
4.185 GiB/sec   -12.040  {'iterations': 217705, 'null_percent': 100.0}
   8   SortToIndicesInt64Compare/32768/1/min_time:1.0004.730 GiB/sec
4.125 GiB/sec   -12.780  {'iterations': 213908, 'null_percent': 100.0}
   3 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.556 GiB/sec
3.953 GiB/sec   -13.228{'iterations': 6539, 'null_percent': 100.0}
   14   SortToIndicesInt64Count/32768/10/min_time:1.0001.316 GiB/sec
1.051 GiB/sec   -20.108{'iterations': 59539, 'null_percent': 10.0}
   ```
   
   The perf regression with the data 100% null is an artifact of the improper 
implementation



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647888437


   FWIW the performance issue seems to be more pronounced on gcc than clang, 
here is the benchmark comparison on my machine with clang-8
   
   ```
 benchmark baseline 
   contender  change %   counters
   1 SortToIndicesInt64Count/32768/1/min_time:1.0001.560 GiB/sec
2.000 GiB/sec28.163{'iterations': 70030, 'null_percent': 0.01}
   15  SortToIndicesInt64Compare/32768/1/min_time:1.000  145.735 MiB/sec  
158.918 MiB/sec 9.046 {'iterations': 6654, 'null_percent': 0.01}
   5 SortToIndicesInt64Compare/32768/100/min_time:1.000  149.117 MiB/sec  
159.609 MiB/sec 7.036  {'iterations': 6545, 'null_percent': 1.0}
   7   SortToIndicesInt64Compare/32768/0/min_time:1.000  153.027 MiB/sec  
162.227 MiB/sec 6.012  {'iterations': 6862, 'null_percent': 0.0}
   4  SortToIndicesInt64Compare/32768/10/min_time:1.000  160.419 MiB/sec  
167.725 MiB/sec 4.554 {'iterations': 6934, 'null_percent': 10.0}
   2   SortToIndicesInt64Compare/32768/2/min_time:1.000  255.024 MiB/sec  
260.284 MiB/sec 2.063{'iterations': 11390, 'null_percent': 50.0}
   9   SortToIndicesInt64Count/32768/100/min_time:1.0001.486 GiB/sec
1.458 GiB/sec-1.912 {'iterations': 66757, 'null_percent': 1.0}
   10SortToIndicesInt64Count/32768/0/min_time:1.0002.143 GiB/sec
2.067 GiB/sec-3.568 {'iterations': 98191, 'null_percent': 0.0}
   13  SortToIndicesInt64Count/8388608/1/min_time:1.0004.215 GiB/sec
3.813 GiB/sec-9.531 {'iterations': 762, 'null_percent': 100.0}
   11SortToIndicesInt64Count/32768/2/min_time:1.000  679.023 MiB/sec  
609.379 MiB/sec   -10.256{'iterations': 29602, 'null_percent': 50.0}
   0   SortToIndicesInt64Count/1048576/1/min_time:1.0004.487 GiB/sec
4.021 GiB/sec   -10.400{'iterations': 6550, 'null_percent': 100.0}
   12SortToIndicesInt64Compare/8388608/1/min_time:1.0004.250 GiB/sec
3.762 GiB/sec   -11.476 {'iterations': 766, 'null_percent': 100.0}
   6 SortToIndicesInt64Count/32768/1/min_time:1.0004.758 GiB/sec
4.185 GiB/sec   -12.040  {'iterations': 217705, 'null_percent': 100.0}
   8   SortToIndicesInt64Compare/32768/1/min_time:1.0004.730 GiB/sec
4.125 GiB/sec   -12.780  {'iterations': 213908, 'null_percent': 100.0}
   3 SortToIndicesInt64Compare/1048576/1/min_time:1.0004.556 GiB/sec
3.953 GiB/sec   -13.228{'iterations': 6539, 'null_percent': 100.0}
   14   SortToIndicesInt64Count/32768/10/min_time:1.0001.316 GiB/sec
1.051 GiB/sec   -20.108{'iterations': 59539, 'null_percent': 10.0}
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135


   I'm refactoring to nix util::optional. I'm too tired to finish it tonight so 
I'll work on it tomorrow morning. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135


   I'm refactoring to nix util::optional. I'm too tired to finish it tonight so 
I'll work on it tomorrow morning. If the perf regression isn't gone I'll 
rewrite the sort kernels. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647887135


   I'm refactoring to nix util::optional. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647886669


   Also, I don't really understand the use of `util::optional` in these 
templates. The user should pass separate lambdas for the not-null and null cases



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747


   Sorting seems too important to leave it to these relatively complex 
templates (for example, just after determining that a value is not null, the 
`optional` value is checked whether it's null again!), I would suggest 
implementing the counting sort without using `VisitArrayDataInline`. I'm happy 
to help with this. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747


   Sorting seems too important to leave it to these relatively complex 
templates, I would suggest implementing the counting sort without using 
`VisitArrayDataInline`. I'm happy to help with this. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884747


   Sorting seems too important to leave it to these relatively complex 
templates, I would suggest implementing the counting sort without using 
`VisitArrayDataInline`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


cyb70289 commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647884195


   I see big performance drop from some counting sort cases, also tested on my 
local machine. Should be related to these visitor code: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L133-L155



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] ursabot commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


ursabot commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647883517


   [AMD64 Ubuntu 18.04 C++ Benchmark 
(#114347)](https://ci.ursalabs.org/#builders/73/builds/89) builder has been 
succeeded.
   
   Revision: dbd166df749e73cbf7c1ec0c6cfa5837280aa32d
   
   ```diff
   ===  
===  
 benchmark baseline 
contenderchange
   ===  
===  
 SortToIndicesInt64Count/32768/1/min_time:1.0002.690 GiB/sec
2.646 GiB/sec-1.654%
 SortToIndicesInt64Count/1048576/1/min_time:1.000  3.244 GiB/sec
3.198 GiB/sec-1.423%
 SortToIndicesInt64Compare/32768/1/min_time:1.000  103.004 MiB/sec  
101.724 MiB/sec  -1.243%
 SortToIndicesInt64Compare/32768/1/min_time:1.000  2.685 GiB/sec
2.612 GiB/sec-2.707%
 SortToIndicesInt64Compare/32768/0/min_time:1.000  105.027 MiB/sec  
103.783 MiB/sec  -1.184%
   - SortToIndicesInt64Compare/32768/10/min_time:1.000 109.648 MiB/sec  
102.376 MiB/sec  -6.633%
   - SortToIndicesInt64Count/32768/10/min_time:1.000   701.425 MiB/sec  
286.420 MiB/sec  -59.166%
   - SortToIndicesInt64Count/32768/100/min_time:1.000  686.441 MiB/sec  
386.614 MiB/sec  -43.678%
 SortToIndicesInt64Compare/8388608/1/min_time:1.0003.162 GiB/sec
3.201 GiB/sec1.242%
   - SortToIndicesInt64Count/32768/2/min_time:1.000526.866 MiB/sec  
259.139 MiB/sec  -50.815%
   - SortToIndicesInt64Count/32768/1/min_time:1.000683.857 MiB/sec  
599.732 MiB/sec  -12.301%
 SortToIndicesInt64Compare/32768/100/min_time:1.000103.157 MiB/sec  
98.649 MiB/sec   -4.370%
 SortToIndicesInt64Count/8388608/1/min_time:1.000  3.259 GiB/sec
3.211 GiB/sec-1.495%
 SortToIndicesInt64Count/32768/0/min_time:1.000647.629 MiB/sec  
627.171 MiB/sec  -3.159%
 SortToIndicesInt64Compare/1048576/1/min_time:1.0003.197 GiB/sec
3.198 GiB/sec0.035%
   - SortToIndicesInt64Compare/32768/2/min_time:1.000  171.750 MiB/sec  
162.637 MiB/sec  -5.306%
   ===  
===  
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


wesm commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647883524


   I improved the output to show the `state.counters` stuff
   
   ```
 benchmark baselinecontender  change %  
  counters
   40UniqueInt64/56.442 GiB/sec   18.346 GiB/sec   184.782  
{'iterations': 145, 'null_percent': 100.0}
   0UniqueInt64/116.500 GiB/sec   18.364 GiB/sec   182.522  
{'iterations': 145, 'null_percent': 100.0}
   11UniqueUInt8/5  812.047 MiB/sec1.755 GiB/sec   121.298  
{'iterations': 142, 'null_percent': 100.0}
   7 UniqueUInt8/1  683.943 MiB/sec1.253 GiB/sec87.593
{'iterations': 117, 'null_percent': 0.1}
   38UniqueUInt8/4  762.983 MiB/sec  950.521 MiB/sec24.580   
{'iterations': 133, 'null_percent': 99.0}
   29UniqueUInt8/2  659.082 MiB/sec  820.410 MiB/sec24.478
{'iterations': 114, 'null_percent': 1.0}
   5 UniqueInt64/12.656 GiB/sec3.300 GiB/sec24.223 
{'iterations': 60, 'null_percent': 0.1}
   32UniqueInt64/45.627 GiB/sec6.772 GiB/sec20.349   
{'iterations': 119, 'null_percent': 99.0}
   25   UniqueInt64/105.234 GiB/sec6.294 GiB/sec20.254   
{'iterations': 110, 'null_percent': 99.0}
   39  UniqueString100bytes/11   26.815 GiB/sec   31.122 GiB/sec16.061   
{'iterations': 48, 'null_percent': 100.0}
   23UniqueString10bytes/52.691 GiB/sec3.113 GiB/sec15.667   
{'iterations': 48, 'null_percent': 100.0}
   34   UniqueString100bytes/5   26.944 GiB/sec   31.015 GiB/sec15.108   
{'iterations': 48, 'null_percent': 100.0}
   6UniqueString10bytes/112.699 GiB/sec3.096 GiB/sec14.721   
{'iterations': 49, 'null_percent': 100.0}
   21   UniqueString100bytes/71.947 GiB/sec2.217 GiB/sec13.866  
{'iterations': 3, 'null_percent': 0.1}
   28UniqueInt64/22.622 GiB/sec2.904 GiB/sec10.770 
{'iterations': 59, 'null_percent': 1.0}
   13UniqueInt64/32.157 GiB/sec2.343 GiB/sec 8.644
{'iterations': 48, 'null_percent': 10.0}
   33   UniqueString100bytes/4   24.286 GiB/sec   26.030 GiB/sec 7.181
{'iterations': 43, 'null_percent': 99.0}
   22UniqueInt64/72.542 GiB/sec2.707 GiB/sec 6.497 
{'iterations': 56, 'null_percent': 0.1}
   20  UniqueString100bytes/10   22.536 GiB/sec   23.985 GiB/sec 6.432
{'iterations': 40, 'null_percent': 99.0}
   35UniqueString10bytes/1  788.817 MiB/sec  836.008 MiB/sec 5.983 
{'iterations': 14, 'null_percent': 0.1}
   17UniqueString10bytes/7  592.671 MiB/sec  628.054 MiB/sec 5.970 
{'iterations': 10, 'null_percent': 0.1}
   3 UniqueString10bytes/42.515 GiB/sec2.658 GiB/sec 5.687
{'iterations': 45, 'null_percent': 99.0}
   19   UniqueString10bytes/102.402 GiB/sec2.529 GiB/sec 5.269
{'iterations': 42, 'null_percent': 99.0}
   9UniqueString100bytes/13.929 GiB/sec4.077 GiB/sec 3.762  
{'iterations': 7, 'null_percent': 0.1}
   30UniqueString10bytes/8  593.560 MiB/sec  610.253 MiB/sec 2.812 
{'iterations': 10, 'null_percent': 1.0}
   12UniqueString10bytes/2  788.505 MiB/sec  808.396 MiB/sec 2.523 
{'iterations': 14, 'null_percent': 1.0}
   37   UniqueString100bytes/81.965 GiB/sec1.998 GiB/sec 1.697  
{'iterations': 3, 'null_percent': 1.0}
   1UniqueString100bytes/23.984 GiB/sec4.025 GiB/sec 1.028  
{'iterations': 7, 'null_percent': 1.0}
   36   UniqueString100bytes/34.262 GiB/sec4.293 GiB/sec 0.725 
{'iterations': 8, 'null_percent': 10.0}
   8 BuildStringDictionary   85.507 MiB/sec   85.687 MiB/sec 0.211  
   {'iterations': 198}
   16   UniqueString100bytes/92.121 GiB/sec2.111 GiB/sec-0.469 
{'iterations': 4, 'null_percent': 10.0}
   4UniqueString100bytes/62.056 GiB/sec2.043 GiB/sec-0.626  
{'iterations': 4, 'null_percent': 0.0}
   10UniqueUInt8/3  453.281 MiB/sec  448.407 MiB/sec-1.075
{'iterations': 79, 'null_percent': 10.0}
   14   UniqueString100bytes/04.100 GiB/sec4.055 GiB/sec-1.089  
{'iterations': 7, 'null_percent': 0.0}
   24UniqueInt64/82.473 GiB/sec2.443 GiB/sec-1.202 
{'iterations': 55, 'null_percent': 1.0}
   26UniqueString10bytes/9  615.880 MiB/sec  608.453 MiB/sec-1.206
{'iterations': 11, 'null_percent': 10.0}
   42UniqueString10bytes/6  651.430 MiB/sec  640.128 MiB/sec-1.735 
{'iterations': 11, 'null_percent': 0.0}
   27UniqueUInt8/01.775 GiB/sec1.738 GiB/sec-2.063
{'iterations': 318, 'null_percent': 0.0}
   31UniqueInt64/92.076 GiB/sec2.033 GiB/sec-2.067
{'iterations': 46, 'null_percent': 10.0}
   15  

[GitHub] [arrow] cyb70289 commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


cyb70289 commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647878260


   @ursabot benchmark --suite-filter=arrow-compute-vector-sort-benchmark



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443928439



##
File path: python/pyarrow/tensor.pxi
##
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
 for x in dim_names:
 c_dim_names.push_back(tobytes(x))
 
-coords = np.vstack([obj.row, obj.col]).T
+row = obj.row
+col = obj.col
+if obj.has_canonical_format:

Review comment:
   The attribute [actually 
exists](https://github.com/scipy/scipy/blob/master/scipy/sparse/coo.py#L137) 
but not documented.  It may be an internal attribute, but it does not have `_` 
prefix, so I detect it can be used.  I need to refer this attribute to 
recognize whether or not a given SciPy's coo_matrix has canonical format.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443922882



##
File path: python/pyarrow/tensor.pxi
##
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
 for x in dim_names:
 c_dim_names.push_back(tobytes(x))
 
-coords = np.vstack([obj.row, obj.col]).T
+row = obj.row
+col = obj.col
+if obj.has_canonical_format:
+order = np.lexsort((col, row))  # row-major order

Review comment:
   SciPy's coo_matrix employs column-major order, but our SparseCOOIndex 
employs row-major order.
   This is the reason why we need to sort the index here.
   
   The reason why our SparseCOOIndex should employ row-major order is that 
TensorFlow's SparseTensor employs row-major order.  Converting from SciPy's 
coo_matrrix needs to copy the index matrix even if we employ column-major 
order, but converting from TensorFlow's SparseTensor does not need to copy the 
index matrix if we employ row-major order.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: [C++][Python] Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-647860707


   > I don't understand the PR's motivation. You say that:
   > 
   > > To support the integration with scipy.sparse.coo_matrix, it is necessary 
to add a flag in SparseCOOIndex
   > > but the PR doesn't showcase it at all. There is no added integration.
   
   @pitrou I added the description as some comments at the appropriate places 
in the additional commit 1e0b28c3d.  Can these comments give you an 
understanding?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647860226


   @kou utf8proc should only be used in a small number of compilation units, so 
what do you think about just using `set_source_files_properties(... PROPERTIES 
COMPILE_DEFINITIONS UTF8PROC_STATIC)` in those files? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox


wesm commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647860226


   @kou utf8proc should only be used in a small number of compilation units, so 
what do you think about just using `set_target_properties(... PROPERTIES 
COMPILE_DEFINITIONS UTF8PROC_STATIC)` in those files? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox


wesm commented on a change in pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#discussion_r443914793



##
File path: docs/source/developers/contributing.rst
##
@@ -17,53 +17,73 @@
 
 .. _contributing:
 
-***
-Contribution Guidelines
-***
+
+Contributing to Apache Arrow
+
 
-There are many ways to contribute to Apache Arrow:
+Thanks for your interest in the Apache Arrow project. Arrow is a large project
+and may seem overwhelming when you're first getting involved.
+Contributing code is great, but that's probably not the first place to start.
+There are lots of ways to make valuable contributions to the project and
+community.
 
-* Contributing code (we call them "patches")
-* Writing documentation (another form of code, in a way)
-* Participating in discussions on `JIRA 
`_ or the `mailing list 
`_
-* Helping users of the libraries
-* Reporting bugs and asking questions
+This page provides some orientation for how to get involved. It also offers
+some recommendations on how to get best results when engaging with the
+community.
 
-Mailing List
-
+Join the mailing lists
+==
 
+A good first step to getting involved in the Arrow project is to join the
+mailing lists and participate in discussions where you can.
 Projects in The Apache Software Foundation ("the ASF") use public, archived
 mailing lists to create a public record of each project's development
-activities and decision-making process. As such, all contributors generally
-must be subscribed to the d...@arrow.apache.org mailing list to participate in
-the community.
+activities and decision-making process.
+While lacking the immediacy of chat or other forms of communication,
+the mailing lists give participants the opportunity to slow down and be
+thoughtful in their responses, and they help developers who are spread across
+many timezones to participate more equally.

Review comment:
   Good. 

##
File path: docs/source/developers/contributing.rst
##
@@ -76,46 +96,83 @@ visibility. They may add a "Fix version" to indicate that 
they're considering
 it for inclusion in the next release, though adding that tag is not a
 commitment that it will be done in the next release.
 
-Advanced use
-
-
-Once you are involved in the project and want to do more on JIRA, such as
-assign yourself an issue, you will need "Contributor" permissions on the
-Apache Arrow JIRA. To get this role, ask on the mailing list for a project
-maintainer's help.
-
-GitHub issues
--
-
-We support `GitHub issues `_ as a
-lightweight way to ask questions and engage with
-the Arrow developer community. We use JIRA for maintaining a queue of
-development work and as the public record for work on the project. So, feel
-free to open GitHub issues, but bugs and feature requests will eventually need
-to end up in JIRA, either before or after completing a pull request. Don't be
-surprised if you are immediately asked by a project maintainer to open a JIRA
-issue.
-
-How to contribute patches
-=
-
-We prefer to receive contributions in the form of GitHub pull requests. Please
-send pull requests against the `github.com/apache/arrow
-`_ repository following the procedure below.
-
-If you are looking for some ideas on what to contribute, check out the JIRA
-issues for the Apache Arrow project. Comment on the issue and/or contact
-d...@arrow.apache.org with your questions and ideas.
-
-If you’d like to report a bug but don’t have time to fix it, you can still post
-it on JIRA, or email the mailing list d...@arrow.apache.org.
+Tips for successful bug reports

+
+No one likes having bugs in their software, and in an ideal world, all bugs
+would get fixed as soon as they were reported. However, time and attention are
+finite, especially in an open-source project where most contributors are
+participating in their spare time. In order for your bug to get prompt

Review comment:
   May be worth emphasizing here that all contributors in Apache projects 
are volunteers and act as individuals, even if they are contributing to the 
project as part of their job responsibilities. 

##
File path: docs/source/developers/contributing.rst
##
@@ -17,53 +17,73 @@
 
 .. _contributing:
 
-***
-Contribution Guidelines
-***
+
+Contributing to Apache Arrow
+
 
-There are many ways to contribute to Apache Arrow:
+Thanks for your interest in the Apache Arrow project. Arrow is a large project
+and may seem overwhelming when you're first getting involved.

[GitHub] [arrow] github-actions[bot] commented on pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7522:
URL: https://github.com/apache/arrow/pull/7522#issuecomment-647857111


   https://issues.apache.org/jira/browse/ARROW-8801



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox


wesm commented on a change in pull request #7522:
URL: https://github.com/apache/arrow/pull/7522#discussion_r443913968



##
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##
@@ -1269,13 +1283,18 @@ class DatetimeTZWriter : public DatetimeNanoWriter {
   : DatetimeNanoWriter(options, num_rows, 1), timezone_(timezone) {}
 
  protected:
-  Status GetResultBlock(PyObject** out) override { return GetBlock1D(out); }
+  Status GetResultBlock(PyObject** out) override {
+RETURN_NOT_OK(MakeBlock1D());
+*out = block_arr_.obj();
+return Status::OK();
+  }
 
   Status AddResultMetadata(PyObject* result) override {
 PyObject* py_tz = PyUnicode_FromStringAndSize(
 timezone_.c_str(), static_cast(timezone_.size()));
 RETURN_IF_PYERROR();
 PyDict_SetItemString(result, "timezone", py_tz);
+Py_DECREF(py_tz);

Review comment:
   This was an small memory leak that users are unlikely to observe but I 
found it while looking at usages of PyDict_SetItemString (which increments the 
reference count of the passed dict value)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7522: ARROW-8801: [Python] Fix memory leak when converting datetime64-with-tz data to pandas

2020-06-22 Thread GitBox


wesm opened a new pull request #7522:
URL: https://github.com/apache/arrow/pull/7522


   An array object was failing to be decref'd on the DatetimeTZ conversion 
path. The code is slightly complicated by the different calling reference 
ownership semantics of the Array/ChunkedArray conversion path (which expects to 
own the created array when it calls `GetSeriesResult` while the 
`GetResultBlock` code retains its array in a `OwnedRefNoGIL`). This was the 
simplest thing that fixed the memory leak for me. There is leak checking code 
that can be used to verify this in python/scripts/test_leak.py (just run the 
script). 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kiszk commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-22 Thread GitBox


kiszk commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r443912778



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -409,6 +414,32 @@ class IpcTestFixture : public io::MemoryMapFixture, public 
ExtensionTypesMixin {
 CheckRoundtrip(*batch, options, IpcReadOptions::Defaults(), buffer_size);
   }
 
+  void CheckRoundTrip(const RecordBatch& src_batch,
+  const RecordBatch& expected_batch,
+  IpcWriteOptions options = IpcWriteOptions::Defaults(),
+  IpcReadOptions read_options = IpcReadOptions::Defaults(),
+  int64_t buffer_size = 1 << 20) {
+std::stringstream ss;
+ss << "test-write-row-batch-" << g_file_number++;
+ASSERT_OK_AND_ASSIGN(mmap_,
+ io::MemoryMapFixture::InitMemoryMap(buffer_size, 
ss.str()));
+
+DictionaryMemo dictionary_memo;
+
+std::shared_ptr schema_result;
+DoSchemaRoundTrip(*src_batch.schema(), _memo, _result);
+ASSERT_TRUE(expected_batch.schema()->Equals(*schema_result));

Review comment:
   Thank you for your comment. You are right. We should keep the original 
schema until `ReadRecordBatch`.   
   Then, if `LoadRecordBatchSubset` will automatically convert the endian based 
on the schema's endianness, `LoadRecordBatchSubset` will update the endian flag 
in the schema.
   
   I will move this assertion after `Do*RoundTrip`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou closed pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


kou closed pull request #7509:
URL: https://github.com/apache/arrow/pull/7509


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


kou commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647843737


   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647836070


   Revision: cb16452e499fb61e6ec231fed4c2ddc7fe88d11b
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-360](https://github.com/ursa-labs/crossbow/branches/all?query=actions-360)
   
   |Task|Status|
   ||--|
   
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-360-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox


kou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647834815


   The change doesn't add `UTF8PROC_STATIC` definition...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-22 Thread GitBox


kou commented on pull request #7452:
URL: https://github.com/apache/arrow/pull/7452#issuecomment-647834442


   TODO:
   
 * Need the `UTF8PROC_STATIC` definition with bundled utf8proc
 * Need to add include path for `utf8proc.h`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


kou commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647833733


   @github-actions crossbow submit debian-buster-arm64



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647825256


   FWIW on the "gcc/clang perf discussion", clang also shows performance 
benefits and limited downside
   
   ```
 benchmark baselinecontender  change %  
regression
   2UniqueInt64/116.444 GiB/sec   18.511 GiB/sec   187.240  
 False
   31UniqueInt64/56.470 GiB/sec   18.390 GiB/sec   184.244  
 False
   39UniqueUInt8/5  810.180 MiB/sec1.747 GiB/sec   120.867  
 False
   26UniqueUInt8/1  683.475 MiB/sec1.430 GiB/sec   114.196  
 False
   42UniqueInt64/45.424 GiB/sec6.965 GiB/sec28.397  
 False
   18UniqueInt64/12.672 GiB/sec3.411 GiB/sec27.627  
 False
   40UniqueUInt8/2  654.320 MiB/sec  826.916 MiB/sec26.378  
 False
   33UniqueUInt8/4  758.115 MiB/sec  947.360 MiB/sec24.962  
 False
   25   UniqueInt64/105.248 GiB/sec6.426 GiB/sec22.460  
 False
   9UniqueString100bytes/5   26.923 GiB/sec   32.142 GiB/sec19.384  
 False
   35   UniqueString10bytes/112.691 GiB/sec3.207 GiB/sec19.173  
 False
   3 UniqueString10bytes/52.695 GiB/sec3.200 GiB/sec18.731  
 False
   20  UniqueString100bytes/11   26.909 GiB/sec   31.831 GiB/sec18.291  
 False
   30UniqueInt64/72.514 GiB/sec2.890 GiB/sec14.960  
 False
   37UniqueInt64/22.619 GiB/sec2.975 GiB/sec13.578  
 False
   11UniqueString10bytes/42.487 GiB/sec2.700 GiB/sec 8.596  
 False
   32   UniqueString10bytes/102.386 GiB/sec2.589 GiB/sec 8.481  
 False
   0UniqueString100bytes/4   24.419 GiB/sec   26.365 GiB/sec 7.966  
 False
   38  UniqueString100bytes/10   22.463 GiB/sec   24.128 GiB/sec 7.411  
 False
   34UniqueInt64/82.392 GiB/sec2.563 GiB/sec 7.157  
 False
   19UniqueString10bytes/1  781.817 MiB/sec  835.760 MiB/sec 6.900  
 False
   43UniqueInt64/32.184 GiB/sec2.331 GiB/sec 6.721  
 False
   24UniqueString10bytes/7  583.523 MiB/sec  621.007 MiB/sec 6.424  
 False
   15   UniqueString100bytes/71.936 GiB/sec2.024 GiB/sec 4.538  
 False
   6 UniqueString10bytes/2  780.337 MiB/sec  805.686 MiB/sec 3.248  
 False
   27   UniqueString100bytes/23.934 GiB/sec4.059 GiB/sec 3.197  
 False
   13   UniqueString100bytes/13.898 GiB/sec3.995 GiB/sec 2.485  
 False
   7 UniqueString10bytes/8  592.115 MiB/sec  604.865 MiB/sec 2.153  
 False
   29   UniqueString100bytes/81.969 GiB/sec2.011 GiB/sec 2.111  
 False
   21UniqueInt64/92.034 GiB/sec2.048 GiB/sec 0.676  
 False
   1 BuildStringDictionary   85.937 MiB/sec   85.928 MiB/sec-0.010  
 False
   41UniqueUInt8/3  449.171 MiB/sec  448.844 MiB/sec-0.073  
 False
   28   UniqueString100bytes/04.084 GiB/sec4.077 GiB/sec-0.161  
 False
   4UniqueString100bytes/34.255 GiB/sec4.235 GiB/sec-0.450  
 False
   5UniqueString100bytes/62.054 GiB/sec2.033 GiB/sec-1.041  
 False
   14   UniqueString100bytes/92.138 GiB/sec2.107 GiB/sec-1.449  
 False
   8 UniqueUInt8/01.777 GiB/sec1.750 GiB/sec-1.487  
 False
   23UniqueInt64/03.860 GiB/sec3.799 GiB/sec-1.560  
 False
   10UniqueString10bytes/9  616.458 MiB/sec  605.470 MiB/sec-1.782  
 False
   22UniqueString10bytes/3  799.494 MiB/sec  783.825 MiB/sec-1.960  
 False
   17UniqueString10bytes/6  647.921 MiB/sec  631.631 MiB/sec-2.514  
 False
   36  BuildDictionary1.539 GiB/sec1.498 GiB/sec-2.694  
 False
   16UniqueInt64/63.193 GiB/sec3.077 GiB/sec-3.634  
 False
   12UniqueString10bytes/0  881.975 MiB/sec  839.487 MiB/sec-4.817  
 False
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647825256


   FWIW on the "gcc/clang perf discussion", clang-8 also shows performance 
benefits and limited downside
   
   ```
 benchmark baselinecontender  change %  
regression
   2UniqueInt64/116.444 GiB/sec   18.511 GiB/sec   187.240  
 False
   31UniqueInt64/56.470 GiB/sec   18.390 GiB/sec   184.244  
 False
   39UniqueUInt8/5  810.180 MiB/sec1.747 GiB/sec   120.867  
 False
   26UniqueUInt8/1  683.475 MiB/sec1.430 GiB/sec   114.196  
 False
   42UniqueInt64/45.424 GiB/sec6.965 GiB/sec28.397  
 False
   18UniqueInt64/12.672 GiB/sec3.411 GiB/sec27.627  
 False
   40UniqueUInt8/2  654.320 MiB/sec  826.916 MiB/sec26.378  
 False
   33UniqueUInt8/4  758.115 MiB/sec  947.360 MiB/sec24.962  
 False
   25   UniqueInt64/105.248 GiB/sec6.426 GiB/sec22.460  
 False
   9UniqueString100bytes/5   26.923 GiB/sec   32.142 GiB/sec19.384  
 False
   35   UniqueString10bytes/112.691 GiB/sec3.207 GiB/sec19.173  
 False
   3 UniqueString10bytes/52.695 GiB/sec3.200 GiB/sec18.731  
 False
   20  UniqueString100bytes/11   26.909 GiB/sec   31.831 GiB/sec18.291  
 False
   30UniqueInt64/72.514 GiB/sec2.890 GiB/sec14.960  
 False
   37UniqueInt64/22.619 GiB/sec2.975 GiB/sec13.578  
 False
   11UniqueString10bytes/42.487 GiB/sec2.700 GiB/sec 8.596  
 False
   32   UniqueString10bytes/102.386 GiB/sec2.589 GiB/sec 8.481  
 False
   0UniqueString100bytes/4   24.419 GiB/sec   26.365 GiB/sec 7.966  
 False
   38  UniqueString100bytes/10   22.463 GiB/sec   24.128 GiB/sec 7.411  
 False
   34UniqueInt64/82.392 GiB/sec2.563 GiB/sec 7.157  
 False
   19UniqueString10bytes/1  781.817 MiB/sec  835.760 MiB/sec 6.900  
 False
   43UniqueInt64/32.184 GiB/sec2.331 GiB/sec 6.721  
 False
   24UniqueString10bytes/7  583.523 MiB/sec  621.007 MiB/sec 6.424  
 False
   15   UniqueString100bytes/71.936 GiB/sec2.024 GiB/sec 4.538  
 False
   6 UniqueString10bytes/2  780.337 MiB/sec  805.686 MiB/sec 3.248  
 False
   27   UniqueString100bytes/23.934 GiB/sec4.059 GiB/sec 3.197  
 False
   13   UniqueString100bytes/13.898 GiB/sec3.995 GiB/sec 2.485  
 False
   7 UniqueString10bytes/8  592.115 MiB/sec  604.865 MiB/sec 2.153  
 False
   29   UniqueString100bytes/81.969 GiB/sec2.011 GiB/sec 2.111  
 False
   21UniqueInt64/92.034 GiB/sec2.048 GiB/sec 0.676  
 False
   1 BuildStringDictionary   85.937 MiB/sec   85.928 MiB/sec-0.010  
 False
   41UniqueUInt8/3  449.171 MiB/sec  448.844 MiB/sec-0.073  
 False
   28   UniqueString100bytes/04.084 GiB/sec4.077 GiB/sec-0.161  
 False
   4UniqueString100bytes/34.255 GiB/sec4.235 GiB/sec-0.450  
 False
   5UniqueString100bytes/62.054 GiB/sec2.033 GiB/sec-1.041  
 False
   14   UniqueString100bytes/92.138 GiB/sec2.107 GiB/sec-1.449  
 False
   8 UniqueUInt8/01.777 GiB/sec1.750 GiB/sec-1.487  
 False
   23UniqueInt64/03.860 GiB/sec3.799 GiB/sec-1.560  
 False
   10UniqueString10bytes/9  616.458 MiB/sec  605.470 MiB/sec-1.782  
 False
   22UniqueString10bytes/3  799.494 MiB/sec  783.825 MiB/sec-1.960  
 False
   17UniqueString10bytes/6  647.921 MiB/sec  631.631 MiB/sec-2.514  
 False
   36  BuildDictionary1.539 GiB/sec1.498 GiB/sec-2.694  
 False
   16UniqueInt64/63.193 GiB/sec3.077 GiB/sec-3.634  
 False
   12UniqueString10bytes/0  881.975 MiB/sec  839.487 MiB/sec-4.817  
 False
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647823397


   Also on the binary size, these changes add about 75KB to libarrow.so. My 
guess is the difference is mostly coming from code inlining for the all-null 
case (which wasn't split out before)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487


   Here's a benchmark run with gcc-8 
   
   ```
   
---
   BenchmarkTime   CPU Iterations 
UserCounters...
   
---
   BuildDictionary2625315 ns2625247 ns271 
null_percent=0.91.4865GB/s
   BuildStringDictionary  3475855 ns3475854 ns200   86.8577MB/s
   UniqueInt64/0  9842842 ns9842834 ns 71 
null_percent=0 num_unique=10243.1749GB/s
   UniqueInt64/1 10617685 ns   10617360 ns 66 
null_percent=0.1 num_unique=1024   2.94329GB/s
   UniqueInt64/2 12648447 ns   12648430 ns 59 
null_percent=1 num_unique=1024   2.47066GB/s
   UniqueInt64/3 15365202 ns   15365113 ns 43 
null_percent=10 num_unique=1024   2.03383GB/s
   UniqueInt64/4  5126936 ns5126851 ns128 
null_percent=99 num_unique=1024   6.09536GB/s
   UniqueInt64/5  1763829 ns1763809 ns400 
null_percent=100 num_unique=1024   17.7173GB/s
   UniqueInt64/6 10545960 ns   10545841 ns 67 
null_percent=0 num_unique=10.24k   2.96325GB/s
   UniqueInt64/7 11478529 ns   11478403 ns 61 
null_percent=0.1 num_unique=10.24k2.7225GB/s
   UniqueInt64/8 12792912 ns   12792429 ns 54 
null_percent=1 num_unique=10.24k   2.44285GB/s
   UniqueInt64/9 16805938 ns   16805535 ns 44 
null_percent=10 num_unique=10.24k   1.85951GB/s
   UniqueInt64/10 5503266 ns5503108 ns114 
null_percent=99 num_unique=10.24k   5.67861GB/s
   UniqueInt64/11 1763742 ns1763699 ns392 
null_percent=100 num_unique=10.24k   17.7184GB/s
   UniqueString10bytes/0 44193582 ns   44191679 ns 16 
null_percent=0 num_unique=1024   905.148MB/s
   UniqueString10bytes/1 45022703 ns   45022263 ns 15 
null_percent=0.1 num_unique=1024   888.449MB/s
   UniqueString10bytes/2 47131705 ns   47130800 ns 15 
null_percent=1 num_unique=1024   848.702MB/s
   UniqueString10bytes/3 50106213 ns   50105455 ns 14 
null_percent=10 num_unique=1024   798.316MB/s
   UniqueString10bytes/4 15905586 ns   15905158 ns 43 
null_percent=99 num_unique=1024   2.45596GB/s
   UniqueString10bytes/5 12983446 ns   12983327 ns 55 
null_percent=100 num_unique=1024   3.00867GB/s
   UniqueString10bytes/6 62149404 ns   62148971 ns 11 
null_percent=0 num_unique=10.24k   643.615MB/s
   UniqueString10bytes/7 62707969 ns   62705282 ns 11 
null_percent=0.1 num_unique=10.24k   637.905MB/s
   UniqueString10bytes/8 65508665 ns   65508532 ns 10 
null_percent=1 num_unique=10.24k   610.607MB/s
   UniqueString10bytes/9 65766803 ns   65766094 ns 11 
null_percent=10 num_unique=10.24k   608.216MB/s
   UniqueString10bytes/1016297990 ns   16298076 ns 43 
null_percent=99 num_unique=10.24k   2.39676GB/s
   UniqueString10bytes/1113298987 ns   13298798 ns 54 
null_percent=100 num_unique=10.24k2.9373GB/s
   UniqueString100bytes/094204048 ns   94200614 ns  7 
null_percent=0 num_unique=1024   4.14674GB/s
   UniqueString100bytes/195631478 ns   95630838 ns  7 
null_percent=0.1 num_unique=1024   4.08472GB/s
   UniqueString100bytes/296547756 ns   96546348 ns  7 
null_percent=1 num_unique=1024   4.04598GB/s
   UniqueString100bytes/391950796 ns   91949032 ns  8 
null_percent=10 num_unique=1024   4.24828GB/s
   UniqueString100bytes/417292562 ns   17291979 ns 42 
null_percent=99 num_unique=1024 22.59GB/s
   UniqueString100bytes/513096944 ns   13096809 ns 55 
null_percent=100 num_unique=102429.826GB/s
   UniqueString100bytes/6   196165738 ns  196161451 ns  4 
null_percent=0 num_unique=10.24k   1.99134GB/s
   UniqueString100bytes/7   198475556 ns  198475456 ns  4 
null_percent=0.1 num_unique=10.24k   1.96813GB/s
   UniqueString100bytes/8   199273625 ns  199270358 ns  3 
null_percent=1 num_unique=10.24k   1.96028GB/s
   UniqueString100bytes/9   189235180 ns  189232925 ns  4 
null_percent=10 num_unique=10.24k   2.06425GB/s
   UniqueString100bytes/10   18381309 ns   18381409 ns 36 
null_percent=99 num_unique=10.24k   21.2511GB/s
   UniqueString100bytes/11   13426102 ns   13426072 ns 51 
null_percent=100 num_unique=10.24k   29.0945GB/s
   UniqueUInt8/0  2239549 ns2239561 ns309 
null_percent=0 num_unique=2001.7442GB/s
   UniqueUInt8/1  2687371 ns2687349 ns248 
null_percent=0.1 num_unique=200   1.45357GB/s
   UniqueUInt8/2  4244052 ns4244058 ns166 

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487


   Here's a benchmark run with gcc-8 
   
   ```
   ---
   BenchmarkTime   CPU Iterations
   ---
   BuildDictionary3219443 ns3219440 ns218   1.21215GB/s
   BuildStringDictionary  3692881 ns3692881 ns192   81.7532MB/s
   UniqueInt64/0 14413456 ns   14413251 ns 48 
null_percent=0   2.16814GB/s
   UniqueInt64/1 15516052 ns   15515737 ns 45 
null_percent=0.1   2.01408GB/s
   UniqueInt64/2 17031282 ns   17031266 ns 41 
null_percent=1   1.83486GB/s
   UniqueInt64/3 20680114 ns   20680064 ns 34 
null_percent=10   1.51112GB/s
   UniqueInt64/4 12018069 ns   12017844 ns 57 
null_percent=992.6003GB/s
   UniqueInt64/5  9179953 ns9179946 ns 77 
null_percent=100   3.40416GB/s
   UniqueInt64/6 15501523 ns   15501496 ns 45 
null_percent=0   2.01593GB/s
   UniqueInt64/7 16482935 ns   16482300 ns 41 
null_percent=0.1   1.89597GB/s
   UniqueInt64/8 18349988 ns   18349317 ns 38 
null_percent=1   1.70306GB/s
   UniqueInt64/9 21439268 ns   21439244 ns 32 
null_percent=10   1.45761GB/s
   UniqueInt64/1012530067 ns   12529871 ns 55 
null_percent=99   2.49404GB/s
   UniqueInt64/11 9167314 ns9167365 ns 75 
null_percent=100   3.40883GB/s
   UniqueString10bytes/0 43535899 ns   43535846 ns 16 
null_percent=0   918.783MB/s
   UniqueString10bytes/1 45130595 ns   45129634 ns 16 
null_percent=0.1   886.336MB/s
   UniqueString10bytes/2 45249034 ns   45247983 ns 15 
null_percent=1   884.017MB/s
   UniqueString10bytes/3 45101533 ns   45100209 ns 16 
null_percent=10   886.914MB/s
   UniqueString10bytes/4  4316048 ns4316019 ns163 
null_percent=99   9.05059GB/s
   UniqueString10bytes/5  1435781 ns1435763 ns485 
null_percent=100   27.2068GB/s
   UniqueString10bytes/6 59100344 ns   59098817 ns 12 
null_percent=0   676.832MB/s
   UniqueString10bytes/7 59797544 ns   59795857 ns 12 
null_percent=0.1   668.943MB/s
   UniqueString10bytes/8 61024697 ns   61023090 ns 11 
null_percent=1655.49MB/s
   UniqueString10bytes/9 59817211 ns   59816339 ns 12 
null_percent=10   668.714MB/s
   UniqueString10bytes/10 4950387 ns4950242 ns134 
null_percent=99   7.89103GB/s
   UniqueString10bytes/11 1443482 ns1443434 ns446 
null_percent=100   27.0622GB/s
   UniqueString100bytes/095609006 ns   95606132 ns  7 
null_percent=0   4.08577GB/s
   UniqueString100bytes/196850582 ns   96849441 ns  7 
null_percent=0.1   4.03332GB/s
   UniqueString100bytes/295404742 ns   95404634 ns  7 
null_percent=14.0944GB/s
   UniqueString100bytes/389401775 ns   89401006 ns  8 
null_percent=10   4.36936GB/s
   UniqueString100bytes/4 4705868 ns4705746 ns148 
null_percent=99   83.0102GB/s
   UniqueString100bytes/5 1434077 ns1434055 ns486 
null_percent=100   272.392GB/s
   UniqueString100bytes/6   206155133 ns  206148425 ns  3 
null_percent=0   1.89487GB/s
   UniqueString100bytes/7   204661287 ns  204653659 ns  3 
null_percent=0.1   1.90871GB/s
   UniqueString100bytes/8   205941884 ns  205941271 ns  3 
null_percent=1   1.89678GB/s
   UniqueString100bytes/9   192074501 ns  192073431 ns  4 
null_percent=10   2.03373GB/s
   UniqueString100bytes/106180349 ns6180227 ns111 
null_percent=99   63.2056GB/s
   UniqueString100bytes/111474565 ns1474564 ns482 
null_percent=100   264.909GB/s
   UniqueUInt8/0  1990025 ns1990023 ns348 
null_percent=0   1.96292GB/s
   UniqueUInt8/1  2594146 ns2594089 ns272 
null_percent=0.1   1.50583GB/s
   UniqueUInt8/2  4726027 ns4726053 ns145 
null_percent=1   846.372MB/s
   UniqueUInt8/3  9465222 ns9465126 ns 75 
null_percent=10   422.604MB/s
   UniqueUInt8/4  3557141 ns3557135 ns195 
null_percent=991124.5MB/s
   UniqueUInt8/5  2259664 ns2259664 ns314 
null_percent=100   1.72869GB/s
   ```
   
   (I need to add "num_unique" to `state.counters` -- there are two different 
cardinality cases represented here)
   
   Here is the % diff versus the baseline. 
   
   * Cases 1 and 7 are the mostly-not-null cases. This shows a 15-20% perf 
improvement
   * Cases 5 and 11 are the all-null cases.
   * Case 4 and 10 are the 99% null cases
   * The "BuildDictionary" case at the 

[GitHub] [arrow] wesm edited a comment on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647821487


   Here's a benchmark run with gcc-8 
   
   ```
   ---
   BenchmarkTime   CPU Iterations
   ---
   BuildDictionary3219443 ns3219440 ns218   1.21215GB/s
   BuildStringDictionary  3692881 ns3692881 ns192   81.7532MB/s
   UniqueInt64/0 14413456 ns   14413251 ns 48 
null_percent=0   2.16814GB/s
   UniqueInt64/1 15516052 ns   15515737 ns 45 
null_percent=0.1   2.01408GB/s
   UniqueInt64/2 17031282 ns   17031266 ns 41 
null_percent=1   1.83486GB/s
   UniqueInt64/3 20680114 ns   20680064 ns 34 
null_percent=10   1.51112GB/s
   UniqueInt64/4 12018069 ns   12017844 ns 57 
null_percent=992.6003GB/s
   UniqueInt64/5  9179953 ns9179946 ns 77 
null_percent=100   3.40416GB/s
   UniqueInt64/6 15501523 ns   15501496 ns 45 
null_percent=0   2.01593GB/s
   UniqueInt64/7 16482935 ns   16482300 ns 41 
null_percent=0.1   1.89597GB/s
   UniqueInt64/8 18349988 ns   18349317 ns 38 
null_percent=1   1.70306GB/s
   UniqueInt64/9 21439268 ns   21439244 ns 32 
null_percent=10   1.45761GB/s
   UniqueInt64/1012530067 ns   12529871 ns 55 
null_percent=99   2.49404GB/s
   UniqueInt64/11 9167314 ns9167365 ns 75 
null_percent=100   3.40883GB/s
   UniqueString10bytes/0 43535899 ns   43535846 ns 16 
null_percent=0   918.783MB/s
   UniqueString10bytes/1 45130595 ns   45129634 ns 16 
null_percent=0.1   886.336MB/s
   UniqueString10bytes/2 45249034 ns   45247983 ns 15 
null_percent=1   884.017MB/s
   UniqueString10bytes/3 45101533 ns   45100209 ns 16 
null_percent=10   886.914MB/s
   UniqueString10bytes/4  4316048 ns4316019 ns163 
null_percent=99   9.05059GB/s
   UniqueString10bytes/5  1435781 ns1435763 ns485 
null_percent=100   27.2068GB/s
   UniqueString10bytes/6 59100344 ns   59098817 ns 12 
null_percent=0   676.832MB/s
   UniqueString10bytes/7 59797544 ns   59795857 ns 12 
null_percent=0.1   668.943MB/s
   UniqueString10bytes/8 61024697 ns   61023090 ns 11 
null_percent=1655.49MB/s
   UniqueString10bytes/9 59817211 ns   59816339 ns 12 
null_percent=10   668.714MB/s
   UniqueString10bytes/10 4950387 ns4950242 ns134 
null_percent=99   7.89103GB/s
   UniqueString10bytes/11 1443482 ns1443434 ns446 
null_percent=100   27.0622GB/s
   UniqueString100bytes/095609006 ns   95606132 ns  7 
null_percent=0   4.08577GB/s
   UniqueString100bytes/196850582 ns   96849441 ns  7 
null_percent=0.1   4.03332GB/s
   UniqueString100bytes/295404742 ns   95404634 ns  7 
null_percent=14.0944GB/s
   UniqueString100bytes/389401775 ns   89401006 ns  8 
null_percent=10   4.36936GB/s
   UniqueString100bytes/4 4705868 ns4705746 ns148 
null_percent=99   83.0102GB/s
   UniqueString100bytes/5 1434077 ns1434055 ns486 
null_percent=100   272.392GB/s
   UniqueString100bytes/6   206155133 ns  206148425 ns  3 
null_percent=0   1.89487GB/s
   UniqueString100bytes/7   204661287 ns  204653659 ns  3 
null_percent=0.1   1.90871GB/s
   UniqueString100bytes/8   205941884 ns  205941271 ns  3 
null_percent=1   1.89678GB/s
   UniqueString100bytes/9   192074501 ns  192073431 ns  4 
null_percent=10   2.03373GB/s
   UniqueString100bytes/106180349 ns6180227 ns111 
null_percent=99   63.2056GB/s
   UniqueString100bytes/111474565 ns1474564 ns482 
null_percent=100   264.909GB/s
   UniqueUInt8/0  1990025 ns1990023 ns348 
null_percent=0   1.96292GB/s
   UniqueUInt8/1  2594146 ns2594089 ns272 
null_percent=0.1   1.50583GB/s
   UniqueUInt8/2  4726027 ns4726053 ns145 
null_percent=1   846.372MB/s
   UniqueUInt8/3  9465222 ns9465126 ns 75 
null_percent=10   422.604MB/s
   UniqueUInt8/4  3557141 ns3557135 ns195 
null_percent=991124.5MB/s
   UniqueUInt8/5  2259664 ns2259664 ns314 
null_percent=100   1.72869GB/s
   ```
   
   Here is the % diff versus the baseline. 
   
   * Cases 1 and 7 are the mostly-not-null cases. This shows a 15-20% perf 
improvement
   * Cases 5 and 11 are the all-null cases.
   * Case 4 and 10 are the 99% null cases
   * The "BuildDictionary" case at the bottom with the perf regression is one 
of the "worst case scenarios". 89% of the values are null and so we almost 
never 

[GitHub] [arrow] github-actions[bot] commented on pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7521:
URL: https://github.com/apache/arrow/pull/7521#issuecomment-647820320


   https://issues.apache.org/jira/browse/ARROW-9210



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7521: ARROW-9210: [C++] Use BitBlockCounter in array/visitor_inline.h

2020-06-22 Thread GitBox


wesm opened a new pull request #7521:
URL: https://github.com/apache/arrow/pull/7521


   This significantly speeds up processing of mostly-not-null or mostly-null 
data, while having almost no overhead for the other scenarios where you rarely 
have a word-sized run of all-not-null or all-null-data. For data with 
null_count 0, data is processed in blocks of INT16_MAX values at a time, so 
this adds no meaningful overhead for this case either. 
   
   I modified the hash benchmarks where this code is used to exhibit both the 
cases that benefit from this optimization as well as the ones that don't. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


kszucs edited a comment on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647798092


   Using pandas is not a problem, but the results cannot be improved much other 
than sorting the table.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


kszucs commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647798092


   Using pandas is not a problem, but other than sorting the results we cannot 
really improve the look and feel. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7520:
URL: https://github.com/apache/arrow/pull/7520#issuecomment-647796781


   https://issues.apache.org/jira/browse/ARROW-9189



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson opened a new pull request #7520: ARROW-9189: [Website] Improve contributor guide

2020-06-22 Thread GitBox


nealrichardson opened a new pull request #7520:
URL: https://github.com/apache/arrow/pull/7520


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou closed pull request #7508: ARROW-9202: [GLib] Add GArrowDatum

2020-06-22 Thread GitBox


kou closed pull request #7508:
URL: https://github.com/apache/arrow/pull/7508


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647794119


   Revision: 56a5e1ce92e541cfaac827d5334cbe8981b1b74b
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-359](https://github.com/ursa-labs/crossbow/branches/all?query=actions-359)
   
   |Task|Status|
   ||--|
   
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-359-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7508: ARROW-9202: [GLib] Add GArrowDatum

2020-06-22 Thread GitBox


kou commented on pull request #7508:
URL: https://github.com/apache/arrow/pull/7508#issuecomment-647793770


   +1



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


kou commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647793370


   @github-actions crossbow submit debian-buster-arm64



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox


kou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647791287


   Ah, we need the `UTF8PROC_STATIC` definition.
   
   Does this work?
   
   ```diff
   diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake 
b/cpp/cmake_modules/ThirdpartyToolchain.cmake
   index 5ab29cf2c..aa2bbf0dd 100644
   --- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
   +++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
   @@ -2060,7 +2060,9 @@ macro(build_utf8proc)
  set_target_properties(utf8proc::utf8proc
PROPERTIES IMPORTED_LOCATION 
"${UTF8PROC_STATIC_LIB}"
   INTERFACE_INCLUDE_DIRECTORIES
   -   "${UTF8PROC_PREFIX}/include")
   +   "${UTF8PROC_PREFIX}/include"
   +   INTERFACE_COMPILER_DEFINITIONS
   +   "UTF8PROC_STATIC")

  add_dependencies(toolchain utf8proc_ep)
  add_dependencies(utf8proc::utf8proc utf8proc_ep)
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alippai commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-22 Thread GitBox


alippai commented on pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#issuecomment-647789155


   Does "for testing and benchmarking" mean that it's not optimal to store 
arrow/parquet files on Minio for production workloads?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on a change in pull request #7507: ARROW-8797: [C++] [WIP] Create test to receive RecordBatch for different endian

2020-06-22 Thread GitBox


kou commented on a change in pull request #7507:
URL: https://github.com/apache/arrow/pull/7507#discussion_r443832986



##
File path: cpp/src/arrow/ipc/read_write_test.cc
##
@@ -409,6 +414,32 @@ class IpcTestFixture : public io::MemoryMapFixture, public 
ExtensionTypesMixin {
 CheckRoundtrip(*batch, options, IpcReadOptions::Defaults(), buffer_size);
   }
 
+  void CheckRoundTrip(const RecordBatch& src_batch,
+  const RecordBatch& expected_batch,
+  IpcWriteOptions options = IpcWriteOptions::Defaults(),
+  IpcReadOptions read_options = IpcReadOptions::Defaults(),
+  int64_t buffer_size = 1 << 20) {
+std::stringstream ss;
+ss << "test-write-row-batch-" << g_file_number++;
+ASSERT_OK_AND_ASSIGN(mmap_,
+ io::MemoryMapFixture::InitMemoryMap(buffer_size, 
ss.str()));
+
+DictionaryMemo dictionary_memo;
+
+std::shared_ptr schema_result;
+DoSchemaRoundTrip(*src_batch.schema(), _memo, _result);
+ASSERT_TRUE(expected_batch.schema()->Equals(*schema_result));

Review comment:
   Should we always convert schema's endianness to native endian?
   I think that we should keep the original schema's endianness for 
serialization.
   Did we discuss this?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


wesm edited a comment on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647758158


   I’m sort of -1 on using anything but pandas for data munging and data 
presentation in our tooling. It’s not a very large dependency and has 
everything we need. FWIW, the current Ursabot output doesn't even sort the 
results, which is really needed to easily make sense of what got faster or 
slower at a glance. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7452: ARROW-8961: [C++] Add utf8proc library to toolchain

2020-06-22 Thread GitBox


wesm commented on pull request #7452:
URL: https://github.com/apache/arrow/pull/7452#issuecomment-647763853


   @xhochy I'm fine with merging this -- it's a non-mandatory dependency right 
(sorry have not reviewed the patch yet and the PR description does not say)?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647759816


   Revision: 77bc06d2ff96b41dd1e37838601f92fd0e95e7e2
   
   Submitted crossbow builds: [ursa-labs/crossbow @ 
actions-358](https://github.com/ursa-labs/crossbow/branches/all?query=actions-358)
   
   |Task|Status|
   ||--|
   
|debian-buster-arm64|[![TravisCI](https://img.shields.io/travis/ursa-labs/crossbow/actions-358-travis-debian-buster-arm64.svg)](https://travis-ci.org/ursa-labs/crossbow/branches)|



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kou commented on pull request #7509: ARROW-9203: [Packaging][deb] Add missing gir1.2-arrow-dataset-1.0.install

2020-06-22 Thread GitBox


kou commented on pull request #7509:
URL: https://github.com/apache/arrow/pull/7509#issuecomment-647758878


   @github-actions crossbow submit debian-buster-arm64



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


wesm commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647758158


   I’m sort of -1 on using anything but pandas for data munging and data 
presentation in our tooling. It’s not a very large dependency and has 
everything we need. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings [WIP]

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7519:
URL: https://github.com/apache/arrow/pull/7519#issuecomment-647754455


   https://issues.apache.org/jira/browse/ARROW-9153



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs opened a new pull request #7519: ARROW-9153: [C++][Python] Refactor scalar bindings [WIP]

2020-06-22 Thread GitBox


kszucs opened a new pull request #7519:
URL: https://github.com/apache/arrow/pull/7519


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] maartenbreddels commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

2020-06-22 Thread GitBox


maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647711867


   @xhochy I see at least an `undefined reference to `_imp__utf8proc_toupper'` 
in the MinGW builds, hope that helps.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7518: ARROW-9138: [Docs][Format] Make sure format version is hard coded in the docs

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7518:
URL: https://github.com/apache/arrow/pull/7518#issuecomment-647699885


   https://issues.apache.org/jira/browse/ARROW-9138



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7517: ARROW-1682: [Doc] Expand S3/MinIO fileystem dataset documentation

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7517:
URL: https://github.com/apache/arrow/pull/7517#issuecomment-647699887


   https://issues.apache.org/jira/browse/ARROW-1682



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


fsaintjacques commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647698721


   ursabot uses `tabulate` which I think is smaller dependencies.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-647699240


   Here's with clang, even less of an issue there it seems
   
   ```
   benchmark baselinecontender  
change %  regression
   32  BM_ReadColumn/1/11.656 GiB/sec2.689 GiB/sec  
  62.384   False
   15 BM_ReadColumn/99/02.861 GiB/sec4.184 GiB/sec  
  46.245   False
   21BM_ReadColumn/99/501.890 GiB/sec2.744 GiB/sec  
  45.220   False
   20BM_ReadColumn/99/502.898 GiB/sec4.199 GiB/sec  
  44.891   False
   6  BM_ReadColumn/50/11.530 GiB/sec2.210 GiB/sec  
  44.402   False
   30BM_ReadColumn/50/501.586 GiB/sec2.233 GiB/sec  
  40.787   False
   28 BM_ReadColumn/50/0  881.102 MiB/sec1.196 GiB/sec  
  39.025   False
   22BM_ReadColumn/45/251.598 GiB/sec2.221 GiB/sec  
  38.951   False
   14  BM_ReadColumn/1/12.967 GiB/sec4.122 GiB/sec  
  38.919   False
   29BM_ReadColumn/50/50  887.206 MiB/sec1.203 GiB/sec  
  38.854   False
   35 BM_ReadColumn/99/01.928 GiB/sec2.661 GiB/sec  
  38.045   False
   17BM_ReadColumn/35/101.656 GiB/sec2.248 GiB/sec  
  35.689   False
   0BM_ReadColumn/-1/502.765 GiB/sec3.750 GiB/sec  
  35.628   False
   26BM_ReadColumn/30/101.692 GiB/sec2.221 GiB/sec  
  31.244   False
   19 BM_ReadColumn/75/11.712 GiB/sec2.217 GiB/sec  
  29.505   False
   2  BM_ReadColumn/25/5  991.022 MiB/sec1.229 GiB/sec  
  27.031   False
   31BM_ReadColumn/25/101.744 GiB/sec2.164 GiB/sec  
  24.091   False
   16   BM_ReadColumn/-1/505.892 GiB/sec7.252 GiB/sec  
  23.084   False
   24BM_ReadColumn/10/101.003 GiB/sec1.225 GiB/sec  
  22.073   False
   23   BM_ReadColumn/25/251.790 GiB/sec2.159 GiB/sec  
  20.659   False
   10  BM_ReadColumn/5/51.995 GiB/sec2.371 GiB/sec  
  18.853   False
   5  BM_ReadColumn/10/51.796 GiB/sec2.122 GiB/sec  
  18.128   False
   3   BM_ReadColumn/-1/203.488 GiB/sec4.062 GiB/sec  
  16.453   False
   1BM_ReadColumn/-1/102.740 GiB/sec3.055 GiB/sec  
  11.497   False
   11   BM_ReadColumn/-1/101.362 GiB/sec1.486 GiB/sec  
   9.086   False
   9BM_ReadColumn/-1/0   12.984 GiB/sec   13.310 GiB/sec  
   2.509   False
   13   BM_ReadColumn/5/10  515.781 MiB/sec  523.838 MiB/sec  
   1.562   False
   7 BM_ReadColumn/-1/14.505 GiB/sec4.559 GiB/sec  
   1.198   False
   34  BM_ReadColumn/1/20  400.254 MiB/sec  403.780 MiB/sec  
   0.881   False
   25  BM_ReadColumn/-1/0  870.109 MiB/sec  875.827 MiB/sec  
   0.657   False
   18BM_ReadColumn/-1/18.237 GiB/sec8.045 GiB/sec  
  -2.331   False
   27   BM_ReadColumn/-1/1  816.651 MiB/sec  793.763 MiB/sec  
  -2.803   False
   8BM_ReadColumn/10/501.649 GiB/sec1.587 GiB/sec  
  -3.792   False
   12 BM_ReadColumn/-1/03.044 GiB/sec1.432 GiB/sec  
 -52.958True
   33BM_ReadColumn/-1/03.161 GiB/sec1.479 GiB/sec  
 -53.209True
   4  BM_ReadColumn/-1/01.863 GiB/sec  754.619 MiB/sec  
 -60.448True
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm closed pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox


wesm closed pull request #7506:
URL: https://github.com/apache/arrow/pull/7506


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-647694302


   Here's what I see with gcc-8:
   
   ```
   benchmark  baseline 
contender  change %  regression
   1   BM_ReadColumn/1/1 1.336 GiB/sec 1.880 
GiB/sec40.701   False
   17  BM_ReadColumn/1/1 2.224 GiB/sec 2.926 
GiB/sec31.567   False
   14 BM_ReadColumn/99/0 1.599 GiB/sec 2.001 
GiB/sec25.128   False
   23BM_ReadColumn/99/50 1.601 GiB/sec 1.990 
GiB/sec24.300   False
   21BM_ReadColumn/50/50  1002.325 MiB/sec 1.206 
GiB/sec23.171   False
   12BM_ReadColumn/50/50   619.919 MiB/sec   758.845 
MiB/sec22.410   False
   33BM_ReadColumn/45/25  1016.908 MiB/sec 1.212 
GiB/sec22.095   False
   30 BM_ReadColumn/50/0   620.068 MiB/sec   757.008 
MiB/sec22.085   False
   34  BM_ReadColumn/5/5 1.586 GiB/sec 1.935 
GiB/sec21.981   False
   27 BM_ReadColumn/99/0 2.539 GiB/sec 3.081 
GiB/sec21.308   False
   13 BM_ReadColumn/10/5 1.361 GiB/sec 1.644 
GiB/sec20.791   False
   18 BM_ReadColumn/50/1 1.002 GiB/sec 1.204 
GiB/sec20.174   False
   26 BM_ReadColumn/75/1 1.182 GiB/sec 1.419 
GiB/sec20.005   False
   15BM_ReadColumn/99/50 2.532 GiB/sec 3.036 
GiB/sec19.890   False
   11BM_ReadColumn/30/10 1.123 GiB/sec 1.322 
GiB/sec17.754   False
   35BM_ReadColumn/25/10 1.187 GiB/sec 1.396 
GiB/sec17.558   False
   25 BM_ReadColumn/25/5   741.612 MiB/sec   870.530 
MiB/sec17.384   False
   3 BM_ReadColumn/10/10   859.032 MiB/sec  1005.688 
MiB/sec17.072   False
   31BM_ReadColumn/35/10 1.095 GiB/sec 1.254 
GiB/sec14.528   False
   20   BM_ReadColumn/-1/1   712.571 MiB/sec   808.480 
MiB/sec13.460   False
   5BM_ReadColumn/25/25 1.244 GiB/sec 1.399 
GiB/sec12.397   False
   9BM_ReadColumn/5/10   460.741 MiB/sec   484.654 
MiB/sec 5.190   False
   7   BM_ReadColumn/1/20   397.103 MiB/sec   412.517 
MiB/sec 3.882   False
   29  BM_ReadColumn/-1/0   984.767 MiB/sec   994.483 
MiB/sec 0.987   False
   19   BM_ReadColumn/-1/013.609 GiB/sec13.738 
GiB/sec 0.944   False
   28BM_ReadColumn/-1/1 4.910 GiB/sec 4.702 
GiB/sec-4.236   False
   8 BM_ReadColumn/-1/1 8.631 GiB/sec 8.000 
GiB/sec-7.310True
   6BM_ReadColumn/-1/10 2.872 GiB/sec 2.329 
GiB/sec   -18.912True
   32   BM_ReadColumn/-1/10 1.457 GiB/sec 1.099 
GiB/sec   -24.615True
   22   BM_ReadColumn/10/50 1.433 GiB/sec   964.501 
MiB/sec   -34.267True
   24 BM_ReadColumn/-1/0 2.325 GiB/sec 1.400 
GiB/sec   -39.804True
   2 BM_ReadColumn/-1/0 2.412 GiB/sec 1.422 
GiB/sec   -41.059True
   10 BM_ReadColumn/-1/0 1.454 GiB/sec   864.019 
MiB/sec   -41.959True
   16  BM_ReadColumn/-1/20 3.826 GiB/sec 2.184 
GiB/sec   -42.899True
   0BM_ReadColumn/-1/50 4.669 GiB/sec 1.865 
GiB/sec   -60.055True
   4BM_ReadColumn/-1/50 3.184 GiB/sec   862.581 
MiB/sec   -73.546True
   ```
   
   I agree that the alternating NA/not-NA is pretty pathological so IMHO this 
performance regression doesn't seem like a big deal to me. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #7504: ARROW-9193: [C++] Avoid spurious intermediate string copy in ToDateHolder

2020-06-22 Thread GitBox


pitrou closed pull request #7504:
URL: https://github.com/apache/arrow/pull/7504


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson opened a new pull request #7518: ARROW-9138: [Docs][Format] Make sure format version is hard coded in the docs

2020-06-22 Thread GitBox


nealrichardson opened a new pull request #7518:
URL: https://github.com/apache/arrow/pull/7518


   Since after 1.0, the library versions will continue to increase but the 
format will remain at 1.0 until it is modified, we need to version the 
specification docs separately from the library version that shows at the top of 
the sidebar.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] projjal commented on pull request #7504: ARROW-9193: [C++] Avoid spurious intermediate string copy in ToDateHolder

2020-06-22 Thread GitBox


projjal commented on pull request #7504:
URL: https://github.com/apache/arrow/pull/7504#issuecomment-647691097


   > Any reason not to use a `util::string_view`?
   
   Since the toDate function here takes a char array and just passes it onto 
the parseTimestamp method, using a util::string_view is probably not required.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-647690792


   I rebased. I'm going to look at the benchmarks locally and then probably 
merge this so further performance work (including comparing with 
BitBlockCounter) can be pursued as follow up



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7435: ARROW-8779: [R] Implement conversion to List

2020-06-22 Thread GitBox


nealrichardson commented on a change in pull request #7435:
URL: https://github.com/apache/arrow/pull/7435#discussion_r443735874



##
File path: r/src/array_from_vector.cpp
##
@@ -201,6 +202,67 @@ struct VectorToArrayConverter {
 return Status::OK();
   }
 
+  template 
+  arrow::enable_if_t::value, Status> Visit(const T& type) {
+using BuilderType = typename TypeTraits::BuilderType;
+ARROW_RETURN_IF(!Rf_inherits(x, "data.frame"),
+Status::RError("Expecting a data frame"));
+
+auto* struct_builder = checked_cast(builder);
+
+int64_t n = vctrs::short_vec_size(x);

Review comment:
   See also https://jira.apache.org/jira/browse/ARROW-9028 and some skipped 
tests about tables with 0 columns





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7514: ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-22 Thread GitBox


nealrichardson commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r443733083



##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr 
InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-if (XLENGTH(x) == 0) {
-  Rcpp::stop(
-  "Requires at least one element to infer the values' type of a list 
vector");
-}
+SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+if (ptype == R_NilValue) {

Review comment:
   is this the same as `Rf_isNull()` (which I think we use more generally)? 

##
File path: r/src/array_from_vector.cpp
##
@@ -1067,12 +1110,22 @@ std::shared_ptr 
InferArrowTypeFromVector(SEXP x) {
   if (Rf_inherits(x, "data.frame")) {
 return InferArrowTypeFromDataFrame(x);
   } else {
-if (XLENGTH(x) == 0) {
-  Rcpp::stop(
-  "Requires at least one element to infer the values' type of a list 
vector");
-}
+SEXP ptype = Rf_getAttrib(x, symbols::ptype);
+if (ptype == R_NilValue) {
+  if (XLENGTH(x) == 0) {
+Rcpp::stop(
+"Requires at least one element to infer the values' type of a list 
vector");
+  }
 
-return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+  return arrow::list(InferArrowType(VECTOR_ELT(x, 0)));
+} else {
+  // special case list(raw()) -> BinaryArray
+  if (TYPEOF(ptype) == RAWSXP) {
+return arrow::binary();
+  }
+
+  return arrow::list(InferArrowType(ptype));

Review comment:
   If I read this correctly, you're introducing general support for 
`vctrs::list_of ptype` here? IIUC then it would make sense for the ListArray to 
R vector conversion to set that class and attribute too then, right? (If so, 
feel free to add that here or defer to a followup)

##
File path: r/src/array_to_vector.cpp
##
@@ -693,6 +741,9 @@ std::shared_ptr Converter::Make(const 
std::shared_ptr& type
 case Type::BOOL:
   return std::make_shared(std::move(arrays));
 
+case Type::BINARY:

Review comment:
   See also https://jira.apache.org/jira/browse/ARROW-6543





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs edited a comment on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


kszucs edited a comment on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647677894


   > @kszucs can you assist me with adapting ursabot for these changes?
   
   Sure.
   
   > I think we can use pandas's `DataFrame.to_html` to create a colorized 
table for GitHub, too 
https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
   
   I'm afraid this is not going to work, because we can't embed any CSS into 
the comment, this is why we generate the ursabot responses as diffs.
   
   > 
   > Changes that would be good to have in `ursabot benchmark`:
   > 
   > * Pass through `--cc` and `--cxx` options
   > * Pass through `--repetitions`
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] kszucs commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


kszucs commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647677894


   > @kszucs can you assist me with adapting ursabot for these changes?
   Sure.
   > I think we can use pandas's `DataFrame.to_html` to create a colorized 
table for GitHub, too 
https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
   I'm afraid this is not going to work, because we can't embed any CSS into 
the comment, this is why we generate the ursabot responses as diffs.
   > 
   > Changes that would be good to have in `ursabot benchmark`:
   > 
   > * Pass through `--cc` and `--cxx` options
   > * Pass through `--repetitions`
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7143: ARROW-8504: [C++] Add BitRunReader and use it in parquet

2020-06-22 Thread GitBox


wesm commented on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-647674585


   @emkornfield I'm sorry that I've been neglecting this PR. I will try to 
rebase this and investigate the perf questions a little bit



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


pitrou commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647673667


   Just a small question: why are `m` and `b` used for millions and billions, 
respectively? (I would probably expect `M` and `G`)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on a change in pull request #7477: ARROW-4221: Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


pitrou commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443707259



##
File path: python/pyarrow/tensor.pxi
##
@@ -339,6 +350,15 @@ shape: {0.shape}""".format(self)
 def non_zero_length(self):
 return self.stp.non_zero_length()
 
+@property
+def has_canonical_format(self):
+cdef:
+_CSparseCOOIndexPtr csi
+
+csi = dynamic_cast[_CSparseCOOIndexPtr](self.stp.sparse_index().get())

Review comment:
   Why the `dynamic_cast`? Since this is a sparse COO tensor, the cast 
should always succeed. Just use a normal Cython cast.

##
File path: format/SparseTensor.fbs
##
@@ -63,6 +63,9 @@ table SparseTensorIndexCOO {
 
   /// The location and size of the indices matrix's data
   indicesBuffer: Buffer (required);
+
+  /// The canonicality flag

Review comment:
   What is this? Can you explain a bit more?

##
File path: python/pyarrow/tensor.pxi
##
@@ -270,8 +279,10 @@ shape: {0.shape}""".format(self)
   _data, _coords))
 data = PyObject_to_object(out_data)
 coords = PyObject_to_object(out_coords)
-result = coo_matrix((data[:, 0], (coords[:, 0], coords[:, 1])),
-shape=self.shape)
+row, col = coords[:, 0], coords[:, 1]
+result = coo_matrix((data[:, 0], (row, col)), shape=self.shape)
+if self.has_canonical_format:
+result.sum_duplicates()

Review comment:
   Why?

##
File path: cpp/src/arrow/sparse_tensor_test.cc
##
@@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const 
std::shared_ptr& sidx, const int
   }
 }
 
-TEST(TestSparseCOOIndex, Make) {
+//-

Review comment:
   Perhaps there should also be some tests with empty/all-zero tensors?

##
File path: python/pyarrow/tensor.pxi
##
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
 for x in dim_names:
 c_dim_names.push_back(tobytes(x))
 
-coords = np.vstack([obj.row, obj.col]).T
+row = obj.row
+col = obj.col
+if obj.has_canonical_format:
+order = np.lexsort((col, row))  # row-major order

Review comment:
   I don't understand this. If Scipy says the indices are sorted, why do 
you need to sort them again?

##
File path: cpp/src/arrow/sparse_tensor.cc
##
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const 
std::shared_ptr& type,
   return Status::OK();
 }
 
+void get_coo_index_tensor_row(const std::shared_ptr& coords, const 
int64_t row,
+  std::vector* out_index) {
+  const auto& fw_index_value_type =
+  internal::checked_cast(*coords->type());
+  const size_t indices_elsize = fw_index_value_type.bit_width() / CHAR_BIT;
+
+  const auto& shape = coords->shape();
+  const int64_t non_zero_length = shape[0];
+  DCHECK(0 <= row && row < non_zero_length);
+
+  const int64_t ndim = shape[1];
+  std::vector index;
+  index.reserve(ndim);

Review comment:
   You can write directly into `out_index`:
   ```c++
   out_index->resize(ndim);
   // ...
 for (int64_t i = 0; i < ndim; ++i) {
   (*out_index)[i] = 
static_cast(coords->Value({row, i}));
 }
   ```

##
File path: python/pyarrow/tensor.pxi
##
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
 for x in dim_names:
 c_dim_names.push_back(tobytes(x))
 
-coords = np.vstack([obj.row, obj.col]).T
+row = obj.row
+col = obj.col
+if obj.has_canonical_format:

Review comment:
   Does this attribute exist? I don't see it in 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
   In 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix,
 I see an attribute named `has_sorted_indices`

##
File path: cpp/src/arrow/sparse_tensor.cc
##
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const 
std::shared_ptr& type,
   return Status::OK();
 }
 
+void get_coo_index_tensor_row(const std::shared_ptr& coords, const 
int64_t row,

Review comment:
   Can you follow the style guide? Use CamelCase for function names.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox


wesm commented on pull request #7506:
URL: https://github.com/apache/arrow/pull/7506#issuecomment-647649670


   For curiosity, here are the same benchmarks on my laptop with gcc-8 (using 
the new output formatting from ARROW-9201)
   
   ```
 benchmarkbaseline   
contender  change %  regression
   46   CastUInt32ToInt32Safe/262144/1  994.591m items/sec4.487b 
items/sec   351.118   False
   31CastUInt32ToInt32Safe/32768/1  932.612m items/sec3.622b 
items/sec   288.383   False
   52CastInt64ToInt32Safe/262144/11.258b items/sec3.393b 
items/sec   169.693   False
   38 CastInt64ToInt32Safe/32768/11.170b items/sec2.703b 
items/sec   130.930   False
   26CastUInt32ToInt32Safe/262144/1000  742.665m items/sec1.217b 
items/sec63.901   False
   54 CastUInt32ToInt32Safe/32768/1000  696.762m items/sec1.132b 
items/sec62.468   False
   43   CastDoubleToInt32Safe/262144/11.493b items/sec2.333b 
items/sec56.229   False
   66CastDoubleToInt32Safe/32768/11.366b items/sec2.001b 
items/sec46.571   False
   61   CastInt64ToDoubleSafe/262144/11.489b items/sec2.162b 
items/sec45.245   False
   67CastInt64ToDoubleSafe/32768/11.370b items/sec1.854b 
items/sec35.316   False
   5 CastUInt32ToInt32Safe/32768/2  204.642m items/sec  253.501m 
items/sec23.876   False
   40CastInt64ToDoubleSafe/262144/1000  802.584m items/sec  987.873m 
items/sec23.087   False
   25   CastUInt32ToInt32Safe/32768/10  451.556m items/sec  546.766m 
items/sec21.085   False
   24CastDoubleToInt32Safe/262144/1000  744.363m items/sec  899.943m 
items/sec20.901   False
   3  CastInt64ToDoubleSafe/32768/1000  753.040m items/sec  906.901m 
items/sec20.432   False
   53 CastInt64ToInt32Safe/262144/1000  968.574m items/sec1.158b 
items/sec19.578   False
   49 CastDoubleToInt32Safe/32768/1000  697.915m items/sec  829.688m 
items/sec18.881   False
   55  CastInt64ToInt32Safe/32768/1000  893.465m items/sec1.051b 
items/sec17.641   False
   57CastInt64ToDoubleSafe/32768/2  215.963m items/sec  238.114m 
items/sec10.257   False
   15   CastUInt32ToInt32Safe/262144/2  212.126m items/sec  232.094m 
items/sec 9.414   False
   12   CastInt64ToDoubleSafe/32768/10  443.363m items/sec  483.006m 
items/sec 8.942   False
   18 CastInt64ToInt32Safe/32768/2  219.356m items/sec  236.947m 
items/sec 8.020   False
   7CastInt64ToDoubleSafe/262144/2  220.790m items/sec  225.300m 
items/sec 2.043   False
   69CastInt64ToInt32Unsafe/32768/10003.262b items/sec3.304b 
items/sec 1.278   False
   14  CastInt64ToInt32Unsafe/262144/04.271b items/sec4.306b 
items/sec 0.814   False
   22   CastInt64ToInt32Unsafe/32768/23.272b items/sec3.290b 
items/sec 0.546   False
   64CastInt64ToInt32Safe/262144/2  228.532m items/sec  229.766m 
items/sec 0.540   False
   56  CastUInt32ToInt32Safe/262144/10  467.505m items/sec  469.887m 
items/sec 0.509   False
   21   CastInt64ToInt32Unsafe/32768/03.332b items/sec3.343b 
items/sec 0.332   False
   13  CastInt64ToInt32Unsafe/262144/14.292b items/sec4.300b 
items/sec 0.195   False
   9   CastInt64ToInt32Unsafe/32768/103.284b items/sec3.290b 
items/sec 0.178   False
   20   CastInt64ToInt32Unsafe/32768/13.324b items/sec3.328b 
items/sec 0.125   False
   28CastInt64ToDoubleUnsafe/262144/102.507b items/sec2.510b 
items/sec 0.103   False
   36   CastInt64ToInt32Unsafe/262144/10004.273b items/sec4.274b 
items/sec 0.037   False
   68  CastInt64ToDoubleUnsafe/32768/12.125b items/sec2.122b 
items/sec-0.153   False
   39  CastInt64ToDoubleUnsafe/32768/02.137b items/sec2.132b 
items/sec-0.208   False
   19 CastInt64ToDoubleUnsafe/262144/22.511b items/sec2.505b 
items/sec-0.234   False
   33 CastInt64ToInt32Unsafe/262144/104.288b items/sec4.271b 
items/sec-0.395   False
   2   CastInt64ToDoubleUnsafe/32768/22.119b items/sec2.110b 
items/sec-0.455   False
   58  CastInt64ToInt32Unsafe/262144/24.301b items/sec4.277b 
items/sec-0.549   False
   48 CastInt64ToDoubleUnsafe/262144/12.545b items/sec2.528b 
items/sec-0.651   False
   0   CastInt64ToDoubleUnsafe/262144/10002.525b items/sec2.501b 
items/sec-0.941   False
   45 CastInt64ToDoubleUnsafe/262144/02.546b items/sec2.518b 
items/sec-1.098   False
   30 CastInt64ToDoubleUnsafe/32768/102.131b items/sec2.097b 
items/sec

[GitHub] [arrow] wesm commented on a change in pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox


wesm commented on a change in pull request #7506:
URL: https://github.com/apache/arrow/pull/7506#discussion_r443696884



##
File path: cpp/src/arrow/util/int_util.cc
##
@@ -461,75 +472,434 @@ Status IndexBoundsCheckImpl(const ArrayData& indices, 
uint64_t upper_limit) {
   if (indices.buffers[0]) {
 bitmap = indices.buffers[0]->data();
   }
-  auto IsOutOfBounds = [&](int64_t i) -> bool {
-return (
-(IsSigned && indices_data[i] < 0) ||
-(indices_data[i] >= 0 && static_cast(indices_data[i]) >= 
upper_limit));
+  auto IsOutOfBounds = [&](IndexCType val) -> bool {
+return ((IsSigned && val < 0) ||
+(val >= 0 && static_cast(val) >= upper_limit));
+  };
+  auto IsOutOfBoundsMaybeNull = [&](IndexCType val, bool is_valid) -> bool {
+return is_valid && ((IsSigned && val < 0) ||
+(val >= 0 && static_cast(val) >= 
upper_limit));
   };
   OptionalBitBlockCounter indices_bit_counter(bitmap, indices.offset, 
indices.length);
   int64_t position = 0;
+  int64_t offset_position = indices.offset;
   while (position < indices.length) {
 BitBlockCount block = indices_bit_counter.NextBlock();
 bool block_out_of_bounds = false;
 if (block.popcount == block.length) {
   // Fast path: branchless
   for (int64_t i = 0; i < block.length; ++i) {
-block_out_of_bounds |= IsOutOfBounds(i);
+block_out_of_bounds |= IsOutOfBounds(indices_data[i]);
   }
 } else if (block.popcount > 0) {
   // Indices have nulls, must only boundscheck non-null values
-  for (int64_t i = 0; i < block.length; ++i) {
-if (BitUtil::GetBit(bitmap, indices.offset + position + i)) {
-  block_out_of_bounds |= IsOutOfBounds(i);
+  int64_t i = 0;
+  for (int64_t chunk = 0; chunk < block.length / 8; ++chunk) {
+// Let the compiler unroll this
+for (int j = 0; j < 8; ++j) {
+  block_out_of_bounds |= IsOutOfBoundsMaybeNull(
+  indices_data[i], BitUtil::GetBit(bitmap, offset_position + i));
+  ++i;
 }
   }
+  for (; i < block.length; ++i) {
+block_out_of_bounds |= IsOutOfBoundsMaybeNull(
+indices_data[i], BitUtil::GetBit(bitmap, offset_position + i));
+  }
 }
 if (ARROW_PREDICT_FALSE(block_out_of_bounds)) {
   if (indices.GetNullCount() > 0) {
 for (int64_t i = 0; i < block.length; ++i) {
-  if (BitUtil::GetBit(bitmap, indices.offset + position + i)) {
-if (IsOutOfBounds(i)) {
-  return Status::IndexError("Index ", 
static_cast(indices_data[i]),
-" out of bounds");
-}
+  if (IsOutOfBoundsMaybeNull(indices_data[i],
+ BitUtil::GetBit(bitmap, offset_position + 
i))) {
+return Status::IndexError("Index ", FormatInt(indices_data[i]),
+  " out of bounds");
   }
 }
   } else {
 for (int64_t i = 0; i < block.length; ++i) {
-  if (IsOutOfBounds(i)) {
-return Status::IndexError("Index ", 
static_cast(indices_data[i]),
+  if (IsOutOfBounds(indices_data[i])) {
+return Status::IndexError("Index ", FormatInt(indices_data[i]),
   " out of bounds");
   }
 }
   }
 }
 indices_data += block.length;
 position += block.length;
+offset_position += block.length;
   }
   return Status::OK();
 }
 
 /// \brief Branchless boundschecking of the indices. Processes batches of
 /// indices at a time and shortcircuits when encountering an out-of-bounds
 /// index in a batch
-Status IndexBoundsCheck(const ArrayData& indices, uint64_t upper_limit) {
+Status CheckIndexBounds(const ArrayData& indices, uint64_t upper_limit) {
   switch (indices.type->id()) {
 case Type::INT8:
-  return IndexBoundsCheckImpl(indices, upper_limit);
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT16:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT32:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT64:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT8:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT16:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT32:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT64:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+default:
+  return Status::Invalid("Invalid index type for boundschecking");
+  }
+}
+
+// --
+// Utilities for casting from one integer type to another
+
+template 
+Status IntegersInRange(const Datum& datum, CType bound_lower, CType 
bound_upper) {
+  if (std::numeric_limits::lowest() >= bound_lower &&
+

[GitHub] [arrow] wesm commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


wesm commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647641007


   @kszucs can you assist me with adapting ursabot for these changes? I think 
we can use pandas's `DataFrame.to_html` to create a colorized table for GitHub, 
too https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
   
   Changes that would be good to have in `ursabot benchmark`:
   
   * Pass through `--cc` and `--cxx` options
   * Pass through `--repetitions`



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7516:
URL: https://github.com/apache/arrow/pull/7516#issuecomment-647641059


   https://issues.apache.org/jira/browse/ARROW-9201



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm opened a new pull request #7516: ARROW-9201: [Archery] More user-friendly console output for benchmark diffs, add repetitions argument, don't build unit tests

2020-06-22 Thread GitBox


wesm opened a new pull request #7516:
URL: https://github.com/apache/arrow/pull/7516


   This uses pandas to generate a sorted text table when using `archery 
benchmark diff`. Example:
   
   https://github.com/apache/arrow/pull/7506#issuecomment-647633470
   
   There's some other incidental changes
   
   * pandas is required for `archery benchmark diff`. I don't think there's 
value in reimplementing the stuff that pandas can do in a few lines of code 
(read JSON, create a sorted table and print it nicely for us). 
   * The default # of benchmarks repetitions has been changed from 10 to 1 (see 
ARROW-9155 for context). IMHO more interactive benchmark results is more useful 
than higher precision. If you need higher precision you can pass 
`--repetitions=10` on the command line
   * `archery benchmark` was building the unit tests unnecessarily. This also 
occluded a bug ARROW-9209, which is fixed here



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on a change in pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox


wesm commented on a change in pull request #7506:
URL: https://github.com/apache/arrow/pull/7506#discussion_r443689349



##
File path: cpp/src/arrow/util/int_util.cc
##
@@ -461,75 +472,434 @@ Status IndexBoundsCheckImpl(const ArrayData& indices, 
uint64_t upper_limit) {
   if (indices.buffers[0]) {
 bitmap = indices.buffers[0]->data();
   }
-  auto IsOutOfBounds = [&](int64_t i) -> bool {
-return (
-(IsSigned && indices_data[i] < 0) ||
-(indices_data[i] >= 0 && static_cast(indices_data[i]) >= 
upper_limit));
+  auto IsOutOfBounds = [&](IndexCType val) -> bool {
+return ((IsSigned && val < 0) ||
+(val >= 0 && static_cast(val) >= upper_limit));
+  };
+  auto IsOutOfBoundsMaybeNull = [&](IndexCType val, bool is_valid) -> bool {
+return is_valid && ((IsSigned && val < 0) ||
+(val >= 0 && static_cast(val) >= 
upper_limit));
   };
   OptionalBitBlockCounter indices_bit_counter(bitmap, indices.offset, 
indices.length);
   int64_t position = 0;
+  int64_t offset_position = indices.offset;
   while (position < indices.length) {
 BitBlockCount block = indices_bit_counter.NextBlock();
 bool block_out_of_bounds = false;
 if (block.popcount == block.length) {
   // Fast path: branchless
   for (int64_t i = 0; i < block.length; ++i) {
-block_out_of_bounds |= IsOutOfBounds(i);
+block_out_of_bounds |= IsOutOfBounds(indices_data[i]);
   }
 } else if (block.popcount > 0) {
   // Indices have nulls, must only boundscheck non-null values
-  for (int64_t i = 0; i < block.length; ++i) {
-if (BitUtil::GetBit(bitmap, indices.offset + position + i)) {
-  block_out_of_bounds |= IsOutOfBounds(i);
+  int64_t i = 0;
+  for (int64_t chunk = 0; chunk < block.length / 8; ++chunk) {
+// Let the compiler unroll this
+for (int j = 0; j < 8; ++j) {
+  block_out_of_bounds |= IsOutOfBoundsMaybeNull(
+  indices_data[i], BitUtil::GetBit(bitmap, offset_position + i));
+  ++i;
 }
   }
+  for (; i < block.length; ++i) {
+block_out_of_bounds |= IsOutOfBoundsMaybeNull(
+indices_data[i], BitUtil::GetBit(bitmap, offset_position + i));
+  }
 }
 if (ARROW_PREDICT_FALSE(block_out_of_bounds)) {
   if (indices.GetNullCount() > 0) {
 for (int64_t i = 0; i < block.length; ++i) {
-  if (BitUtil::GetBit(bitmap, indices.offset + position + i)) {
-if (IsOutOfBounds(i)) {
-  return Status::IndexError("Index ", 
static_cast(indices_data[i]),
-" out of bounds");
-}
+  if (IsOutOfBoundsMaybeNull(indices_data[i],
+ BitUtil::GetBit(bitmap, offset_position + 
i))) {
+return Status::IndexError("Index ", FormatInt(indices_data[i]),
+  " out of bounds");
   }
 }
   } else {
 for (int64_t i = 0; i < block.length; ++i) {
-  if (IsOutOfBounds(i)) {
-return Status::IndexError("Index ", 
static_cast(indices_data[i]),
+  if (IsOutOfBounds(indices_data[i])) {
+return Status::IndexError("Index ", FormatInt(indices_data[i]),
   " out of bounds");
   }
 }
   }
 }
 indices_data += block.length;
 position += block.length;
+offset_position += block.length;
   }
   return Status::OK();
 }
 
 /// \brief Branchless boundschecking of the indices. Processes batches of
 /// indices at a time and shortcircuits when encountering an out-of-bounds
 /// index in a batch
-Status IndexBoundsCheck(const ArrayData& indices, uint64_t upper_limit) {
+Status CheckIndexBounds(const ArrayData& indices, uint64_t upper_limit) {
   switch (indices.type->id()) {
 case Type::INT8:
-  return IndexBoundsCheckImpl(indices, upper_limit);
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT16:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT32:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::INT64:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT8:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT16:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT32:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+case Type::UINT64:
+  return CheckIndexBoundsImpl(indices, upper_limit);
+default:
+  return Status::Invalid("Invalid index type for boundschecking");
+  }
+}
+
+// --
+// Utilities for casting from one integer type to another
+
+template 
+Status IntegersInRange(const Datum& datum, CType bound_lower, CType 
bound_upper) {
+  if (std::numeric_limits::lowest() >= bound_lower &&
+

[GitHub] [arrow] pitrou commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox


pitrou commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-647633067


   Well, can't you just use `BufferReader` and `BufferOutputStream`? Am I 
missing something?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] wesm commented on pull request #7506: ARROW-9197: [C++] Overhaul integer/floating point casting: vectorize truncation checks, reduce binary size

2020-06-22 Thread GitBox


wesm commented on pull request #7506:
URL: https://github.com/apache/arrow/pull/7506#issuecomment-647633470


   Here's the benchmark comparison with clang-8
   
   ```
   $ archery benchmark diff --cc=clang-8 --cxx=clang++-8 89844a100 653817301 
--benchmark-filter=Cast
 benchmarkbaseline   
contender  change %  regression
   41   CastUInt32ToInt32Safe/262144/1  769.933m items/sec4.474b 
items/sec   481.076   False
   62   CastUInt32ToInt32Safe/262144/0  409.277m items/sec2.189b 
items/sec   434.792   False
   15CastUInt32ToInt32Safe/32768/0  399.127m items/sec2.089b 
items/sec   423.357   False
   7 CastUInt32ToInt32Safe/32768/1  742.226m items/sec3.721b 
items/sec   401.307   False
   18CastInt64ToInt32Safe/262144/0  341.403m items/sec1.569b 
items/sec   359.706   False
   55CastUInt32ToInt32Safe/262144/1000  351.335m items/sec1.612b 
items/sec   358.917   False
   16 CastUInt32ToInt32Safe/32768/1000  334.147m items/sec1.484b 
items/sec   344.139   False
   47 CastInt64ToInt32Safe/32768/0  328.491m items/sec1.414b 
items/sec   330.412   False
   30CastInt64ToInt32Safe/262144/1  742.497m items/sec3.131b 
items/sec   321.638   False
   51 CastInt64ToInt32Safe/262144/1000  304.244m items/sec1.226b 
items/sec   302.928   False
   42  CastInt64ToInt32Safe/32768/1000  288.976m items/sec1.101b 
items/sec   280.942   False
   32 CastInt64ToInt32Safe/32768/1  706.339m items/sec2.552b 
items/sec   261.273   False
   45   CastDoubleToInt32Safe/262144/1  924.369m items/sec2.997b 
items/sec   224.214   False
   54   CastInt64ToDoubleSafe/262144/0  419.319m items/sec1.324b 
items/sec   215.783   False
   11CastInt64ToDoubleSafe/32768/0  408.425m items/sec1.216b 
items/sec   197.672   False
   49   CastDoubleToInt32Safe/262144/2  207.799m items/sec  614.658m 
items/sec   195.795   False
   9 CastDoubleToInt32Safe/32768/2  202.480m items/sec  584.558m 
items/sec   188.699   False
   23CastInt64ToDoubleSafe/262144/1000  375.572m items/sec1.078b 
items/sec   186.948   False
   50CastDoubleToInt32Safe/32768/1  869.447m items/sec2.445b 
items/sec   181.248   False
   21   CastInt64ToDoubleSafe/262144/1  790.625m items/sec2.222b 
items/sec   181.101   False
   59CastDoubleToInt32Safe/262144/1000  360.792m items/sec1.013b 
items/sec   180.714   False
   44 CastInt64ToDoubleSafe/32768/1000  360.492m items/sec  988.897m 
items/sec   174.319   False
   48 CastDoubleToInt32Safe/32768/1000  349.576m items/sec  932.771m 
items/sec   166.829   False
   58   CastDoubleToInt32Safe/262144/0  407.159m items/sec1.067b 
items/sec   162.086   False
   63CastInt64ToDoubleSafe/32768/1  746.561m items/sec1.893b 
items/sec   153.520   False
   8 CastDoubleToInt32Safe/32768/0  395.857m items/sec  990.704m 
items/sec   150.268   False
   67  CastDoubleToInt32Safe/262144/10  275.237m items/sec  612.002m 
items/sec   122.354   False
   10   CastDoubleToInt32Safe/32768/10  266.596m items/sec  583.346m 
items/sec   118.813   False
   69CastInt64ToInt32Safe/32768/10  232.545m items/sec  449.883m 
items/sec93.461   False
   64   CastUInt32ToInt32Safe/32768/10  256.845m items/sec  482.636m 
items/sec87.909   False
   61   CastInt64ToInt32Safe/262144/10  243.012m items/sec  435.232m 
items/sec79.099   False
   0   CastUInt32ToInt32Safe/262144/10  264.244m items/sec  466.981m 
items/sec76.723   False
   53   CastInt64ToDoubleSafe/32768/10  278.548m items/sec  441.752m 
items/sec58.591   False
   1   CastInt64ToDoubleSafe/262144/10  283.181m items/sec  431.990m 
items/sec52.549   False
   14 CastInt64ToInt32Safe/32768/2  170.844m items/sec  224.195m 
items/sec31.228   False
   37CastUInt32ToInt32Safe/32768/2  182.246m items/sec  238.051m 
items/sec30.621   False
   27   CastUInt32ToInt32Safe/262144/2  187.277m items/sec  231.385m 
items/sec23.553   False
   28CastInt64ToInt32Safe/262144/2  175.893m items/sec  216.887m 
items/sec23.306   False
   26CastInt64ToDoubleSafe/32768/2  189.465m items/sec  228.996m 
items/sec20.864   False
   3CastInt64ToDoubleSafe/262144/2  192.523m items/sec  219.324m 
items/sec13.921   False
   36   CastInt64ToInt32Unsafe/32768/02.993b items/sec3.227b 
items/sec 7.800   False
   35   CastInt64ToInt32Unsafe/32768/22.937b items/sec3.154b 
items/sec 7.367   False
   68   CastInt64ToInt32Unsafe/32768/12.966b items/sec3.176b 
items/sec 7.088   False
   43  CastInt64ToInt32Unsafe/32768/102.940b 

[GitHub] [arrow] alexbaden commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox


alexbaden commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-647628896


   I'd like to have an object similar to `RecordBatchStreamReader` that could 
read the schema, IPC handles, and dictionary memo from a single buffer, and a 
corresponding writer of course.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox


pitrou commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-647627751


   @alexbaden I'm not sure I understand what you have in mind. Could you 
elaborate?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] alexbaden commented on pull request #7263: ARROW-8927: [C++] Support dictionary memo in CUDA IPC ReadRecordBatch functions

2020-06-22 Thread GitBox


alexbaden commented on pull request #7263:
URL: https://github.com/apache/arrow/pull/7263#issuecomment-647623057


   I'd be interested in trying to add it into the record batch stream reader -- 
the idea being you could send both CPU data and GPU data pointers in a single 
serialized object with a single call. We could then replace this code 
https://github.com/omnisci/pymapd/blob/91520f021b2a92cc2a6affeb0e5084641e0637d7/pymapd/_parsers.py#L200
 with roughly a single call. I'm more comfortable in C++  so I might try it 
there, first (replacing this code: 
https://github.com/omnisci/omniscidb/blob/a84b42d324af2f4406abaaf83c465ce11c6039ab/Tests/ArrowIpcIntegrationTest.cpp#L378)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] github-actions[bot] commented on pull request #7515: ARROW-2801: [Python] Add split_row_group keyword to ParquetDataset / document split_by_row_group

2020-06-22 Thread GitBox


github-actions[bot] commented on pull request #7515:
URL: https://github.com/apache/arrow/pull/7515#issuecomment-647616782


   https://issues.apache.org/jira/browse/ARROW-2801



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7515: ARROW-2801: [Python] Add split_row_group keyword to ParquetDataset / document split_by_row_group

2020-06-22 Thread GitBox


jorisvandenbossche commented on a change in pull request #7515:
URL: https://github.com/apache/arrow/pull/7515#discussion_r443663674



##
File path: python/pyarrow/parquet.py
##
@@ -1404,27 +1403,36 @@ def __init__(self, path_or_paths, filesystem=None, 
filters=None,
 self._filter_expression = filters and _filters_to_expression(filters)
 
 # check for single NativeFile dataset
-if not isinstance(path_or_paths, list):
-if not _is_path_like(path_or_paths):
-fragment = parquet_format.make_fragment(path_or_paths)
-self._dataset = ds.FileSystemDataset(
-[fragment], schema=fragment.physical_schema,
-format=parquet_format
-)
-return
-
-# map old filesystems to new one
-# TODO(dataset) deal with other file systems
-if isinstance(filesystem, LocalFileSystem):
-filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
-elif filesystem is None and memory_map:
-# if memory_map is specified, assume local file system (string
-# path can in principle be URI for any filesystem)
-filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
-
-self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
-   format=parquet_format,
-   partitioning=partitioning)
+if (not isinstance(path_or_paths, list) and
+not _is_path_like(path_or_paths)):
+fragment = parquet_format.make_fragment(path_or_paths)
+dataset = ds.FileSystemDataset(
+[fragment], schema=fragment.physical_schema,
+format=parquet_format
+)
+else:
+# map old filesystems to new one
+# TODO(dataset) deal with other file systems
+if isinstance(filesystem, LocalFileSystem):
+filesystem = pyarrow.fs.LocalFileSystem(use_mmap=memory_map)
+elif filesystem is None and memory_map:
+# if memory_map is specified, assume local file system (string
+# path can in principle be URI for any filesystem)
+filesystem = pyarrow.fs.LocalFileSystem(use_mmap=True)
+
+dataset = ds.dataset(path_or_paths, filesystem=filesystem,
+ format=parquet_format,
+ partitioning=partitioning)
+
+if split_row_groups:
+fragments = dataset.get_fragments()
+fragments = [rg for fragment in fragments
+ for rg in fragment.split_by_row_group()]
+dataset = ds.FileSystemDataset(
+fragments, dataset.schema, dataset.format,
+dataset.partition_expression
+)

Review comment:
   This is basically what was requested in ARROW-2801, but I am not fully 
sure whether it is actually worth adding it here (we are adding it to 
ParquetDataset for which it is not yet clear we are keeping it in the future). 
And if we want it, it's maybe rather something to add in the actual Dataset 
class (or DatasetFactory).





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] jorisvandenbossche opened a new pull request #7515: ARROW-2801: [Python] Add split_row_group keyword to ParquetDataset / document split_by_row_group

2020-06-22 Thread GitBox


jorisvandenbossche opened a new pull request #7515:
URL: https://github.com/apache/arrow/pull/7515


   ARROW-2801
   
   Still WIP, didn't yet add tests



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-647607661


   @rok Could you please review the pyarrow part?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on pull request #7477: ARROW-4221: Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#issuecomment-647607307


   @wesm @pitrou Could you please review it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7477: ARROW-4221: Add canonical flag in COO sparse index

2020-06-22 Thread GitBox


mrkn commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443655002



##
File path: python/pyarrow/tensor.pxi
##
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
 for x in dim_names:
 c_dim_names.push_back(tobytes(x))
 
-coords = np.vstack([obj.row, obj.col]).T
+row = obj.row
+col = obj.col
+if obj.has_canonical_format:
+order = np.lexsort((col, row))  # row-major order

Review comment:
   Sorting indices is necessary to conserve the canonical format.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] pitrou closed pull request #7496: ARROW-7084: [C++] Check for full type equality in ArrayRangeEquals

2020-06-22 Thread GitBox


pitrou closed pull request #7496:
URL: https://github.com/apache/arrow/pull/7496


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7290: ARROW-1692: [Java] UnionArray round trip not working

2020-06-22 Thread GitBox


nealrichardson commented on pull request #7290:
URL: https://github.com/apache/arrow/pull/7290#issuecomment-647589883


   Assuming CI is still passing, is this good to merge? What is left to do, or 
who needs to approve?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] fsaintjacques commented on a change in pull request #7514: ARROW-6235 : [R] Implement conversion from arrow::BinaryArray to R character vector

2020-06-22 Thread GitBox


fsaintjacques commented on a change in pull request #7514:
URL: https://github.com/apache/arrow/pull/7514#discussion_r443639526



##
File path: r/src/array_to_vector.cpp
##
@@ -693,6 +741,9 @@ std::shared_ptr Converter::Make(const 
std::shared_ptr& type
 case Type::BOOL:
   return std::make_shared(std::move(arrays));
 
+case Type::BINARY:

Review comment:
   Since this is in your mental cache, we can accept FIXED_BINARY and 
LARGE_BINARY with minimal effort by using `GetView` instead (and some template 
dispatching because they don't share the same base class). LARGE_BINARY will 
require to bound check each value with INT32_MAX.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on a change in pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-22 Thread GitBox


nealrichardson commented on a change in pull request #7297:
URL: https://github.com/apache/arrow/pull/7297#discussion_r443641664



##
File path: dev/archery/archery/integration/datagen.py
##
@@ -1492,21 +1492,25 @@ def _temp_path():
 
 generate_primitive_large_offsets_case([17, 20])
 .skip_category('Go')
-.skip_category('JS'),
+.skip_category('JS')
+.skip_category('Rust'),
 
 generate_null_case([10, 0])
+.skip_category('Rust')

Review comment:
   Didn't I see a patch merge recently that added null support? If you 
rebase, can you remove this skip?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nealrichardson commented on pull request #7272: ARROW-8314: [Python] Add a Table.select method to select a subset of columns

2020-06-22 Thread GitBox


nealrichardson commented on pull request #7272:
URL: https://github.com/apache/arrow/pull/7272#issuecomment-647587606


   > I added a C++ version (didn't yet update R to use it)
   
   Are you intending to make that change in this PR too?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] nevi-me commented on pull request #7297: ARROW-6945: [Rust][Integration] Run rust integration tests

2020-06-22 Thread GitBox


nevi-me commented on pull request #7297:
URL: https://github.com/apache/arrow/pull/7297#issuecomment-647579286


   Oh, there's also a TODO on array data comparisons. I think we only compare 
the array lengths and types for now; but not their data.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




  1   2   >