[jira] [Resolved] (ARROW-6286) [GLib] Add support for LargeList type

2019-10-25 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-6286.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5710
[https://github.com/apache/arrow/pull/5710]

> [GLib] Add support for LargeList type
> -
>
> Key: ARROW-6286
> URL: https://issues.apache.org/jira/browse/ARROW-6286
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6998) Ability to read from URL for pyarrow's read_feather

2019-10-25 Thread Ryan McCarthy (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960197#comment-16960197
 ] 

Ryan McCarthy commented on ARROW-6998:
--

Thanks for the feedback! I'll look into submitting a PR or issue to pandas.

> Ability to read from URL for pyarrow's read_feather
> ---
>
> Key: ARROW-6998
> URL: https://issues.apache.org/jira/browse/ARROW-6998
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Ryan McCarthy
>Priority: Major
>
> See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for 
> more info. Many of the pandas `read_format()` methods allow you to supply a URL, 
> except for the `read_feather()` method. This would be a nice-to-have feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6986) [R] Add basic Expression class

2019-10-25 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6986.

Resolution: Fixed

Issue resolved by pull request 5730
[https://github.com/apache/arrow/pull/5730]

> [R] Add basic Expression class
> --
>
> Key: ARROW-6986
> URL: https://issues.apache.org/jira/browse/ARROW-6986
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I started this as part of ARROW-6980 but it proved not necessary. This will 
> be a foundation for ARROW-6982, in addition to being useful on its own.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6998) Ability to read from URL for pyarrow's read_feather

2019-10-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960184#comment-16960184
 ] 

Wes McKinney commented on ARROW-6998:
-

I think this is better as a pandas issue. We have avoided doing anything too 
magical in the pyarrow API.

> Ability to read from URL for pyarrow's read_feather
> ---
>
> Key: ARROW-6998
> URL: https://issues.apache.org/jira/browse/ARROW-6998
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Ryan McCarthy
>Priority: Major
>
> See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for 
> more info. Many of the pandas `read_format()` methods allow you to supply a URL, 
> except for the `read_feather()` method. This would be a nice-to-have feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6997) [Packaging] Add support for RHEL

2019-10-25 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960177#comment-16960177
 ] 

Kouhei Sutou commented on ARROW-6997:
-

It seems that Bintray doesn't support symbolic links.
We should create a suitable {{.repo}} file for each environment.

> [Packaging] Add support for RHEL
> 
>
> Key: ARROW-6997
> URL: https://issues.apache.org/jira/browse/ARROW-6997
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>
> We need symbolic links to {{${VERSION}Server}} from {{${VERSION}}}, such as 
> {{7Server}} from {{7}}. (Is that available on Bintray?)
> We also need to update the install information. We can't install {{epel-release}} 
> with {{yum install -y epel-release}}; we need to specify the URL explicitly: {{yum 
> install 
> https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See 
> https://fedoraproject.org/wiki/EPEL for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6997) [Packaging] Add support for RHEL

2019-10-25 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6997:
---

 Summary: [Packaging] Add support for RHEL
 Key: ARROW-6997
 URL: https://issues.apache.org/jira/browse/ARROW-6997
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


We need symbolic links to {{${VERSION}Server}} from {{${VERSION}}}, such as 
{{7Server}} from {{7}}. (Is that available on Bintray?)

We also need to update the install information. We can't install {{epel-release}} 
with {{yum install -y epel-release}}; we need to specify the URL explicitly: {{yum 
install 
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See 
https://fedoraproject.org/wiki/EPEL for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6996) [Python] Expose boolean filter kernel on Table

2019-10-25 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960026#comment-16960026
 ] 

Neal Richardson commented on ARROW-6996:


I've implemented the C++ for this in ARROW-6784, just need to wrap that up and 
then you can add Python bindings.

> [Python] Expose boolean filter kernel on Table
> --
>
> Key: ARROW-6996
> URL: https://issues.apache.org/jira/browse/ARROW-6996
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe Korn
>Priority: Major
>  Labels: iceberg
>
> This is currently only implemented for Array but would also be useful on 
> Tables and ChunkedArrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6998) Ability to read from URL for pyarrow's read_feather

2019-10-25 Thread Ryan McCarthy (Jira)
Ryan McCarthy created ARROW-6998:


 Summary: Ability to read from URL for pyarrow's read_feather
 Key: ARROW-6998
 URL: https://issues.apache.org/jira/browse/ARROW-6998
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Ryan McCarthy


See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for 
more info. Many of the pandas `read_format()` methods allow you to supply a URL, 
except for the `read_feather()` method. This would be a nice-to-have feature.
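
In the meantime, a minimal workaround sketch: fetch the bytes yourself and hand 
pyarrow a file-like object. This assumes pyarrow.feather.read_feather accepts a 
file-like object, and the URL below is only a placeholder.

{code:python}
import io
import urllib.request

import pyarrow.feather as feather


def read_feather_url(url):
    """Read a Feather file from a URL by buffering it in memory first.

    Assumes pyarrow.feather.read_feather accepts a file-like object.
    """
    with urllib.request.urlopen(url) as response:
        buf = io.BytesIO(response.read())
    return feather.read_feather(buf)

# df = read_feather_url("https://example.com/data.feather")  # placeholder URL
{code}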



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6996) [Python] Expose boolean filter kernel on Table

2019-10-25 Thread Uwe Korn (Jira)
Uwe Korn created ARROW-6996:
---

 Summary: [Python] Expose boolean filter kernel on Table
 Key: ARROW-6996
 URL: https://issues.apache.org/jira/browse/ARROW-6996
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Uwe Korn


This is currently only implemented for Array but would also be useful on Tables 
and ChunkedArrays.
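
For illustration, a rough sketch of what wrapping the Array-level kernel could 
look like, assuming the kernel is exposed in Python as pyarrow.compute.filter 
(the name it received in later releases):

{code:python}
import pyarrow as pa
import pyarrow.compute as pc


def filter_columns(named_arrays, mask):
    """Apply one boolean mask to several Arrays and rebuild a Table."""
    return pa.table({name: pc.filter(arr, mask)
                     for name, arr in named_arrays.items()})


arrays = {"x": pa.array([1, 2, 3]), "y": pa.array(["a", "b", "c"])}
mask = pa.array([True, False, True])
print(filter_columns(arrays, mask).to_pydict())  # {'x': [1, 3], 'y': ['a', 'c']}
{code}

A Table-level binding would essentially do the same thing per column, and per 
chunk for ChunkedArrays.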



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6988) [CI][R] Buildbot's R Conda is failing

2019-10-25 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959967#comment-16959967
 ] 

Krisztian Szucs commented on ARROW-6988:


I've managed to reproduce it locally, but I don't have time to investigate it 
further right now, so I've turned it off.

> [CI][R] Buildbot's R Conda is failing
> -
>
> Key: ARROW-6988
> URL: https://issues.apache.org/jira/browse/ARROW-6988
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> {code:java}
>   Running ‘testthat.R’
>  ERROR
> Running the tests in ‘tests/testthat.R’ failed.
> Last 13 lines of output:
>   25: tryCatch(withCallingHandlers({eval(code, test_env)if (!handled 
> && !is.null(test)) {skip_empty()}}, expectation = 
> handle_expectation, skip = handle_skip, warning = handle_warning, message 
> = handle_message, error = handle_error), error = handle_fatal, skip = 
> function(e) {})
>   26: test_code(NULL, exprs, env)
>   27: source_file(path, new.env(parent = env), chdir = TRUE, wrap = wrap)
>   28: force(code)
>   29: with_reporter(reporter = reporter, start_end_reporter = 
> start_end_reporter, {reporter$start_file(basename(path))
> lister$start_file(basename(path))source_file(path, new.env(parent = 
> env), chdir = TRUE, wrap = wrap)reporter$.end_context()   
>  reporter$end_file()})
>   30: FUN(X[[i]], ...)
>   31: lapply(paths, test_file, env = env, reporter = current_reporter, 
> start_end_reporter = FALSE, load_helpers = FALSE, wrap = wrap)
>   32: force(code)
>   33: with_reporter(reporter = current_reporter, results <- lapply(paths, 
> test_file, env = env, reporter = current_reporter, start_end_reporter = 
> FALSE, load_helpers = FALSE, wrap = wrap))
>   34: test_files(paths, reporter = reporter, env = env, stop_on_failure = 
> stop_on_failure, stop_on_warning = stop_on_warning, wrap = wrap)
>   35: test_dir(path = test_path, reporter = reporter, env = env, filter = 
> filter, ..., stop_on_failure = stop_on_failure, stop_on_warning = 
> stop_on_warning, wrap = wrap)
>   36: test_package_dir(package = package, test_path = test_path, filter = 
> filter, reporter = reporter, ..., stop_on_failure = stop_on_failure, 
> stop_on_warning = stop_on_warning, wrap = wrap)
>   37: test_check("arrow")
>   An irrecoverable exception occurred. R is aborting now ...
>   Segmentation fault (core dumped)
> * checking for unstated dependencies in vignettes ... OK
> * checking package vignettes in ‘inst/doc’ ... OK
> * checking re-building of vignette outputs ... OK
> * DONE
> Status: 1 ERROR, 1 WARNING, 2 NOTEs
> See
>   ‘/buildbot/AMD64_Conda_R/r/arrow.Rcheck/00check.log’
> for details.
>  {code}
> [https://ci.ursalabs.org/#/builders/95] 
> [https://ci.ursalabs.org/#/builders/95/builds/2386]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6975) [C++] Put make_unique in its own header

2019-10-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6975:

Fix Version/s: 1.0.0

> [C++] Put make_unique in its own header
> ---
>
> Key: ARROW-6975
> URL: https://issues.apache.org/jira/browse/ARROW-6975
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 1.0.0
>
>
> {{arrow/util/stl.h}} carries other stuff that is almost never necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5679) [Python] Drop Python 3.5 from support matrix

2019-10-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959951#comment-16959951
 ] 

Wes McKinney commented on ARROW-5679:
-

I changed the issue scope to drop Python 3.5 altogether. NumPy is about to drop 
3.5, so we should too:

https://numpy.org/neps/nep-0029-deprecation_policy.html

> [Python] Drop Python 3.5 from support matrix
> 
>
> Key: ARROW-5679
> URL: https://issues.apache.org/jira/browse/ARROW-5679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We probably need to maintain Python 3.5 on Linux and macOS for the time 
> being, but we may want to drop it for Windows: conda-forge isn't 
> supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come 
> with extra cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5679) [Python] Drop Python 3.5 from support matrix

2019-10-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5679:

Summary: [Python] Drop Python 3.5 from support matrix  (was: [Python] Drop 
Python 3.5 from Windows wheel builds)

> [Python] Drop Python 3.5 from support matrix
> 
>
> Key: ARROW-5679
> URL: https://issues.apache.org/jira/browse/ARROW-5679
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> We probably need to maintain Python 3.5 on Linux and macOS for the time 
> being, but we may want to drop it for Windows: conda-forge isn't 
> supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come 
> with extra cost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6995) [Packaging][Crossbow] The windows conda artifacts are not uploaded to GitHub releases

2019-10-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6995:
--

 Summary: [Packaging][Crossbow] The windows conda artifacts are not 
uploaded to GitHub releases
 Key: ARROW-6995
 URL: https://issues.apache.org/jira/browse/ARROW-6995
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


The artifacts should be uploaded under the appropriate tag: 
https://github.com/ursa-labs/crossbow/releases/tag/ursabot-289-azure-conda-win-vs2015-py37

Most likely the artifacts are produced in a different directory than 
before, so the upload script cannot find them: 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=2180



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable

2019-10-25 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6994:
---

 Summary: [C++] Research jemalloc memory page reclamation 
configuration on macOS when background_thread option is unavailable
 Key: ARROW-6994
 URL: https://issues.apache.org/jira/browse/ARROW-6994
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


In ARROW-6977, this was disabled on macOS, but that potentially reintroduces the 
negative performance and memory implications that ARROW-6910 was intended to 
fix.
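
For anyone experimenting with this, the reclamation behaviour can be tuned from 
Python when the jemalloc pool is available; a hedged sketch 
(pyarrow.jemalloc_set_decay_ms controls how eagerly dirty pages are returned to 
the OS):

{code:python}
import pyarrow as pa

try:
    # 0 ms decay: return dirty pages to the OS as eagerly as possible.
    # Only works when pyarrow is built with the jemalloc memory pool.
    pa.jemalloc_set_decay_ms(0)
    pool = pa.jemalloc_memory_pool()
    print("jemalloc pool, currently allocated:", pool.bytes_allocated())
except NotImplementedError:
    print("this pyarrow build does not include jemalloc")
{code}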



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6951) [C++][Dataset] Ensure column projection is passed to ParquetDataFragment

2019-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6951:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Ensure column projection is passed to ParquetDataFragment
> 
>
> Key: ARROW-6951
> URL: https://issues.apache.org/jira/browse/ARROW-6951
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-10-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959827#comment-16959827
 ] 

Wes McKinney commented on ARROW-300:


There are some ongoing discussions at 
https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably the only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?
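
A minimal sketch of how such an option could look from Python, modelled on the 
IpcWriteOptions(compression=...) API that later pyarrow releases expose, with 
the lz4 codec:

{code:python}
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"x": list(range(1000))})

# Compression is a write-time option; readers decompress transparently.
options = ipc.IpcWriteOptions(compression="lz4")
with pa.OSFile("compressed.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema, options=options) as writer:
        writer.write_table(table)
{code}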



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959816#comment-16959816
 ] 

Wes McKinney commented on ARROW-6985:
-

Really wide tables likely cause heap fragmentation, so degraded allocator 
performance is a likely culprit, but there could be something else going on.
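
One way to separate Arrow's own allocations from allocator-level growth is to 
log pyarrow's pool accounting next to the process RSS around each read; a rough 
sketch (psutil is an extra third-party dependency, and the file name comes from 
the reproduction quoted below):

{code:python}
import time

import pandas as pd
import psutil  # third-party, used only to read the process RSS
import pyarrow as pa

proc = psutil.Process()
for i in range(10):
    start = time.time()
    df = pd.read_parquet("skinny_matrix.pq")
    elapsed = time.time() - start
    # total_allocated_bytes() tracks Arrow's pool; RSS also includes memory
    # the allocator has not returned to the OS (e.g. due to fragmentation).
    print(f"iter={i} read={elapsed:.2f}s "
          f"arrow={pa.total_allocated_bytes() / 2**20:.1f} MiB "
          f"rss={proc.memory_info().rss / 2**20:.1f} MiB")
{code}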

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, 
> image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png
>
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959779#comment-16959779
 ] 

Casey commented on ARROW-6985:
--

Okay, looks like the wide matrix case was explained in the ticket you linked. As 
for the loop slowdown, I'm seeing it gradually increase over time, depending on 
the data's shape.

Below are my results for a wide matrix, the wide matrix transposed, and the 
matrix unraveled as a column. For the number of loops I tried, I see about a 2x 
slowdown in the wide and wide-transposed cases, though the trend of the line 
indicates it will keep growing. Is this expected?

!image-2019-10-25-14-52-46-165.png|width=479,height=273!

!image-2019-10-25-14-53-37-623.png|width=483,height=253!

!image-2019-10-25-14-54-32-583.png|width=462,height=246!

 

 

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, 
> image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png
>
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-54-32-583.png

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, 
> image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png
>
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-53-37-623.png

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, 
> image-2019-10-25-14-53-37-623.png
>
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Casey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Casey updated ARROW-6985:
-
Attachment: image-2019-10-25-14-52-46-165.png

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png
>
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6952) [C++][Dataset] Ensure expression filter is passed ParquetDataFragment

2019-10-25 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6952:
-

Assignee: Francois Saint-Jacques  (was: Ben Kietzman)

> [C++][Dataset] Ensure expression filter is passed ParquetDataFragment
> -
>
> Key: ARROW-6952
> URL: https://issues.apache.org/jira/browse/ARROW-6952
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Dataset
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> We should be able to prune RowGroups based on the expression and the 
> statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread Gawain BOLTON (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gawain BOLTON updated ARROW-6992:
-
Priority: Minor  (was: Major)

> [C++]: Undefined Behavior sanitizer build option fails with GCC
> ---
>
> Key: ARROW-6992
> URL: https://issues.apache.org/jira/browse/ARROW-6992
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Gawain BOLTON
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When building with the "undefined behaviour sanitizer" option 
> (-DARROW_USE_UBSAN=ON), compilation fails with GCC:
> {noformat}
> c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
> ‘function’{noformat}
> It appears that GCC has never had a "-fsanitize=function" option.
> I have fixed this issue and will submit a PR. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread Gawain BOLTON (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gawain BOLTON updated ARROW-6992:
-
Issue Type: Bug  (was: Improvement)

> [C++]: Undefined Behavior sanitizer build option fails with GCC
> ---
>
> Key: ARROW-6992
> URL: https://issues.apache.org/jira/browse/ARROW-6992
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When building with the "undefined behaviour sanitizer" option 
> (-DARROW_USE_UBSAN=ON), compilation fails with GCC:
> {noformat}
> c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
> ‘function’{noformat}
> It appears that GCC has never had a "-fsanitize=function" option.
> I have fixed this issue and will submit a PR. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6966) [Go] 32bit memset is null

2019-10-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6966.
-
Resolution: Fixed

Issue resolved by pull request 5714
[https://github.com/apache/arrow/pull/5714]

> [Go] 32bit memset is null
> -
>
> Key: ARROW-6966
> URL: https://issues.apache.org/jira/browse/ARROW-6966
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Jonathan A Sternberg
>Assignee: Jonathan A Sternberg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If you use a function that calls `memset.Set`, the implementation on a 32 bit 
> machine seems to be unset. This happened in our 32 bit build here:
> [https://circleci.com/gh/influxdata/influxdb/66112#tests/containers/2]
> {code}
> goroutine 66 [running]:
> testing.tRunner.func1(0x9e1f2c0)
>     /usr/local/go/src/testing/testing.go:830 +0x30e
> panic(0x899cb40, 0x9403c40)
>     /usr/local/go/src/runtime/panic.go:522 +0x16e
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory.Set(...)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory/memory.go:25
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).init(0x9e44990, 0x20)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:101 +0xc7
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0x9e44990, 0x20)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102 +0x2f
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0x9e44990, 0x2)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125 +0x42
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).reserve(0x9e44990, 0x1, 0x9c52464)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:138 +0x72
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0x9e44990, 0x1)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113 +0x51
> github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow.NewInt(0x9e4a770, 0x1, 0x1, 0x0, 0x89f0360)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow/int.go:10 +0x6c
> github.com/influxdata/influxdb/storage/reads.(*floatTable).advance(0x9e42070, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:91 +0x7e
> github.com/influxdata/influxdb/storage/reads.newFloatTable(0x9e17740, 0xe521a160, 0x9e1b8c0, 0x0, 0x0, 0x1e, 0x0, 0x8c13be0, 0x9e448a0, 0x9e448d0, ...)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:47 +0x1c2
> github.com/influxdata/influxdb/storage/reads.(*filterIterator).handleRead(0x9e22840, 0x9e0d1a0, 0x8c0ce00, 0x9e48780, 0x0, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:177 +0x755
> github.com/influxdata/influxdb/storage/reads.(*filterIterator).Do(0x9e22840, 0x9e0d170, 0x9c40070, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:140 +0x138
> github.com/influxdata/influxdb/storage/reads_test.TestDuplicateKeys_ReadFilter(0x9e1f2c0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader_test.go:89 +0x1df
> testing.tRunner(0x9e1f2c0, 0x8ad44e4)
>     /usr/local/go/src/testing/testing.go:865 +0x97
> created by testing.(*T).Run
>     /usr/local/go/src/testing/testing.go:916 +0x2b2
> {code}
> I added a print statement where the memset happened to print the function that 
> was being used, and got this:
> {code}
>  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0
> {code}
> If I set {{memset}} with a default, the code that calls into this works fine.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6992:
--
Labels: pull-request-available  (was: )

> [C++]: Undefined Behavior sanitizer build option fails with GCC
> ---
>
> Key: ARROW-6992
> URL: https://issues.apache.org/jira/browse/ARROW-6992
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Gawain BOLTON
>Priority: Major
>  Labels: pull-request-available
>
> When building with the "undefined behaviour sanitizer" option 
> (-DARROW_USE_UBSAN=ON), compilation fails with GCC:
> {noformat}
> c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
> ‘function’{noformat}
> It appears that GCC has never had a "-fsanitize=function" option.
> I have fixed this issue and will submit a PR. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6989) [Python][C++] Assert is triggered when decimal type inference occurs on a value with out of range precision

2019-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6989:
--
Labels: pull-request-available  (was: )

> [Python][C++] Assert is triggered when decimal type inference occurs on a 
> value with out of range precision
> ---
>
> Key: ARROW-6989
> URL: https://issues.apache.org/jira/browse/ARROW-6989
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
>
> Example:
> pa.array([decimal.Decimal(123.234)] )
>  
> The problem is that inference.cc calls the direct constructor for decimal 
> types instead of using Make.
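
To make the failure mode concrete, a small sketch (the explicit precision and 
scale below are chosen for illustration only):

{code:python}
import decimal
import pyarrow as pa

# A Decimal built from a float carries ~50 significant digits, more than
# decimal128 can hold, so type inference hits the failing constructor.
print(decimal.Decimal(123.234))

# Workarounds: build the Decimal from a string, or pass an explicit type
# so no inference is needed.
ok = pa.array([decimal.Decimal("123.234")])
explicit = pa.array([decimal.Decimal("123.234")], type=pa.decimal128(6, 3))
print(ok.type, explicit.type)
{code}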



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6989) [Python][C++] Assert is triggered when decimal type inference occurs on a value with out of range precision

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-6989:


Assignee: Joris Van den Bossche

> [Python][C++] Assert is triggered when decimal type inference occurs on a 
> value with out of range precision
> ---
>
> Key: ARROW-6989
> URL: https://issues.apache.org/jira/browse/ARROW-6989
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Joris Van den Bossche
>Priority: Major
>
> Example:
> pa.array([decimal.Decimal(123.234)] )
>  
> The problem is that inference.cc calls the direct constructor for decimal 
> types instead of using Make.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6993) [CI] Macos SDK installation fails on Travis

2019-10-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6993:
---
Summary: [CI] Macos SDK installation fails on Travis  (was: [CI] Pass 
-allowUntrusted flag during Macos SDK installation on Travis)

> [CI]  Macos SDK installation fails on Travis
> 
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6993) [CI] Pass -allowUntrusted flag during Macos SDK installation on Travis

2019-10-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6993:
---
Issue Type: Bug  (was: Improvement)

> [CI] Pass -allowUntrusted flag during Macos SDK installation on Travis
> --
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6993) [CI] Macos SDK installation fails on Travis

2019-10-25 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-6993:
---
Description: 
See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324

Pass the -allowUntrusted flag during the installation.

  was:See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324


> [CI]  Macos SDK installation fails on Travis
> 
>
> Key: ARROW-6993
> URL: https://issues.apache.org/jira/browse/ARROW-6993
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324
> Pass the -allowUntrusted flag during the installation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6993) [CI] Pass -allowUntrusted flag during Macos SDK installation on Travis

2019-10-25 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6993:
--

 Summary: [CI] Pass -allowUntrusted flag during Macos SDK 
installation on Travis
 Key: ARROW-6993
 URL: https://issues.apache.org/jira/browse/ARROW-6993
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Krisztian Szucs


See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959653#comment-16959653
 ] 

Joris Van den Bossche commented on ARROW-6985:
--

[~CHDev93] thanks for the report. There was a performance regression for 
parquet files with many columns in 0.15.0 (see ARROW-6876; fixed on master and 
shortly to be released as 0.15.1). So that could explain at least a general 
slowdown.

How much slowdown do you see during the loop?
I ran your code and possibly see some slowdown (at most 2x), but it's a bit noisy.
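
To put a number on it from the reproduction's timings array, something like the 
ratio of the last few iterations to the first few works (illustrative only; the 
`timings` array comes from the script quoted below):

{code:python}
import numpy as np


def slowdown_ratio(timings):
    """Ratio of the mean of the last five reads to the first five."""
    timings = np.asarray(timings)
    return timings[-5:].mean() / timings[:5].mean()


# e.g. print(f"{slowdown_ratio(timings):.2f}x") with the array from the script;
# here a made-up series just to show the call:
print(f"{slowdown_ratio([1.0, 1.0, 1.1, 1.1, 1.2, 1.6, 1.7, 1.8, 1.9, 2.0]):.2f}x")
{code}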

> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.13.0, 0.14.0, 0.15.0
>Reporter: Casey
>Priority: Minor
>
> I've noticed that reading from parquet using the pandas read_parquet function is 
> taking steadily longer with each invocation. I've seen the other ticket about 
> memory usage, but I'm seeing no memory impact, just a steadily increasing read 
> time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on 
> wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6876:
-
Summary: [Python] Reading parquet file with many columns becomes slow for 
0.15.0  (was: [Python] Reading parquet file becomes really slow for 0.15.0)

> [Python] Reading parquet file with many columns becomes slow for 0.15.0
> ---
>
> Key: ARROW-6876
> URL: https://issues.apache.org/jira/browse/ARROW-6876
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: python3.7
>Reporter: Bob
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
> Attachments: image-2019-10-14-18-10-42-850.png, 
> image-2019-10-14-18-12-07-652.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Hi,
>  
> I just noticed that reading a parquet file becomes really slow after I 
> upgraded to 0.15.0 when using pandas.
>  
> Example:
> *With 0.14.1*
>  In [4]: %timeit df = pd.read_parquet(path)
>  2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> *With 0.15.0*
>  In [5]: %timeit df = pd.read_parquet(path)
>  22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  
> The file is about 15MB in size. I am testing on the same machine using the 
> same version of python and pandas.
>  
> Have you received similar complaints? What could be the issue here?
>  
> Thanks a lot.
>  
>  
> Edit1:
> Some profiling I did:
> 0.14.1:
> !image-2019-10-14-18-12-07-652.png!
>  
> 0.15.0:
> !image-2019-10-14-18-10-42-850.png!
>  
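
A hedged reproduction sketch for the many-columns case (the file name and sizes 
below are arbitrary choices, not taken from the report):

{code:python}
import time

import numpy as np
import pandas as pd

# Build a wide table: many columns, few rows (arbitrary sizes).
df = pd.DataFrame(np.random.randn(100, 10_000))
df.columns = [str(c) for c in df.columns]
df.to_parquet("wide.parquet")

start = time.time()
pd.read_parquet("wide.parquet")
print(f"read took {time.time() - start:.2f}s")
{code}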



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC

2019-10-25 Thread Gawain BOLTON (Jira)
Gawain BOLTON created ARROW-6992:


 Summary: [C++]: Undefined Behavior sanitizer build option fails 
with GCC
 Key: ARROW-6992
 URL: https://issues.apache.org/jira/browse/ARROW-6992
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Gawain BOLTON


When building with the "undefined behaviour sanitizer" option 
(-DARROW_USE_UBSAN=ON), compilation fails with GCC:
{noformat}
c++: error: unrecognized argument to ‘-fno-sanitize=’ option: 
‘function’{noformat}
It appears that GCC has never had a "-fsanitize=function" option.

I have fixed this issue and will submit a PR. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6984) [C++] Update LZ4 to 1.9.2 for CVE-2019-17543

2019-10-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959595#comment-16959595
 ] 

Antoine Pitrou commented on ARROW-6984:
---

Also we call none of the functions mentioned in 
https://nvd.nist.gov/vuln/detail/CVE-2019-17543 ({{LZ4_write32}}, 
{{LZ4_compress_destSize}}, {{LZ4_compress_fast}}).

> [C++] Update LZ4 to 1.9.2 for CVE-2019-17543
> 
>
> Key: ARROW-6984
> URL: https://issues.apache.org/jira/browse/ARROW-6984
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Sangeeth Keeriyadath
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is a reported CVE that LZ4 before 1.9.2 has a heap-based buffer 
> overflow in LZ4_write32 (more details here: 
> [https://nvd.nist.gov/vuln/detail/CVE-2019-17543] ). I see that Apache Arrow 
> uses version *v1.8.3* ( 
> [https://github.com/apache/arrow/blob/47e5ecafa72b70112a64a1174b29b9db45f803ef/cpp/thirdparty/versions.txt#L38]
>  ).
> We need to bump the LZ4 dependency to *1.9.2* to get past the 
> reported CVE. Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6984) [C++] Update LZ4 to 1.9.2 for CVE-2019-17543

2019-10-25 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959592#comment-16959592
 ] 

Antoine Pitrou commented on ARROW-6984:
---

According to https://github.com/lz4/lz4/issues/801:

> the conditions required to trigger it are not trivial. Actually, in most 
> systems, including the lz4 frame format and API, the bug is just out of reach.

> since it also requires multiple uncommon constraints on the encoder side, 
> which are out of direct control from an external actor (in contrast with the 
> payload), this bug is rarely "reachable", making it a poor exploit vector.

So this doesn't sound mandatory for 0.15.1.

> [C++] Update LZ4 to 1.9.2 for CVE-2019-17543
> 
>
> Key: ARROW-6984
> URL: https://issues.apache.org/jira/browse/ARROW-6984
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Sangeeth Keeriyadath
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> There is a reported CVE that LZ4 before 1.9.2 has a heap-based buffer 
> overflow in LZ4_write32 (more details here: 
> [https://nvd.nist.gov/vuln/detail/CVE-2019-17543] ). I see that Apache Arrow 
> uses version *v1.8.3* ( 
> [https://github.com/apache/arrow/blob/47e5ecafa72b70112a64a1174b29b9db45f803ef/cpp/thirdparty/versions.txt#L38]
>  ).
> We need to bump the LZ4 dependency to *1.9.2* to get past the 
> reported CVE. Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-6976.

Resolution: Duplicate

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, 
> pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/],
> but there seems to be memory accumulation that is exacerbated by 
> higher-arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when conversion to object dtype

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6874:
-
Summary: [Python] Memory leak in Table.to_pandas() when conversion to 
object dtype  (was: [Python] Memory leak in Table.to_pandas() when nested 
columns are present)

> [Python] Memory leak in Table.to_pandas() when conversion to object dtype
> -
>
> Key: ARROW-6874
> URL: https://issues.apache.org/jira/browse/ARROW-6874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: Operating system: Windows 10
> pyarrow installed via conda
> both python environments were identical except pyarrow: 
> python: 3.6.7
> numpy: 1.17.2
> pandas: 0.25.1
>Reporter: Sergey Mozharov
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0, 0.15.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python 
> interpreter ran out of memory.
> I narrowed the issue down to the pyarrow.Table.to_pandas() call, which 
> appears to have a memory leak in the latest version. See details below to 
> reproduce this issue.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> # create a table with one nested array column
> nested_array = pa.array([np.random.rand(1000) for i in range(500)])
> nested_array.type  # ListType(list)
> table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])
> # convert it to a pandas DataFrame in a loop to monitor memory consumption
> num_iterations = 1
> # pyarrow v0.14.1: Memory allocation does not grow during loop execution
> # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table)
> # When the table column is not nested, no memory leak is observed
> array = pa.array(np.random.rand(500 * 1000))
> table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
> # no memory leak:
> for i in range(num_iterations):
>     df = pa.Table.to_pandas(table){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959577#comment-16959577
 ] 

Joris Van den Bossche commented on ARROW-6976:
--

Yes, I can confirm it is fixed. Running your script on 0.15.0 vs master:

!pyarrow_0150.png!

!pyarrow-master.png!

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, 
> pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/],
> but there seems to be memory accumulation that is exacerbated by 
> higher-arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6976:
-
Attachment: pyarrow-master.png

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, 
> pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/],
> but there seems to be memory accumulation that is exacerbated by 
> higher-arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-6976:
-
Attachment: pyarrow_0150.png

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png, pyarrow_0150.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/],
> but there seems to be memory accumulation that is exacerbated by 
> higher-arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2019-10-25 Thread Yuan Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552
 ] 

Yuan Zhou commented on ARROW-300:
-

Hi [~wesm], thanks for providing the general idea; I'm quite interested in this 
feature. Do you happen to have any updates on the detailed proposal?

Cheers, -yuan

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably the only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet

2019-10-25 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959547#comment-16959547
 ] 

Joris Van den Bossche commented on ARROW-6976:
--

[~cottrell] thanks for the report! I _suppose_ this is the same as what was 
reported in ARROW-6874. That was about the conversion of Arrow tables to pandas 
involving object dtypes (so not the actual parquet code). That issue is fixed 
on master and will also be released shortly in 0.15.1.

> Possible memory leak in pyarrow read_parquet
> 
>
> Key: ARROW-6976
> URL: https://issues.apache.org/jira/browse/ARROW-6976
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.0
> Environment: linux ubuntu 18.04
>Reporter: david cottrell
>Priority: Critical
> Attachments: image-2019-10-23-16-17-20-739.png
>
>
>  
> Version and repro info in the gist below.
> Not sure if I'm misunderstanding something from this 
> [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/],
> but there seems to be memory accumulation that is exacerbated by 
> higher-arity objects like strings and dates (not datetimes).
>  
> I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed 
> to "fix" or lessen the problem.
>  
> [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62]
>  
> Let me know if this post should go elsewhere.
> !image-2019-10-23-16-17-20-739.png!
>  
> {code:java}
>  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6991) [Packaging][deb] Add support for Ubuntu 19.10

2019-10-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6991:
--
Labels: pull-request-available  (was: )

> [Packaging][deb] Add support for Ubuntu 19.10
> -
>
> Key: ARROW-6991
> URL: https://issues.apache.org/jira/browse/ARROW-6991
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6991) [Packaging][deb] Add support for Ubuntu 19.10

2019-10-25 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-6991:
---

 Summary: [Packaging][deb] Add support for Ubuntu 19.10
 Key: ARROW-6991
 URL: https://issues.apache.org/jira/browse/ARROW-6991
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou






--
This message was sent by Atlassian Jira
(v8.3.4#803005)