[jira] [Resolved] (ARROW-6286) [GLib] Add support for LargeList type
[ https://issues.apache.org/jira/browse/ARROW-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-6286. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5710 [https://github.com/apache/arrow/pull/5710] > [GLib] Add support for LargeList type > - > > Key: ARROW-6286 > URL: https://issues.apache.org/jira/browse/ARROW-6286 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6998) Ability to read from URL for pyarrow's read_feather
[ https://issues.apache.org/jira/browse/ARROW-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960197#comment-16960197 ] Ryan McCarthy commented on ARROW-6998: -- Thanks for the feedback! I'll look into submitting a PR or issue to pandas. > Ability to read from URL for pyarrow's read_feather > --- > > Key: ARROW-6998 > URL: https://issues.apache.org/jira/browse/ARROW-6998 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Ryan McCarthy >Priority: Major > > See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for > more info. Many of the pandas `read_format()` methods allow you to supply a URL, > except for the `read_feather()` method. This would be a nice-to-have feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6986) [R] Add basic Expression class
[ https://issues.apache.org/jira/browse/ARROW-6986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6986. Resolution: Fixed Issue resolved by pull request 5730 [https://github.com/apache/arrow/pull/5730] > [R] Add basic Expression class > -- > > Key: ARROW-6986 > URL: https://issues.apache.org/jira/browse/ARROW-6986 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I started this as part of ARROW-6980 but it proved not necessary. This will > be a foundation for ARROW-6982, in addition to being useful on its own. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6998) Ability to read from URL for pyarrow's read_feather
[ https://issues.apache.org/jira/browse/ARROW-6998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960184#comment-16960184 ] Wes McKinney commented on ARROW-6998: - I think this is better as a pandas issue. We have avoided doing anything too magical in the pyarrow API. > Ability to read from URL for pyarrow's read_feather > --- > > Key: ARROW-6998 > URL: https://issues.apache.org/jira/browse/ARROW-6998 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Ryan McCarthy >Priority: Major > > See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for > more info. Many of the pandas `read_format()` methods allow you to supply a URL, > except for the `read_feather()` method. This would be a nice-to-have feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6997) [Packaging] Add support for RHEL
[ https://issues.apache.org/jira/browse/ARROW-6997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960177#comment-16960177 ] Kouhei Sutou commented on ARROW-6997: - It seems that BinTray doesn't support symbolic links. We should create a suitable {{.repo}} for each environment. > [Packaging] Add support for RHEL > > > Key: ARROW-6997 > URL: https://issues.apache.org/jira/browse/ARROW-6997 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > > We need symbolic links from {{${VERSION}Server}} to {{${VERSION}}}, such as > {{7Server}} to {{7}}. (Is that available on BinTray?) > We also need to update the install information: we can't install {{epel-release}} > with {{yum install -y epel-release}}; we need to specify the URL explicitly: {{yum > install > https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See > https://fedoraproject.org/wiki/EPEL for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6997) [Packaging] Add support for RHEL
Kouhei Sutou created ARROW-6997: --- Summary: [Packaging] Add support for RHEL Key: ARROW-6997 URL: https://issues.apache.org/jira/browse/ARROW-6997 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou We need symbolic links from {{${VERSION}Server}} to {{${VERSION}}}, such as {{7Server}} to {{7}}. (Is that available on BinTray?) We also need to update the install information: we can't install {{epel-release}} with {{yum install -y epel-release}}; we need to specify the URL explicitly: {{yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm}}. See https://fedoraproject.org/wiki/EPEL for details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6996) [Python] Expose boolean filter kernel on Table
[ https://issues.apache.org/jira/browse/ARROW-6996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960026#comment-16960026 ] Neal Richardson commented on ARROW-6996: I've implemented the C++ for this in ARROW-6784, just need to wrap that up and then you can add Python bindings. > [Python] Expose boolean filter kernel on Table > -- > > Key: ARROW-6996 > URL: https://issues.apache.org/jira/browse/ARROW-6996 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe Korn >Priority: Major > Labels: iceberg > > This is currently only implemented for Array but would also be useful on > Tables and ChunkedArrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6998) Ability to read from URL for pyarrow's read_feather
Ryan McCarthy created ARROW-6998: Summary: Ability to read from URL for pyarrow's read_feather Key: ARROW-6998 URL: https://issues.apache.org/jira/browse/ARROW-6998 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Ryan McCarthy See this [pandas issue|https://github.com/pandas-dev/pandas/issues/29055] for more info. Many of the pandas `read_format()` methods allow you to supply a URL, except for the `read_feather()` method. This would be a nice-to-have feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
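For reference, the URL support in the pandas `read_*` readers boils down to a small dispatch step before the bytes reach the format-specific reader. A stdlib-only sketch of that idea (the `open_source` helper name is illustrative, not pandas or pyarrow API):

```python
import io
import urllib.parse
import urllib.request


def open_source(path_or_url):
    """Return a binary file-like object for a local path or an http(s)/ftp URL.

    Illustrative sketch of the dispatch a pandas-style reader performs before
    handing the bytes to a format-specific reader such as read_feather.
    """
    scheme = urllib.parse.urlparse(str(path_or_url)).scheme
    if scheme in ("http", "https", "ftp"):
        # Fetch the remote bytes into memory; Feather needs a seekable source,
        # so wrap the response body in a BytesIO buffer.
        with urllib.request.urlopen(path_or_url) as resp:
            return io.BytesIO(resp.read())
    # Otherwise treat the input as a local filesystem path.
    return open(path_or_url, "rb")
```

A reader wrapper would then call `open_source(...)` and pass the resulting file-like object through unchanged, which is why adding this on the pandas side (as suggested above) needs no pyarrow changes.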
[jira] [Created] (ARROW-6996) [Python] Expose boolean filter kernel on Table
Uwe Korn created ARROW-6996: --- Summary: [Python] Expose boolean filter kernel on Table Key: ARROW-6996 URL: https://issues.apache.org/jira/browse/ARROW-6996 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Uwe Korn This is currently only implemented for Array but would also be useful on Tables and ChunkedArrays. -- This message was sent by Atlassian Jira (v8.3.4#803005)
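The row-wise semantics the Table-level kernel would expose are simple to state: keep row i of every column exactly when mask[i] is true. A pure-Python sketch of those semantics (this is not the pyarrow API; `filter_table` and its null handling are illustrative only):

```python
def filter_table(columns, mask):
    """Apply a boolean mask row-wise to a dict of equal-length columns.

    Mirrors the semantics of a boolean filter kernel: row i is kept when
    mask[i] is True. None entries in the mask drop the row here for
    simplicity (a real kernel would let you choose null-handling behaviour).
    """
    keep = [i for i, m in enumerate(mask) if m]  # row indices to retain
    return {name: [values[i] for i in keep] for name, values in columns.items()}


table = {"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]}
mask = [True, False, True, None]
filtered = filter_table(table, mask)  # rows 0 and 2 survive
```

Exposing this on Table and ChunkedArray is then mostly plumbing: apply the same mask to each column (and each chunk), which is what the C++ work referenced above provides.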
[jira] [Commented] (ARROW-6988) [CI][R] Buildbot's R Conda is failing
[ https://issues.apache.org/jira/browse/ARROW-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959967#comment-16959967 ] Krisztian Szucs commented on ARROW-6988: I've managed to reproduce it locally, but I don't have time to investigate it further right now, so I've turned it off.
> [CI][R] Buildbot's R Conda is failing
> -
>
> Key: ARROW-6988
> URL: https://issues.apache.org/jira/browse/ARROW-6988
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Continuous Integration, R
> Reporter: Francois Saint-Jacques
> Priority: Major
>
> {code:java}
> Running ‘testthat.R’ ERROR
> Running the tests in ‘tests/testthat.R’ failed.
> Last 13 lines of output:
>   25: tryCatch(withCallingHandlers({
>         eval(code, test_env)
>         if (!handled && !is.null(test)) {
>           skip_empty()
>         }
>       }, expectation = handle_expectation, skip = handle_skip,
>       warning = handle_warning, message = handle_message,
>       error = handle_error), error = handle_fatal, skip = function(e) {})
>   26: test_code(NULL, exprs, env)
>   27: source_file(path, new.env(parent = env), chdir = TRUE, wrap = wrap)
>   28: force(code)
>   29: with_reporter(reporter = reporter, start_end_reporter = start_end_reporter, {
>         reporter$start_file(basename(path))
>         lister$start_file(basename(path))
>         source_file(path, new.env(parent = env), chdir = TRUE, wrap = wrap)
>         reporter$.end_context()
>         reporter$end_file()
>       })
>   30: FUN(X[[i]], ...)
>   31: lapply(paths, test_file, env = env, reporter = current_reporter,
>       start_end_reporter = FALSE, load_helpers = FALSE, wrap = wrap)
>   32: force(code)
>   33: with_reporter(reporter = current_reporter, results <- lapply(paths,
>       test_file, env = env, reporter = current_reporter,
>       start_end_reporter = FALSE, load_helpers = FALSE, wrap = wrap))
>   34: test_files(paths, reporter = reporter, env = env, stop_on_failure = stop_on_failure,
>       stop_on_warning = stop_on_warning, wrap = wrap)
>   35: test_dir(path = test_path, reporter = reporter, env = env, filter = filter,
>       ..., stop_on_failure = stop_on_failure, stop_on_warning = stop_on_warning,
>       wrap = wrap)
>   36: test_package_dir(package = package, test_path = test_path, filter = filter,
>       reporter = reporter, ..., stop_on_failure = stop_on_failure,
>       stop_on_warning = stop_on_warning, wrap = wrap)
>   37: test_check("arrow")
> An irrecoverable exception occurred. R is aborting now ...
> Segmentation fault (core dumped)
> * checking for unstated dependencies in vignettes ... OK
> * checking package vignettes in ‘inst/doc’ ... OK
> * checking re-building of vignette outputs ... OK
> * DONE
> Status: 1 ERROR, 1 WARNING, 2 NOTEs
> See ‘/buildbot/AMD64_Conda_R/r/arrow.Rcheck/00check.log’ for details.
> {code}
> [https://ci.ursalabs.org/#/builders/95]
> [https://ci.ursalabs.org/#/builders/95/builds/2386]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6975) [C++] Put make_unique in its own header
[ https://issues.apache.org/jira/browse/ARROW-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6975: Fix Version/s: 1.0.0 > [C++] Put make_unique in its own header > --- > > Key: ARROW-6975 > URL: https://issues.apache.org/jira/browse/ARROW-6975 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Reporter: Antoine Pitrou >Priority: Major > Fix For: 1.0.0 > > > {{arrow/util/stl.h}} carries other stuff that is almost never necessary. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-5679) [Python] Drop Python 3.5 from support matrix
[ https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959951#comment-16959951 ] Wes McKinney commented on ARROW-5679: - I changed the issue scope to drop Python 3.5 altogether. NumPy is about to drop 3.5 so we should too https://numpy.org/neps/nep-0029-deprecation_policy.html > [Python] Drop Python 3.5 from support matrix > > > Key: ARROW-5679 > URL: https://issues.apache.org/jira/browse/ARROW-5679 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > We probably need to maintain Python 3.5 on Linux and macOS for the time > being, but we may want to drop it for Windows since conda-forge isn't > supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come > with extra cost -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-5679) [Python] Drop Python 3.5 from support matrix
[ https://issues.apache.org/jira/browse/ARROW-5679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5679: Summary: [Python] Drop Python 3.5 from support matrix (was: [Python] Drop Python 3.5 from Windows wheel builds) > [Python] Drop Python 3.5 from support matrix > > > Key: ARROW-5679 > URL: https://issues.apache.org/jira/browse/ARROW-5679 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > We probably need to maintain Python 3.5 on Linux and macOS for the time > being, but we may want to drop it for Windows since conda-forge isn't > supporting Python 3.5 anymore, so maintaining wheels for Python 3.5 will come > with extra cost -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6995) [Packaging][Crossbow] The windows conda artifacts are not uploaded to GitHub releases
Krisztian Szucs created ARROW-6995: -- Summary: [Packaging][Crossbow] The windows conda artifacts are not uploaded to GitHub releases Key: ARROW-6995 URL: https://issues.apache.org/jira/browse/ARROW-6995 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Krisztian Szucs The artifacts should be uploaded under the appropriate tag: https://github.com/ursa-labs/crossbow/releases/tag/ursabot-289-azure-conda-win-vs2015-py37 Most likely the artifacts are produced in a different directory than before, so the uploading script cannot find them: https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=2180 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6994) [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable
Wes McKinney created ARROW-6994: --- Summary: [C++] Research jemalloc memory page reclamation configuration on macOS when background_thread option is unavailable Key: ARROW-6994 URL: https://issues.apache.org/jira/browse/ARROW-6994 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 1.0.0 In ARROW-6977, the background_thread option was disabled on macOS, but that may reintroduce the performance and memory problems that ARROW-6910 was intended to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6951) [C++][Dataset] Ensure column projection is passed to ParquetDataFragment
[ https://issues.apache.org/jira/browse/ARROW-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6951: -- Labels: dataset pull-request-available (was: dataset) > [C++][Dataset] Ensure column projection is passed to ParquetDataFragment > > > Key: ARROW-6951 > URL: https://issues.apache.org/jira/browse/ARROW-6951 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Dataset >Reporter: Francois Saint-Jacques >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959827#comment-16959827 ] Wes McKinney commented on ARROW-300: There are some ongoing discussions at https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > If data is to be sent over the wire, it may be useful to compress the data > buffers themselves as they're being written in the file layout. > I would propose that we keep this extremely simple, with a global buffer > compression setting in the file Footer. Probably the only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
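The proposal amounts to compressing each body buffer independently and recording the codec once, globally, in the footer. A stdlib-only sketch of that scheme using zlib (lz4 would be a drop-in replacement; the function names and footer layout here are illustrative, not the Arrow IPC format):

```python
import zlib


def write_buffers(buffers, codec="zlib"):
    """Compress each buffer independently; record the codec once, footer-style."""
    compress = zlib.compress if codec == "zlib" else (lambda b: b)
    body = [compress(buf) for buf in buffers]
    # A single global codec setting, as proposed, plus the original sizes so a
    # reader can preallocate before decompressing.
    footer = {"codec": codec, "uncompressed_sizes": [len(b) for b in buffers]}
    return body, footer


def read_buffers(body, footer):
    """Invert write_buffers using the codec named in the footer."""
    decompress = zlib.decompress if footer["codec"] == "zlib" else (lambda b: b)
    return [decompress(chunk) for chunk in body]


# Sparse buffers (e.g. mostly-null validity/data) compress very well.
bufs = [bytes(4096), b"validity" * 16]
body, footer = write_buffers(bufs)
assert read_buffers(body, footer) == bufs
```

The global-setting design keeps the reader trivial: one codec lookup per file rather than per-buffer metadata, at the cost of not mixing codecs within a file.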
[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959816#comment-16959816 ] Wes McKinney commented on ARROW-6985: - Really wide tables are likely causing heap fragmentation, so degraded memory performance is a likely culprit, but there could be something else going on.
> [Python] Steadily increasing time to load file using read_parquet
> -
>
> Key: ARROW-6985
> URL: https://issues.apache.org/jira/browse/ARROW-6985
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0, 0.14.0, 0.15.0
> Reporter: Casey
> Priority: Minor
> Attachments: image-2019-10-25-14-52-46-165.png, image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png
>
> I've noticed that reading from parquet using the pandas read_parquet function is taking steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just steadily increasing read time until I restart the python session.
> Below is some code to reproduce my results. I notice it's particularly bad on wide matrices, especially using pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
>
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
>
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959779#comment-16959779 ] Casey commented on ARROW-6985: -- Okay, it looks like the wide-matrix case was explained in the ticket you linked. As for the loop slowdown, I'm seeing it gradually increase over time depending on the data's shape. Below are my results for a wide matrix, the wide matrix transposed, and the matrix unraveled as a column. Over the number of loops I tried, I see about a 2x slowdown in the wide and wide-transposed cases, though the trend of the line indicates it will continue growing. Is this expected? !image-2019-10-25-14-52-46-165.png|width=479,height=273! !image-2019-10-25-14-53-37-623.png|width=483,height=253! !image-2019-10-25-14-54-32-583.png|width=462,height=246! > [Python] Steadily increasing time to load file using read_parquet > - > > Key: ARROW-6985 > URL: https://issues.apache.org/jira/browse/ARROW-6985 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Affects Versions: 0.13.0, 0.14.0, 0.15.0 > Reporter: Casey > Priority: Minor > Attachments: image-2019-10-25-14-52-46-165.png, image-2019-10-25-14-53-37-623.png, image-2019-10-25-14-54-32-583.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Casey updated ARROW-6985: - Attachment: image-2019-10-25-14-54-32-583.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Casey updated ARROW-6985: - Attachment: image-2019-10-25-14-53-37-623.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Casey updated ARROW-6985: - Attachment: image-2019-10-25-14-52-46-165.png -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6952) [C++][Dataset] Ensure expression filter is passed ParquetDataFragment
[ https://issues.apache.org/jira/browse/ARROW-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques reassigned ARROW-6952: - Assignee: Francois Saint-Jacques (was: Ben Kietzman) > [C++][Dataset] Ensure expression filter is passed ParquetDataFragment > - > > Key: ARROW-6952 > URL: https://issues.apache.org/jira/browse/ARROW-6952 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Dataset >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Major > Labels: dataset > > We should be able to prune RowGroups based on the expression and the > statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC
[ https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gawain BOLTON updated ARROW-6992: - Priority: Minor (was: Major) > [C++]: Undefined Behavior sanitizer build option fails with GCC > --- > > Key: ARROW-6992 > URL: https://issues.apache.org/jira/browse/ARROW-6992 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Gawain BOLTON >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > When building with the "undefined behaviour sanitizer" option > (-DARROW_USE_UBSAN=ON), compilation fails with GCC: > {noformat} > c++: error: unrecognized argument to ‘-fno-sanitize=’ option: > ‘function’{noformat} > It appears that GCC has never had a "-fsanitize=function" option. > I have fixed this issue and will submit a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC
[ https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gawain BOLTON updated ARROW-6992: - Issue Type: Bug (was: Improvement) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6966) [Go] 32bit memset is null
[ https://issues.apache.org/jira/browse/ARROW-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6966. - Resolution: Fixed Issue resolved by pull request 5714 [https://github.com/apache/arrow/pull/5714]
> [Go] 32bit memset is null
> -
>
> Key: ARROW-6966
> URL: https://issues.apache.org/jira/browse/ARROW-6966
> Project: Apache Arrow
> Issue Type: Bug
> Components: Go
> Reporter: Jonathan A Sternberg
> Assignee: Jonathan A Sternberg
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> If you use a function that calls `memset.Set`, the implementation on a 32-bit machine seems to be unset. This happened in our 32-bit build here: [https://circleci.com/gh/influxdata/influxdb/66112#tests/containers/2]
> {code:java}
> goroutine 66 [running]:
> testing.tRunner.func1(0x9e1f2c0)
>     /usr/local/go/src/testing/testing.go:830 +0x30e
> panic(0x899cb40, 0x9403c40)
>     /usr/local/go/src/runtime/panic.go:522 +0x16e
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory.Set(...)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/memory/memory.go:25
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).init(0x9e44990, 0x20)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:101 +0xc7
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).init(0x9e44990, 0x20)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:102 +0x2f
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Resize(0x9e44990, 0x2)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:125 +0x42
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*builder).reserve(0x9e44990, 0x1, 0x9c52464)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/builder.go:138 +0x72
> github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array.(*Int64Builder).Reserve(0x9e44990, 0x1)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/apache/arrow/go/arrow/array/numericbuilder.gen.go:113 +0x51
> github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow.NewInt(0x9e4a770, 0x1, 0x1, 0x0, 0x89f0360)
>     /root/go/src/github.com/influxdata/influxdb/vendor/github.com/influxdata/flux/arrow/int.go:10 +0x6c
> github.com/influxdata/influxdb/storage/reads.(*floatTable).advance(0x9e42070, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:91 +0x7e
> github.com/influxdata/influxdb/storage/reads.newFloatTable(0x9e17740, 0xe521a160, 0x9e1b8c0, 0x0, 0x0, 0x1e, 0x0, 0x8c13be0, 0x9e448a0, 0x9e448d0, ...)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/table.gen.go:47 +0x1c2
> github.com/influxdata/influxdb/storage/reads.(*filterIterator).handleRead(0x9e22840, 0x9e0d1a0, 0x8c0ce00, 0x9e48780, 0x0, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:177 +0x755
> github.com/influxdata/influxdb/storage/reads.(*filterIterator).Do(0x9e22840, 0x9e0d170, 0x9c40070, 0x0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader.go:140 +0x138
> github.com/influxdata/influxdb/storage/reads_test.TestDuplicateKeys_ReadFilter(0x9e1f2c0)
>     /root/go/src/github.com/influxdata/influxdb/storage/reads/reader_test.go:89 +0x1df
> testing.tRunner(0x9e1f2c0, 0x8ad44e4)
>     /usr/local/go/src/testing/testing.go:865 +0x97
> created by testing.(*T).Run
>     /usr/local/go/src/testing/testing.go:916 +0x2b2
> {code}
> I added a print statement where the memset happened, to print the function that was being used, and got this:
> {code}
> [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] 0
> {code}
> If I set {{memset}} with a default, the code that calls into this works fine.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC
[ https://issues.apache.org/jira/browse/ARROW-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6992: -- Labels: pull-request-available (was: ) > [C++]: Undefined Behavior sanitizer build option fails with GCC > --- > > Key: ARROW-6992 > URL: https://issues.apache.org/jira/browse/ARROW-6992 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Gawain BOLTON >Priority: Major > Labels: pull-request-available > > When building with the "undefined behaviour sanitizer" option > (-DARROW_USE_UBSAN=ON), compilation fails with GCC: > {noformat} > c++: error: unrecognized argument to ‘-fno-sanitize=’ option: > ‘function’{noformat} > It appears that GCC has never had a "-fsanitize=function" option. > I have fixed this issue and will submit a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6989) [Python][C++] Assert is triggered when decimal type inference occurs on a value with out of range precision
[ https://issues.apache.org/jira/browse/ARROW-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6989: -- Labels: pull-request-available (was: ) > [Python][C++] Assert is triggered when decimal type inference occurs on a > value with out of range precision > --- > > Key: ARROW-6989 > URL: https://issues.apache.org/jira/browse/ARROW-6989 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > > Example: > pa.array([decimal.Decimal(123.234)] ) > > The problem is that inference.cc calls the direct constructor for decimal > types instead of using Make. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6989) [Python][C++] Assert is triggered when decimal type inference occurs on a value with out of range precision
[ https://issues.apache.org/jira/browse/ARROW-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-6989: Assignee: Joris Van den Bossche > [Python][C++] Assert is triggered when decimal type inference occurs on a > value with out of range precision > --- > > Key: ARROW-6989 > URL: https://issues.apache.org/jira/browse/ARROW-6989 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Micah Kornfield >Assignee: Joris Van den Bossche >Priority: Major > > Example: > pa.array([decimal.Decimal(123.234)] ) > > The problem is that inference.cc calls the direct constructor for decimal > types instead of using Make. -- This message was sent by Atlassian Jira (v8.3.4#803005)
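A stdlib-only illustration of why the inference in ARROW-6989 hits an out-of-range precision: `decimal.Decimal` applied to a Python *float* captures the exact binary value of the double, whose decimal expansion has far more significant digits than the 38 that a decimal128 value can represent (the 38-digit cap comes from the Arrow format; pyarrow itself is not needed to see the overflow):

```python
import decimal

# Decimal(float) expands the nearest double exactly, not the 6 significant
# digits a human reads from "123.234".
d = decimal.Decimal(123.234)
digits = len(d.as_tuple().digits)
assert digits > 38  # exceeds decimal128's maximum precision of 38

# Constructing from a string keeps the intended precision instead:
ok = decimal.Decimal("123.234")
assert len(ok.as_tuple().digits) == 6
```

Passing the value as a string is the usual workaround on the user side; the fix tracked here is for type inference to go through the validating `Make` factory so that out-of-range precision raises an error rather than tripping an assert.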
[jira] [Updated] (ARROW-6993) [CI] Macos SDK installation fails on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-6993: --- Summary: [CI] Macos SDK installation fails on Travis (was: [CI] Pass -allowUntrasted flag during Macos SDK installation on Travis) > [CI] Macos SDK installation fails on Travis > > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6993) [CI] Pass -allowUntrasted flag during Macos SDK installation on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-6993: --- Issue Type: Bug (was: Improvement) > [CI] Pass -allowUntrasted flag during Macos SDK installation on Travis > -- > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6993) [CI] Macos SDK installation fails on Travis
[ https://issues.apache.org/jira/browse/ARROW-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs updated ARROW-6993: --- Description: See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 Pass -allowUntrasted flag during the installation. was:See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 > [CI] Macos SDK installation fails on Travis > > > Key: ARROW-6993 > URL: https://issues.apache.org/jira/browse/ARROW-6993 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 > Pass -allowUntrasted flag during the installation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6993) [CI] Pass -allowUntrasted flag during Macos SDK installation on Travis
Krisztian Szucs created ARROW-6993: -- Summary: [CI] Pass -allowUntrasted flag during Macos SDK installation on Travis Key: ARROW-6993 URL: https://issues.apache.org/jira/browse/ARROW-6993 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Krisztian Szucs See the failing build at https://travis-ci.org/apache/arrow/jobs/602560324 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6985) [Python] Steadily increasing time to load file using read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959653#comment-16959653 ] Joris Van den Bossche commented on ARROW-6985: -- [~CHDev93] thanks for the report. There was a performance regression regarding parquet files with many columns in 0.15.0 (see ARROW-6876, fixed on master and will shortly be released as 0.15.1). So that could explain at least a general slowdown. How much do you see it slow down during the loop? I ran your code and possibly see some slowdown (max 2x), but it's a bit noisy. > [Python] Steadily increasing time to load file using read_parquet > - > > Key: ARROW-6985 > URL: https://issues.apache.org/jira/browse/ARROW-6985 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.13.0, 0.14.0, 0.15.0 >Reporter: Casey >Priority: Minor > > I've noticed that reading from parquet using pandas read_parquet function is > taking steadily longer with each invocation. I've seen the other ticket about > memory usage but I'm seeing no memory impact, just steadily increasing read > time until I restart the python session. > Below is some code to reproduce my results. I notice it's particularly bad on > wide matrices, especially using pyarrow==0.15.0 > {code:python} > import pyarrow.parquet as pq > import pyarrow as pa > import pandas as pd > import os > import numpy as np > import time > file = "skinny_matrix.pq" > if not os.path.isfile(file): > mat = np.zeros((6000, 26000)) > mat.ravel()[::100] = np.random.randn(60 * 26000) > df = pd.DataFrame(mat.T) > table = pa.Table.from_pandas(df) > pq.write_table(table, file) > n_timings = 50 > timings = np.empty(n_timings) > for i in range(n_timings): > start = time.time() > new_df = pd.read_parquet(file) > end = time.time() > timings[i] = end - start > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6876) [Python] Reading parquet file with many columns becomes slow for 0.15.0
[ https://issues.apache.org/jira/browse/ARROW-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6876: - Summary: [Python] Reading parquet file with many columns becomes slow for 0.15.0 (was: [Python] Reading parquet file becomes really slow for 0.15.0) > [Python] Reading parquet file with many columns becomes slow for 0.15.0 > --- > > Key: ARROW-6876 > URL: https://issues.apache.org/jira/browse/ARROW-6876 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: python3.7 >Reporter: Bob >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Attachments: image-2019-10-14-18-10-42-850.png, > image-2019-10-14-18-12-07-652.png > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Hi, > > I just noticed that reading a parquet file becomes really slow after I > upgraded to 0.15.0 when using pandas. > > Example: > *With 0.14.1* > In [4]: %timeit df = pd.read_parquet(path) > 2.02 s ± 47.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > *With 0.15.0* > In [5]: %timeit df = pd.read_parquet(path) > 22.9 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) > > The file is about 15MB in size. I am testing on the same machine using the > same version of python and pandas. > > Have you received similar complaints? What could be the issue here? > > Thanks a lot. > > > Edit1: > Some profiling I did: > 0.14.1: > !image-2019-10-14-18-12-07-652.png! > > 0.15.0: > !image-2019-10-14-18-10-42-850.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6992) [C++]: Undefined Behavior sanitizer build option fails with GCC
Gawain BOLTON created ARROW-6992: Summary: [C++]: Undefined Behavior sanitizer build option fails with GCC Key: ARROW-6992 URL: https://issues.apache.org/jira/browse/ARROW-6992 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Gawain BOLTON When building with the "undefined behaviour sanitizer" option (-DARROW_USE_UBSAN=ON), compilation fails with GCC: {noformat} c++: error: unrecognized argument to ‘-fno-sanitize=’ option: ‘function’{noformat} It appears that GCC has never had a "-fsanitize=function" option. I have fixed this issue and will submit a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6984) [C++] Update LZ4 to 1.9.2 for CVE-2019-17543
[ https://issues.apache.org/jira/browse/ARROW-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959595#comment-16959595 ] Antoine Pitrou commented on ARROW-6984: --- Also we call none of the functions mentioned in https://nvd.nist.gov/vuln/detail/CVE-2019-17543 ({{LZ4_write32}}, {{LZ4_compress_destSize}}, {{LZ4_compress_fast}}). > [C++] Update LZ4 to 1.9.2 for CVE-2019-17543 > > > Key: ARROW-6984 > URL: https://issues.apache.org/jira/browse/ARROW-6984 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.15.0 >Reporter: Sangeeth Keeriyadath >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > There is a reported CVE that LZ4 before 1.9.2 has a heap-based buffer > overflow in LZ4_write32 (More details in here - > [https://nvd.nist.gov/vuln/detail/CVE-2019-17543] ). I see that Apache Arrow > uses *v1.8.3* version ( > [https://github.com/apache/arrow/blob/47e5ecafa72b70112a64a1174b29b9db45f803ef/cpp/thirdparty/versions.txt#L38] > ). > We need to bump up the dependency version of LZ4 to *1.9.2* to get past the > reported CVE. Thank you! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6984) [C++] Update LZ4 to 1.9.2 for CVE-2019-17543
[ https://issues.apache.org/jira/browse/ARROW-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959592#comment-16959592 ] Antoine Pitrou commented on ARROW-6984: --- According to https://github.com/lz4/lz4/issues/801: > the conditions required to trigger it are not trivial. Actually, in most > systems, including the lz4 frame format and API, the bug is just out of reach. > since it also requires multiple uncommon constraints on the encoder side, > which are out of direct control from an external actor (in contrast with the > payload), this bug is rarely "reachable", making it a poor exploit vector. So this doesn't sound mandatory for 0.15.1. > [C++] Update LZ4 to 1.9.2 for CVE-2019-17543 > > > Key: ARROW-6984 > URL: https://issues.apache.org/jira/browse/ARROW-6984 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.15.0 >Reporter: Sangeeth Keeriyadath >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > There is a reported CVE that LZ4 before 1.9.2 has a heap-based buffer > overflow in LZ4_write32 (More details in here - > [https://nvd.nist.gov/vuln/detail/CVE-2019-17543] ). I see that Apache Arrow > uses *v1.8.3* version ( > [https://github.com/apache/arrow/blob/47e5ecafa72b70112a64a1174b29b9db45f803ef/cpp/thirdparty/versions.txt#L38] > ). > We need to bump up the dependency version of LZ4 to *1.9.2* to get past the > reported CVE. Thank you! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-6976. Resolution: Duplicate > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, > pyarrow_0150.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation when that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > > {code:java} > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6874) [Python] Memory leak in Table.to_pandas() when conversion to object dtype
[ https://issues.apache.org/jira/browse/ARROW-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6874: - Summary: [Python] Memory leak in Table.to_pandas() when conversion to object dtype (was: [Python] Memory leak in Table.to_pandas() when nested columns are present) > [Python] Memory leak in Table.to_pandas() when conversion to object dtype > - > > Key: ARROW-6874 > URL: https://issues.apache.org/jira/browse/ARROW-6874 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: Operating system: Windows 10 > pyarrow installed via conda > both python environments were identical except pyarrow: > python: 3.6.7 > numpy: 1.17.2 > pandas: 0.25.1 >Reporter: Sergey Mozharov >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0, 0.15.1 > > Time Spent: 1.5h > Remaining Estimate: 0h > > I upgraded from pyarrow 0.14.1 to 0.15.0 and during some testing my python > interpreter ran out of memory. > I narrowed the issue down to the pyarrow.Table.to_pandas() call, which > appears to have a memory leak in the latest version. See details below to > reproduce this issue. 
> > {code:java} > import numpy as np > import pandas as pd > import pyarrow as pa > # create a table with one nested array column > nested_array = pa.array([np.random.rand(1000) for i in range(500)]) > nested_array.type # ListType(list) > table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays']) > # convert it to a pandas DataFrame in a loop to monitor memory consumption > num_iterations = 1 > # pyarrow v0.14.1: Memory allocation does not grow during loop execution > # pyarrow v0.15.0: ~550 Mb is added to RAM, never garbage collected > for i in range(num_iterations): > df = pa.Table.to_pandas(table) > # When the table column is not nested, no memory leak is observed > array = pa.array(np.random.rand(500 * 1000)) > table = pa.Table.from_arrays(arrays=[array], names=['numbers']) > # no memory leak: > for i in range(num_iterations): > df = pa.Table.to_pandas(table){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959577#comment-16959577 ] Joris Van den Bossche commented on ARROW-6976: -- Yes, I can confirm it is fixed. Running your script on 0.15.0 vs master: !pyarrow_0150.png! !pyarrow-master.png! > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, > pyarrow_0150.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation when that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > > {code:java} > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6976: - Attachment: pyarrow-master.png > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png, pyarrow-master.png, > pyarrow_0150.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation when that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > > {code:java} > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-6976: - Attachment: pyarrow_0150.png > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png, pyarrow_0150.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation when that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > > {code:java} > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959552#comment-16959552 ] Yuan Zhou commented on ARROW-300: - Hi [~wesm], thanks for providing the general idea, I'm quite interested in this feature. Do you happen to have some updates on the detailed proposal? Cheers, -yuan > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as they're being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
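For readers following the ARROW-300 thread, the idea being proposed can be sketched as: one codec chosen globally (recorded once, e.g. in the file Footer) and each body buffer compressed independently. The helper names and layout below are hypothetical illustrations of that trade-off using stdlib zlib, not the actual Arrow IPC format:

```python
import zlib

# Hypothetical sketch: compress every body buffer with a single globally
# chosen codec. Real Arrow IPC framing (Footer metadata, alignment, codec
# identifiers) is omitted.
def compress_buffers(buffers, level=6):
    return [zlib.compress(buf, level) for buf in buffers]

def decompress_buffers(blobs):
    return [zlib.decompress(b) for b in blobs]

# Repetitive buffers (zeroed validity bitmaps, sparse columns) shrink well,
# which is the motivation for compressing before sending over the wire.
raw = [bytes(1024), b"abc" * 1000]
packed = compress_buffers(raw)
assert decompress_buffers(packed) == raw
assert sum(len(b) for b in packed) < sum(len(b) for b in raw)
```

The zlib-vs-lz4 choice mentioned in the issue is exactly this knob: zlib trades CPU for a better ratio, lz4 the reverse.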
[jira] [Commented] (ARROW-6976) Possible memory leak in pyarrow read_parquet
[ https://issues.apache.org/jira/browse/ARROW-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959547#comment-16959547 ] Joris Van den Bossche commented on ARROW-6976: -- [~cottrell] thanks for the report! I _suppose_ this is the same as what was reported in ARROW-6874. That was about the conversion of Arrow tables to pandas involving object dtypes (so not the actual parquet code). That issue is fixed on master, and will also be shortly released in 0.15.1 > Possible memory leak in pyarrow read_parquet > > > Key: ARROW-6976 > URL: https://issues.apache.org/jira/browse/ARROW-6976 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.0 > Environment: linux ubuntu 18.04 >Reporter: david cottrell >Priority: Critical > Attachments: image-2019-10-23-16-17-20-739.png > > > > Version and repro info in the gist below. > Not sure if I'm not understanding something from this > [https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/] > but there seems to be memory accumulation when that is exacerbated with > higher arity objects like strings and dates (not datetimes). > > I was not able to reproduce the issue on MacOS. Downgrading to 0.14.1 seemed > to "fix" or lessen the problem. > > [https://gist.github.com/cottrell/a3f95aa59408d87f925ec606d8783e62] > > Let me know if this post should go elsewhere. > !image-2019-10-23-16-17-20-739.png! > > {code:java} > > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-6991) [Packaging][deb] Add support for Ubuntu 19.10
[ https://issues.apache.org/jira/browse/ARROW-6991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6991: -- Labels: pull-request-available (was: ) > [Packaging][deb] Add support for Ubuntu 19.10 > - > > Key: ARROW-6991 > URL: https://issues.apache.org/jira/browse/ARROW-6991 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6991) [Packaging][deb] Add support for Ubuntu 19.10
Kouhei Sutou created ARROW-6991: --- Summary: [Packaging][deb] Add support for Ubuntu 19.10 Key: ARROW-6991 URL: https://issues.apache.org/jira/browse/ARROW-6991 Project: Apache Arrow Issue Type: Improvement Components: Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou -- This message was sent by Atlassian Jira (v8.3.4#803005)