[jira] [Assigned] (ARROW-5882) [C++][Gandiva] Throw error if divisor is 0 in integer mod functions
[ https://issues.apache.org/jira/browse/ARROW-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prudhvi Porandla reassigned ARROW-5882:

    Assignee: Projjal Chanda  (was: Prudhvi Porandla)

> [C++][Gandiva] Throw error if divisor is 0 in integer mod functions
>
> Key: ARROW-5882
> URL: https://issues.apache.org/jira/browse/ARROW-5882
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Prudhvi Porandla
> Assignee: Projjal Chanda
> Priority: Minor
>
> mod_int64_int32 and mod_int64_int64 should throw an error when the divisor is 0.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
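For illustration, the behavior the issue requests can be sketched as a stand-alone function. This is a hypothetical sketch, not Gandiva's actual kernel code; the function name mirrors the kernel names mentioned above but is an assumption:

```java
// Hypothetical sketch of the requested behavior for mod_int64_int32:
// raise an error instead of crashing (or returning garbage) when the
// divisor is 0. Not Gandiva's real implementation.
public class ModWithZeroCheck {
    static long modInt64Int32(long dividend, int divisor) {
        if (divisor == 0) {
            // Gandiva would report this through its error-reporting holder;
            // a plain exception stands in for that here.
            throw new ArithmeticException("divide by zero error");
        }
        return dividend % divisor;
    }

    public static void main(String[] args) {
        System.out.println(modInt64Int32(10L, 3));  // 1
        try {
            modInt64Int32(10L, 0);
        } catch (ArithmeticException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```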
[jira] [Updated] (ARROW-7099) [C++] Disambiguate function calls in csv parser test
[ https://issues.apache.org/jira/browse/ARROW-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-7099:

    Labels: pull-request-available  (was: )

> [C++] Disambiguate function calls in csv parser test
>
> Key: ARROW-7099
> URL: https://issues.apache.org/jira/browse/ARROW-7099
> Project: Apache Arrow
> Issue Type: Task
> Components: C++
> Reporter: Prudhvi Porandla
> Assignee: Prudhvi Porandla
> Priority: Minor
> Labels: pull-request-available
>
> cpp/src/arrow/csv/parser_test.cc has calls to overloaded functions which
> cannot be disambiguated. See https://github.com/apache/arrow/pull/5727
[jira] [Updated] (ARROW-7099) [C++] Disambiguate function calls in csv parser test
[ https://issues.apache.org/jira/browse/ARROW-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prudhvi Porandla updated ARROW-7099:

    Description: cpp/src/arrow/csv/parser_test.cc has calls to overloaded functions which cannot be disambiguated. See https://github.com/apache/arrow/pull/5727

> [C++] Disambiguate function calls in csv parser test
>
> Key: ARROW-7099
> URL: https://issues.apache.org/jira/browse/ARROW-7099
> Project: Apache Arrow
> Issue Type: Task
> Components: C++
> Reporter: Prudhvi Porandla
> Assignee: Prudhvi Porandla
> Priority: Minor
[jira] [Created] (ARROW-7099) [C++] Disambiguate function calls in csv parser test
Prudhvi Porandla created ARROW-7099:

    Summary: [C++] Disambiguate function calls in csv parser test
    Key: ARROW-7099
    URL: https://issues.apache.org/jira/browse/ARROW-7099
    Project: Apache Arrow
    Issue Type: Task
    Components: C++
    Reporter: Prudhvi Porandla
    Assignee: Prudhvi Porandla
[jira] [Updated] (ARROW-7098) [Java] Improve the performance of comparing two memory blocks
[ https://issues.apache.org/jira/browse/ARROW-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-7098:

    Labels: pull-request-available  (was: )

> [Java] Improve the performance of comparing two memory blocks
>
> Key: ARROW-7098
> URL: https://issues.apache.org/jira/browse/ARROW-7098
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Minor
> Labels: pull-request-available
>
> We often use the 8-4-1 paradigm to compare two blocks of memory:
> 1. First compare by 8-byte blocks in a loop
> 2. Then compare by 4-byte blocks in a loop
> 3. Last compare by 1-byte blocks in a loop
> It can be proved that the second loop runs at most once, so we can replace
> the loop with an if statement, which saves a comparison and two jump
> operations.
> According to the discussion in
> https://github.com/apache/arrow/pull/5508#discussion_r343973982, loops can be
> expensive.
[jira] [Created] (ARROW-7098) [Java] Improve the performance of comparing two memory blocks
Liya Fan created ARROW-7098:

    Summary: [Java] Improve the performance of comparing two memory blocks
    Key: ARROW-7098
    URL: https://issues.apache.org/jira/browse/ARROW-7098
    Project: Apache Arrow
    Issue Type: Improvement
    Components: Java
    Reporter: Liya Fan
    Assignee: Liya Fan

We often use the 8-4-1 paradigm to compare two blocks of memory:

1. First compare by 8-byte blocks in a loop
2. Then compare by 4-byte blocks in a loop
3. Last compare by 1-byte blocks in a loop

It can be proved that the second loop runs at most once, so we can replace the loop with an if statement, which saves a comparison and two jump operations.

According to the discussion in https://github.com/apache/arrow/pull/5508#discussion_r343973982, loops can be expensive.
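The 8-4-1 paradigm and the proposed change can be sketched with plain byte arrays (Arrow's actual code operates on ArrowBuf memory addresses; this simplified version is only illustrative). After the 8-byte loop fewer than 8 bytes remain, so at most one 4-byte step is possible, and the second loop reduces to an if:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MemoryCompare {
    // Compare the first `length` bytes of a and b using 8-byte, then 4-byte,
    // then 1-byte steps. The 4-byte step is a single if, not a loop.
    static boolean equal(byte[] a, byte[] b, int length) {
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int i = 0;
        for (; i + 8 <= length; i += 8) {            // 8-byte blocks
            if (ba.getLong(i) != bb.getLong(i)) return false;
        }
        if (i + 4 <= length) {                        // runs at most once
            if (ba.getInt(i) != bb.getInt(i)) return false;
            i += 4;
        }
        for (; i < length; i++) {                     // remaining single bytes
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
        byte[] y = x.clone();
        System.out.println(equal(x, y, x.length));  // true
        y[12] = 99;                                 // differ in the 1-byte tail
        System.out.println(equal(x, y, x.length));  // false
    }
}
```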
[jira] [Resolved] (ARROW-6911) [Java] Provide composite comparator
[ https://issues.apache.org/jira/browse/ARROW-6911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-6911.

    Fix Version/s: 1.0.0
       Resolution: Fixed

Issue resolved by pull request 5678
[https://github.com/apache/arrow/pull/5678]

> [Java] Provide composite comparator
>
> Key: ARROW-6911
> URL: https://issues.apache.org/jira/browse/ARROW-6911
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> A composite comparator is a sub-class of VectorValueComparator that contains
> an array of inner comparators, with each comparator corresponding to one
> column for comparison. It can be used to support sort/comparison operations
> for VectorSchemaRoot/StructVector.
> The composite comparator works like this: it first uses the first internal
> comparator (for the primary sort key) to compare vector values. If it gets a
> non-zero value, we just return it; otherwise, we use the second comparator to
> break the tie, and so on, until a non-zero value is produced by some internal
> comparator or all internal comparators have been used.
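The tie-breaking logic described above can be sketched with standard `java.util.Comparator`s over row indices, with plain int arrays standing in for columns (Arrow's VectorValueComparator is not used here; this is only an illustration of the composite pattern):

```java
import java.util.Comparator;
import java.util.List;

public class CompositeComparatorSketch {
    // Build a comparator over row indices that tries each column comparator
    // in order and returns the first non-zero result.
    static Comparator<Integer> composite(List<Comparator<Integer>> inner) {
        return (left, right) -> {
            for (Comparator<Integer> c : inner) {
                int result = c.compare(left, right);
                if (result != 0) {
                    return result;   // first non-tied key decides
                }
            }
            return 0;                // tied on every key
        };
    }

    public static void main(String[] args) {
        int[] col1 = {1, 1, 2};      // primary sort key
        int[] col2 = {5, 3, 0};      // tie-breaker
        Comparator<Integer> byCol1 = (l, r) -> Integer.compare(col1[l], col1[r]);
        Comparator<Integer> byCol2 = (l, r) -> Integer.compare(col2[l], col2[r]);
        Comparator<Integer> cmp = composite(List.of(byCol1, byCol2));
        System.out.println(cmp.compare(0, 1) > 0);  // true: tie on col1, 5 > 3
        System.out.println(cmp.compare(0, 2) < 0);  // true: 1 < 2 on col1
    }
}
```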
[jira] [Resolved] (ARROW-7020) [Java] Fix the bugs when calculating vector hash code
[ https://issues.apache.org/jira/browse/ARROW-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-7020.

    Fix Version/s: 1.0.0
       Resolution: Fixed

Issue resolved by pull request 5752
[https://github.com/apache/arrow/pull/5752]

> [Java] Fix the bugs when calculating vector hash code
>
> Key: ARROW-7020
> URL: https://issues.apache.org/jira/browse/ARROW-7020
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 2h
> Remaining Estimate: 0h
>
> When calculating the hash code for a value in the vector, the validity bit
> must be taken into account.
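The point about the validity bit can be sketched with simplified arrays (not Arrow's real buffers or its hashing API): a null slot may still hold stale bytes in the value buffer, so hashing the value without consulting validity gives inconsistent results for logically equal vectors.

```java
public class ValidityAwareHash {
    // Hash one slot of a nullable long "vector": consult the validity bit
    // first so all null slots hash identically, regardless of stale data.
    static int hashCode(long[] values, boolean[] isValid, int index) {
        if (!isValid[index]) {
            return 0;                      // one shared hash code for nulls
        }
        return Long.hashCode(values[index]);
    }

    public static void main(String[] args) {
        long[] values = {42L, 99L};        // slot 1 is null but holds stale 99
        boolean[] valid = {true, false};
        System.out.println(hashCode(values, valid, 0) == Long.hashCode(42L)); // true
        System.out.println(hashCode(values, valid, 1));                        // 0
    }
}
```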
[jira] [Created] (ARROW-7097) [Rust][CI] Builds failing due to rust nightly
Wes McKinney created ARROW-7097:

    Summary: [Rust][CI] Builds failing due to rust nightly
    Key: ARROW-7097
    URL: https://issues.apache.org/jira/browse/ARROW-7097
    Project: Apache Arrow
    Issue Type: Bug
    Components: Rust
    Reporter: Wes McKinney
    Fix For: 1.0.0

See e.g. https://github.com/apache/arrow/runs/293573608 on master
[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
[ https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969762#comment-16969762 ]

Wes McKinney commented on ARROW-7083:

Note that no query engine development has been done, so that design document is simply a proposal until actual work happens.

> [C++] Determine the feasibility and build a prototype to replace
> compute/kernels with gandiva kernels
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Compute, C++ - Gandiva
> Reporter: Micah Kornfield
> Priority: Major
>
> See discussion on https://issues.apache.org/jira/browse/ARROW-7017
>
> Requirements:
> 1. No hard runtime dependency on LLVM
> 2. Ability to run without LLVM static/shared libraries.
>
> Open questions:
> 1. What dependencies does this add to the build tool chain?
[jira] [Created] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification
Wes McKinney created ARROW-7096:

    Summary: [C++] Add options structs for concatenation-with-promotion and schema unification
    Key: ARROW-7096
    URL: https://issues.apache.org/jira/browse/ARROW-7096
    Project: Apache Arrow
    Issue Type: Improvement
    Components: C++
    Reporter: Wes McKinney
    Fix For: 1.0.0

Follow up to ARROW-6625
[jira] [Commented] (ARROW-7091) [C++] Move all factories to type_fwd.h
[ https://issues.apache.org/jira/browse/ARROW-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969750#comment-16969750 ]

Wes McKinney commented on ARROW-7091:

+1

> [C++] Move all factories to type_fwd.h
>
> Key: ARROW-7091
> URL: https://issues.apache.org/jira/browse/ARROW-7091
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Antoine Pitrou
> Priority: Minor
> Fix For: 1.0.0
>
> There's no particular reason why parameter-less factories are in
> {{type_fwd.h}} but the others are in their respective implementation headers.
> By putting more factories in {{type_fwd.h}}, we may be able to avoid
> importing the heavier headers in some places.
[jira] [Closed] (ARROW-7088) [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678
[ https://issues.apache.org/jira/browse/ARROW-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-7088.

    Fix Version/s: (was: 1.0.0)
       Resolution: Duplicate

Closing in favor of PARQUET-1688

> [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678
>
> Key: ARROW-7088
> URL: https://issues.apache.org/jira/browse/ARROW-7088
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Wes McKinney
> Priority: Blocker
>
> See
> https://travis-ci.org/ursa-labs/crossbow/builds/608629511?utm_source=github_status&utm_medium=notification
>
> {code}
> /usr/bin/ccache /opt/rh/devtoolset-2/root/usr/bin/c++ -DARROW_JEMALLOC
> -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD
> -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H
> -DPARQUET_EXPORTING -DPARQUET_USE_BOOST_REGEX -Isrc -I/arrow/cpp/src
> -I/arrow/cpp/src/generated -isystem /arrow/cpp/thirdparty/flatbuffers/include
> -isystem /arrow_boost_dist/include -isystem /usr/local/include -isystem
> jemalloc_ep-prefix/src -isystem /arrow/cpp/thirdparty/hadoop/include -O3
> -DNDEBUG -Wall -Wno-attributes -msse4.2 -O3 -DNDEBUG -fPIC -std=gnu++11
> -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -MF
> src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o.d -o
> src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -c
> /arrow/cpp/src/parquet/stream_reader.cc
> In file included from /arrow/cpp/src/parquet/stream_reader.h:31:0,
>                  from /arrow/cpp/src/parquet/stream_reader.cc:18:
> /arrow/cpp/src/parquet/stream_writer.h:67:17: error: function
> ‘parquet::StreamWriter& parquet::StreamWriter::operator=(parquet::StreamWriter&&)’
> defaulted on its first declaration with an exception-specification that differs
> from the implicit declaration ‘parquet::StreamWriter&
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)’
>    StreamWriter& operator=(StreamWriter&&) noexcept = default;
>                  ^
> In file included from /arrow/cpp/src/parquet/stream_reader.cc:18:0:
> /arrow/cpp/src/parquet/stream_reader.h:61:17: error: function
> ‘parquet::StreamReader& parquet::StreamReader::operator=(parquet::StreamReader&&)’
> defaulted on its first declaration with an exception-specification that differs
> from the implicit declaration ‘parquet::StreamReader&
> parquet::StreamReader::operator=(parquet::StreamReader&&)’
>    StreamReader& operator=(StreamReader&&) noexcept = default;
> {code}
[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
[ https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969714#comment-16969714 ]

Micah Kornfield commented on ARROW-4890:

Yes. I believe it is 2GB per shard currently.

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3, Cloudera Spark 2.3.0.cloudera3
> Reporter: Abdeali Kothari
> Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
> Creating this in Arrow project as the traceback seems to suggest this is an
> issue in Arrow.
> Continuation from the conversation on
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
> File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 279, in load_stream
>   for batch in reader:
> File "pyarrow/ipc.pxi", line 265, in __iter__
> File "pyarrow/ipc.pxi", line 281, in pyarrow.lib._RecordBatchReader.read_next_batch
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> This happens as the size of the dataset I want to group on increases. Here is
> a reproducible code snippet where I can reproduce this.
> Note: My actual dataset is much larger and has many more unique IDs and is a
> valid usecase where I cannot simplify this groupby in any way. I have
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
>
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
>
> pdf1 = pd.DataFrame(
>     [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>     columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in range(429)]).reset_index()).drop('index')
>
> pdf2 = pd.DataFrame(
>     [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", "abcdefghijklmno"]],
>     columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in range(48993)]).reset_index()).drop('index')
>
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
>
> def myudf(df):
>     return df
>
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per
> executor too.
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969712#comment-16969712 ]

Micah Kornfield commented on ARROW-1644:

The code isn't really usable since it is based on the old repo and a lot of changes have been made since (and it had a performance regression). I haven't had time to work on this, but still hope to get some bandwidth in the next month or so. But if there are motivated parties, I'm happy to remove my name from the assignment.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Micah Kornfield
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
> We have many nested parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in python for other programs to
> consume.
> The schema looks like
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be
> able to load the nested parquet in pyarrow.
> Any insight about this?
> Thanks.
[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
[ https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969709#comment-16969709 ]

Micah Kornfield commented on ARROW-7083:

[~yuanzhou] that is what this Jira is about. Currently all the kernels are 100% C++ and don't use Gandiva. The question is how feasible it is to reuse Gandiva kernels in a non-JIT environment. It would be nice not to duplicate code, but in some contexts JIT isn't an option.

> [C++] Determine the feasibility and build a prototype to replace
> compute/kernels with gandiva kernels
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Compute, C++ - Gandiva
> Reporter: Micah Kornfield
> Priority: Major
>
> See discussion on https://issues.apache.org/jira/browse/ARROW-7017
>
> Requirements:
> 1. No hard runtime dependency on LLVM
> 2. Ability to run without LLVM static/shared libraries.
>
> Open questions:
> 1. What dependencies does this add to the build tool chain?
[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
[ https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969683#comment-16969683 ]

Yuan Zhou commented on ARROW-7083:

Hi [~emkornfi...@gmail.com], for the coming AQE, which kernels will Arrow use? Is it 100% C++ kernels, or a combination of C++ and Gandiva kernels? The design draft seems to combine the two:
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y

Cheers, -yuan

> [C++] Determine the feasibility and build a prototype to replace
> compute/kernels with gandiva kernels
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Compute, C++ - Gandiva
> Reporter: Micah Kornfield
> Priority: Major
>
> See discussion on https://issues.apache.org/jira/browse/ARROW-7017
>
> Requirements:
> 1. No hard runtime dependency on LLVM
> 2. Ability to run without LLVM static/shared libraries.
>
> Open questions:
> 1. What dependencies does this add to the build tool chain?
[jira] [Created] (ARROW-7095) [R] Better handling of unsupported filter expression in dplyr methods
Neal Richardson created ARROW-7095:

    Summary: [R] Better handling of unsupported filter expression in dplyr methods
    Key: ARROW-7095
    URL: https://issues.apache.org/jira/browse/ARROW-7095
    Project: Apache Arrow
    Issue Type: New Feature
    Components: R
    Reporter: Neal Richardson
    Assignee: Neal Richardson
    Fix For: 1.0.0

Followup to ARROW-6340. Consider erroring instead of calling `collect()` on a Dataset and filtering in R, or see whether there's a safer way to defer evaluation so that less data has to be pulled down to R for filtering afterwards.
[jira] [Created] (ARROW-7094) [R] Change FileSystem access in Datasets to shared_ptr
Neal Richardson created ARROW-7094:

    Summary: [R] Change FileSystem access in Datasets to shared_ptr
    Key: ARROW-7094
    URL: https://issues.apache.org/jira/browse/ARROW-7094
    Project: Apache Arrow
    Issue Type: New Feature
    Components: R
    Reporter: Neal Richardson
    Assignee: Neal Richardson
    Fix For: 1.0.0

Followup to ARROW-6340
[jira] [Updated] (ARROW-7094) [R] Change FileSystem access in Datasets to shared_ptr
[ https://issues.apache.org/jira/browse/ARROW-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-7094:

    Component/s: C++ - Dataset

> [R] Change FileSystem access in Datasets to shared_ptr
>
> Key: ARROW-7094
> URL: https://issues.apache.org/jira/browse/ARROW-7094
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++ - Dataset, R
> Reporter: Neal Richardson
> Assignee: Neal Richardson
> Priority: Major
> Fix For: 1.0.0
>
> Followup to ARROW-6340
[jira] [Created] (ARROW-7093) [R] Support creating ScalarExpressions for more data types
Neal Richardson created ARROW-7093:

    Summary: [R] Support creating ScalarExpressions for more data types
    Key: ARROW-7093
    URL: https://issues.apache.org/jira/browse/ARROW-7093
    Project: Apache Arrow
    Issue Type: New Feature
    Components: R
    Reporter: Neal Richardson
    Fix For: 1.0.0

ARROW-6340 was limited to integer/double/logical.
[jira] [Created] (ARROW-7092) [R] Add vignette for dplyr and datasets
Neal Richardson created ARROW-7092:

    Summary: [R] Add vignette for dplyr and datasets
    Key: ARROW-7092
    URL: https://issues.apache.org/jira/browse/ARROW-7092
    Project: Apache Arrow
    Issue Type: New Feature
    Components: R
    Reporter: Neal Richardson
    Assignee: Neal Richardson
    Fix For: 1.0.0

Followup to ARROW-6340
[jira] [Resolved] (ARROW-7082) [Packaging][deb] Add apache-arrow-archive-keyring
[ https://issues.apache.org/jira/browse/ARROW-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou resolved ARROW-7082.

    Fix Version/s: 1.0.0
       Resolution: Fixed

Issue resolved by pull request 5786
[https://github.com/apache/arrow/pull/5786]

> [Packaging][deb] Add apache-arrow-archive-keyring
>
> Key: ARROW-7082
> URL: https://issues.apache.org/jira/browse/ARROW-7082
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Packaging
> Reporter: Kouhei Sutou
> Assignee: Kouhei Sutou
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 3h 10m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-6792) [R] Explore roxygen2 R6 class documentation
[ https://issues.apache.org/jira/browse/ARROW-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969593#comment-16969593 ]

Neal Richardson commented on ARROW-6792:

I took a look at this in the course of writing documentation for ARROW-6340. Some observations:

* It's all or nothing. If we use the new roxygen at all, we have to update all of our existing docs. In that PR I added `r6 = FALSE` to the RoxygenNote to keep the old behavior for now.
* The first disqualifying feature I noticed is that the new R6 support doesn't like how we document several classes in the same file: it just repeats "Super classes" and "Methods" sections down the page. See https://github.com/r-lib/roxygen2/issues/961.
* Bad cross-references (reported at https://github.com/r-lib/pkgdown/issues/1177)

> [R] Explore roxygen2 R6 class documentation
>
> Key: ARROW-6792
> URL: https://issues.apache.org/jira/browse/ARROW-6792
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 1.0.0
>
> roxygen2 version 7.0 adds support for documenting R6 classes, rather than the
> ad hoc approach we've had to take without it:
> https://github.com/r-lib/roxygen2/blob/master/vignettes/rd.Rmd#L203
> Try it out and see how we like it, and consider refactoring the docs to use
> it everywhere.
[jira] [Commented] (ARROW-7084) [C++] ArrayRangeEquals should check for full type equality?
[ https://issues.apache.org/jira/browse/ARROW-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969576#comment-16969576 ]

Uwe Korn commented on ARROW-7084:

It was an oversight when fixing ARROW-2567; we should also fix ArrayRangeEquals.

> [C++] ArrayRangeEquals should check for full type equality?
>
> Key: ARROW-7084
> URL: https://issues.apache.org/jira/browse/ARROW-7084
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Micah Kornfield
> Priority: Major
>
> It looks like ArrayRangeEquals in compare.cc only checks type IDs before
> comparing actual values. This is inconsistent with ArrayEquals, which checks
> for full type equality, and also seems incorrect for cases like Decimal128.
> I presume this was an oversight when fixing ARROW-2567, but maybe it was
> intentional? [~uwe]?
[jira] [Updated] (ARROW-7062) [C++] Parquet file parse error messages should include the file name
[ https://issues.apache.org/jira/browse/ARROW-7062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-7062:

    Labels: dataset parquet pull-request-available  (was: dataset parquet)

> [C++] Parquet file parse error messages should include the file name
>
> Key: ARROW-7062
> URL: https://issues.apache.org/jira/browse/ARROW-7062
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Dataset
> Reporter: Neal Richardson
> Priority: Major
> Labels: dataset, parquet, pull-request-available
> Fix For: 1.0.0
>
> ARROW-7061 was harder to diagnose than it should have been because the error
> message was opaque and didn't tell me where to look.
[jira] [Updated] (ARROW-7074) [C++] ASSERT_OK_AND_ASSIGN crashes when failing
[ https://issues.apache.org/jira/browse/ARROW-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-7074:

    Labels: pull-request-available  (was: )

> [C++] ASSERT_OK_AND_ASSIGN crashes when failing
>
> Key: ARROW-7074
> URL: https://issues.apache.org/jira/browse/ARROW-7074
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Developer Tools
> Affects Versions: 0.15.1
> Reporter: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
>
> Instead of simply failing the test, the {{ASSERT_OK_AND_ASSIGN}} macro
> crashes when the operation failed, e.g.:
> {code}
> Value of: _st.ok()
>   Actual: false
> Expected: true
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F1106 12:53:32.882110  4698 result.cc:28] ValueOrDie called on an error: XXX
> {code}
[jira] [Created] (ARROW-7091) [C++] Move all factories to type_fwd.h
Antoine Pitrou created ARROW-7091:

    Summary: [C++] Move all factories to type_fwd.h
    Key: ARROW-7091
    URL: https://issues.apache.org/jira/browse/ARROW-7091
    Project: Apache Arrow
    Issue Type: Improvement
    Components: C++
    Affects Versions: 0.15.1
    Reporter: Antoine Pitrou
    Fix For: 1.0.0

There's no particular reason why parameter-less factories are in {{type_fwd.h}} but the others are in their respective implementation headers. By putting more factories in {{type_fwd.h}}, we may be able to avoid importing the heavier headers in some places.
[jira] [Created] (ARROW-7090) [C++] AssertFieldEqual (and friends) doesn't show metadata on failure
Antoine Pitrou created ARROW-7090:

    Summary: [C++] AssertFieldEqual (and friends) doesn't show metadata on failure
    Key: ARROW-7090
    URL: https://issues.apache.org/jira/browse/ARROW-7090
    Project: Apache Arrow
    Issue Type: Bug
    Components: C++
    Reporter: Antoine Pitrou

If two fields only differ by metadata, the error message isn't very informative:

{code}
../src/arrow/testing/gtest_util.cc:147: Failure
Failed
left field: ints: int8 not null
right field: ints: int8 not null
{code}

Perhaps {{DataType::ToString}}, {{Field::ToString}} and {{Schema::ToString}} could get an optional flag to display metadata?
[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969465#comment-16969465 ]

William Young commented on ARROW-1644:

Are there plans to merge this code? I have a use-case.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: DB Tsai
> Assignee: Micah Kornfield
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
> We have many nested parquet files generated from Apache Spark for ranking
> problems, and we would like to load them in python for other programs to
> consume.
> The schema looks like
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- show_title_id: integer (nullable = true)
>  |    |    |-- duration: double (nullable = true)
> {code}
> And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57)
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 823, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 119, in read
>     nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be
> able to load the nested parquet in pyarrow.
> Any insight about this?
> Thanks.
[jira] [Resolved] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns
[ https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3408. - Resolution: Fixed Issue resolved by pull request 5785 [https://github.com/apache/arrow/pull/5785] > [C++] Add option to CSV reader to dictionary encode individual columns or all > string / binary columns > - > > Key: ARROW-3408 > URL: https://issues.apache.org/jira/browse/ARROW-3408 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, C++ - Dataset >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, dataset, pull-request-available > Fix For: 1.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > For many datasets, dictionary encoding everything can result in drastically > lower memory usage and subsequently better performance in doing analytics > One difficulty of dictionary encoding in multithreaded conversions is that > ideally you end up with one dictionary at the end. So you have two options: > * Implement a concurrent hashing scheme -- for low cardinality dictionaries, > the overhead associated with mutex contention will not be meaningful, for > high cardinality it can be more of a problem > * Hash each chunk separately, then normalize at the end > My guess is that a crude concurrent hash table with a mutex to protect > mutations and resizes is going to outperform the latter -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969407#comment-16969407 ] François Blanchard commented on ARROW-7087: --- I will > [Python] Table Metadata disappear when we write a partitioned dataset > - > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > Fix For: 1.0.0 > > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. > > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": > >> [{"metadata": null, "field_name": "columnA", "name": "columnA", > >> "numpy_type": "object", "pandas_type": 
"unicode"}], "column_indexes": > >> []}')]) > """{code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7087: Fix Version/s: 1.0.0 > [Python] Table Metadata disappear when we write a partitioned dataset > - > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > Fix For: 1.0.0 > > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. > > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": > >> [{"metadata": null, "field_name": "columnA", "name": "columnA", > >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": > >> 
[]}')]) > """{code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969404#comment-16969404 ] Wes McKinney commented on ARROW-7087: - I would guess this relates to the table splitting logic dropping the metadata. Please feel free to submit a PR to fix > [Python] Table Metadata disappear when we write a partitioned dataset > - > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. > > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pq > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from columns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pq.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": > >> [{"metadata": null, "field_name": 
"columnA", "name": "columnA", > >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": > >> []}')]) > """{code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-7087: Summary: [Python] Table Metadata disappear when we write a partitioned dataset (was: [Pyarrow] Table Metadata disappear when we write a partitioned dataset) > [Python] Table Metadata disappear when we write a partitioned dataset > - > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. > > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": > >> [{"metadata": null, "field_name": 
"columnA", "name": "columnA", > >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": > >> []}')]) > """{code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7085) [C++][CSV] Add support for Extension type in csv reader
[ https://issues.apache.org/jira/browse/ARROW-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969403#comment-16969403 ] Wes McKinney commented on ARROW-7085: - [~fexolm] could you clarify what you need -- a custom ColumnBuilder? [~apitrou] should be able to advise you about this > [C++][CSV] Add support for Extension type in csv reader > --- > > Key: ARROW-7085 > URL: https://issues.apache.org/jira/browse/ARROW-7085 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Artem Alekseev >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969395#comment-16969395 ] Wes McKinney commented on ARROW-6820: - I think we should have suggested names, but requiring certain names seems fraught. Since Map data might come from external sources (Spark, Parquet), I don't think it would be appropriate to overwrite the field names that might be used already in those sources. > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types
[ https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969394#comment-16969394 ] Wes McKinney commented on ARROW-7017: - Thanks, yes let's discuss more there. It seems like some investigation is indeed required. I think having LLVM as a build-time dependency is more palatable than as a runtime dependency in some applications. > [C++] Refactor AddKernel to support other operations and types > -- > > Key: ARROW-7017 > URL: https://issues.apache.org/jira/browse/ARROW-7017 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute >Reporter: Francois Saint-Jacques >Priority: Major > Labels: analytics > > * Should avoid using builders (and/or NULLs) since the output shape is known > at compute time. > * Should be refactored to support other operations, e.g. Subtraction, > Multiplication. > * Should have an overflow, underflow detection mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
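The overflow-detection point in the ticket can be illustrated in plain Python against int64 bounds (a sketch of the per-element check such a kernel mode would perform; the function name is invented and this is not Arrow's implementation):

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

def checked_add_int64(a, b):
    """Add two int64 values, raising instead of silently wrapping."""
    result = a + b  # Python ints are unbounded, so this is exact...
    if result > INT64_MAX or result < INT64_MIN:
        raise OverflowError(f"int64 overflow: {a} + {b}")
    return result   # ...and anything outside the int64 range is rejected.

print(checked_add_int64(INT64_MAX - 1, 1))  # 9223372036854775807
```

A C++ kernel would instead use compiler builtins such as `__builtin_add_overflow`, but the semantics of the detection mode are the same.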
[jira] [Created] (ARROW-7089) [C++] In CMake output, list each enabled thirdparty toolchain dependency and the reason for its being enabled
Wes McKinney created ARROW-7089: --- Summary: [C++] In CMake output, list each enabled thirdparty toolchain dependency and the reason for its being enabled Key: ARROW-7089 URL: https://issues.apache.org/jira/browse/ARROW-7089 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney For example, for gtest it would say that it's enabled because ARROW_BUILD_TESTS=ON -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Blanchard updated ARROW-7087: -- Attachment: (was: Capture d’écran 2019-11-07 à 16.46.37.png) > [Pyarrow] Table Metadata disappear when we write a partitioned dataset > -- > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. > > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": > >> [{"metadata": null, "field_name": "columnA", "name": "columnA", > >> "numpy_type": "object", "pandas_type": 
"unicode"}], "column_indexes": > >> []}')]) > """{code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7088) [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678
Wes McKinney created ARROW-7088: --- Summary: [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678 Key: ARROW-7088 URL: https://issues.apache.org/jira/browse/ARROW-7088 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Wes McKinney Fix For: 1.0.0 See https://travis-ci.org/ursa-labs/crossbow/builds/608629511?utm_source=github_status_medium=notification {code} /usr/bin/ccache /opt/rh/devtoolset-2/root/usr/bin/c++ -DARROW_JEMALLOC -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H -DPARQUET_EXPORTING -DPARQUET_USE_BOOST_REGEX -Isrc -I/arrow/cpp/src -I/arrow/cpp/src/generated -isystem /arrow/cpp/thirdparty/flatbuffers/include -isystem /arrow_boost_dist/include -isystem /usr/local/include -isystem jemalloc_ep-prefix/src -isystem /arrow/cpp/thirdparty/hadoop/include -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -O3 -DNDEBUG -fPIC -std=gnu++11 -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -MF src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o.d -o src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -c /arrow/cpp/src/parquet/stream_reader.cc In file included from /arrow/cpp/src/parquet/stream_reader.h:31:0, from /arrow/cpp/src/parquet/stream_reader.cc:18: /arrow/cpp/src/parquet/stream_writer.h:67:17: error: function ‘parquet::StreamWriter& parquet::StreamWriter::operator=(parquet::StreamWriter&&)’ defaulted on its first declaration with an exception-specification that differs from the implicit declaration ‘parquet::StreamWriter& parquet::StreamWriter::operator=(parquet::StreamWriter&&)’ StreamWriter& operator=(StreamWriter&&) noexcept = default; ^ In file included from /arrow/cpp/src/parquet/stream_reader.cc:18:0: /arrow/cpp/src/parquet/stream_reader.h:61:17: error: function ‘parquet::StreamReader& parquet::StreamReader::operator=(parquet::StreamReader&&)’ defaulted on its first declaration with an 
exception-specification that differs from the implicit declaration ‘parquet::StreamReader& parquet::StreamReader::operator=(parquet::StreamReader&&)’ StreamReader& operator=(StreamReader&&) noexcept = default; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Blanchard updated ARROW-7087: -- Description: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py* When we write a table that contains metadata then metadata are replaced by pandas metadata. This happens only if we defined *partition_cols*. To be more explicit here is an example code: {code:python} from pyarrow.parquet import write_to_dataset import pyarrow as pa import pyarrow.parquet as pd columnA = pa.array(['a', 'b', 'c'], type=pa.string()) columnB = pa.array([1, 1, 2], type=pa.int32()) # Build table from collumns table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'}) print table.schema.metadata """ Metadata is set as expected >> OrderedDict([('data', 'test')]) """ # Write table in parquet format partitioned per columnB write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) # Load data from parquet files ds = pd.ParquetDataset('/path/to/test') load_table = pq.read_table(ds.pieces[0].path) print load_table.schema.metadata """ Metadata with the key `data` is missing >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": >> [{"metadata": null, "field_name": "columnA", "name": "columnA", >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')]) """{code} was: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py* When we write a table that contains metadata then metadata are replaced by pandas metadata. This happens only if we defined *partition_cols*. 
To be more explicit here is an example code: {code:python} from pyarrow.parquet import write_to_dataset import pyarrow as pa import pyarrow.parquet as pd columnA = pa.array(['a', 'b', 'c'], type=pa.string()) columnB = pa.array([1, 1, 2], type=pa.int32()) # Build table from collumns table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'}) print table.schema.metadata """ Metadata is set as expected >> OrderedDict([('data', 'test')]) """ # Write table in parquet format partitioned per columnB write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) # Load data from parquet files ds = pd.ParquetDataset('/path/to/test') load_table = pq.read_table(ds.pieces[0].path) print load_table.schema.metadata """ Metadata with the key `data` are missing >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": >> [{"metadata": null, "field_name": "columnA", "name": "columnA", >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')]) """{code} > [Pyarrow] Table Metadata disappear when we write a partitioned dataset > -- > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. 
> > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` is missing > >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": > >> "pyarrow"}, "pandas_version": "0.22.0",
[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Blanchard updated ARROW-7087: -- Description: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py* When we write a table that contains metadata then metadata are replaced by pandas metadata. This happens only if we defined *partition_cols*. To be more explicit here is an example code: {code:python} from pyarrow.parquet import write_to_dataset import pyarrow as pa import pyarrow.parquet as pd columnA = pa.array(['a', 'b', 'c'], type=pa.string()) columnB = pa.array([1, 1, 2], type=pa.int32()) # Build table from collumns table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'}) print table.schema.metadata """ Metadata is set as expected >> OrderedDict([('data', 'test')]) """ # Write table in parquet format partitioned per columnB write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) # Load data from parquet files ds = pd.ParquetDataset('/path/to/test') load_table = pq.read_table(ds.pieces[0].path) print load_table.schema.metadata """ Metadata with the key `data` are missing >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": >> [{"metadata": null, "field_name": "columnA", "name": "columnA", >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')]) """{code} was: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py* When we write a table that contains metadata then metadata are replaced by pandas metadata. This happens only if we defined *partition_cols*. 
To be more explicit here is an example code: {code:python} from pyarrow.parquet import write_to_dataset import pyarrow as pa import pyarrow.parquet as pd columnA = pa.array(['a', 'b', 'c'], type=pa.string()) columnB = pa.array([1, 1, 2], type=pa.int32()) # Build table from collumns table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'}) print table.schema.metadata ''' Metadata is set as expected >> OrderedDict([('data', 'test')]) ''' # Write table in parquet format partitioned per columnB write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) # Load data from parquet files ds = pd.ParquetDataset('/path/to/test') load_table = pq.read_table(ds.pieces[0].path) print load_table.schema.metadata ''' Metadata with the key `data` are missing >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": >> [{"metadata": null, "field_name": "columnA", "name": "columnA", >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')]) '''{code} > [Pyarrow] Table Metadata disappear when we write a partitioned dataset > -- > > Key: ARROW-7087 > URL: https://issues.apache.org/jira/browse/ARROW-7087 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.1 >Reporter: François Blanchard >Priority: Major > Attachments: Capture d’écran 2019-11-07 à 16.46.37.png > > > There is an unexpected behavior with the method > *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* > in *pyarrow/parquet.py* > When we write a table that contains metadata then metadata are replaced by > pandas metadata. This happens only if we defined *partition_cols*. 
> > To be more explicit here is an example code: > {code:python} > from pyarrow.parquet import write_to_dataset > import pyarrow as pa > import pyarrow.parquet as pd > columnA = pa.array(['a', 'b', 'c'], type=pa.string()) > columnB = pa.array([1, 1, 2], type=pa.int32()) > # Build table from collumns > table = pa.Table.from_arrays([columnA, columnB], names=['columnA', > 'columnB'], metadata={'data': 'test'}) > print table.schema.metadata > """ > Metadata is set as expected > >> OrderedDict([('data', 'test')]) > """ > # Write table in parquet format partitioned per columnB > write_to_dataset(table, '/path/to/test', partition_cols=['columnB']) > # Load data from parquet files > ds = pd.ParquetDataset('/path/to/test') > load_table = pq.read_table(ds.pieces[0].path) > print load_table.schema.metadata > """ > Metadata with the key `data` are missing > >> OrderedDict([('pandas', '{"creator": {"version":
[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset
[ https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] François Blanchard updated ARROW-7087: -- Description: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py*. When we write a table that contains metadata, the metadata is replaced by pandas metadata. This happens only if *partition_cols* is defined. To be more explicit, here is example code:
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
print(table.schema.metadata)
'''
Metadata is set as expected
>> OrderedDict([('data', 'test')])
'''

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
'''
Metadata with the key `data` is missing
>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
'''
{code}

was: There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py*. When we write a table that contains metadata, the metadata is replaced by pandas metadata. This happens only if *partition_cols* is defined. To be more explicit, here is example code:
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
print(table.schema.metadata)
```
Metadata is set as expected
>> OrderedDict([('data', 'test')])
```

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
```
Metadata with the key `data` is missing
>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
```
{code}

> [Pyarrow] Table Metadata disappears when we write a partitioned dataset
> --
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: François Blanchard
> Priority: Major
> Attachments: Capture d’écran 2019-11-07 à 16.46.37.png
>
> There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py*. When we write a table that contains metadata, the metadata is replaced by pandas metadata. This happens only if *partition_cols* is defined.
> To be more explicit, here is example code:
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
> print(table.schema.metadata)
> '''
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> '''
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> '''
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
> '''
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7087) [Pyarrow] Table Metadata disappears when we write a partitioned dataset
François Blanchard created ARROW-7087: - Summary: [Pyarrow] Table Metadata disappears when we write a partitioned dataset Key: ARROW-7087 URL: https://issues.apache.org/jira/browse/ARROW-7087 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.1 Reporter: François Blanchard Attachments: Capture d’écran 2019-11-07 à 16.46.37.png There is an unexpected behavior with the method *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]* in *pyarrow/parquet.py*. When we write a table that contains metadata, the metadata is replaced by pandas metadata. This happens only if *partition_cols* is defined. To be more explicit, here is example code:
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], metadata={'data': 'test'})
print(table.schema.metadata)
```
Metadata is set as expected
>> OrderedDict([('data', 'test')])
```

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
```
Metadata with the key `data` is missing
>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": [{"metadata": null, "field_name": "columnA", "name": "columnA", "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
```
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result
[ https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-7086: Description: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(SafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning overload, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(SafeAdd, a, b);
}
{code}
This will probably have to be a macro; otherwise the return type can be inferred, but only when the function is not overloaded.

was: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}
This will probably have to be a macro; otherwise the return type can be inferred, but only when the function is not overloaded.

> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 1.0.0
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(SafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning overload, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(SafeAdd, a, b);
> }
> {code}
> This will probably have to be a macro; otherwise the return type can be inferred, but only when the function is not overloaded.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result
[ https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-7086: Description: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}
This will probably have to be a macro; otherwise the return type can be inferred, but only when the function is not overloaded.

was: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}

> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 1.0.0
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function.
> In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(DoSafeAdd, a, b);
> }
> {code}
> This will probably have to be a macro; otherwise the return type can be inferred, but only when the function is not overloaded.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result
[ https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Kietzman updated ARROW-7086: Description: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}

was: There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return ResultInvoke(DoSafeAdd, a, b);
}
{code}

> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 1.0.0
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function.
> In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(DoSafeAdd, a, b);
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result
[ https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969288#comment-16969288 ] Ben Kietzman commented on ARROW-7086: - [~emkornfield]
> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Ben Kietzman
> Assignee: Ben Kietzman
> Priority: Major
> Fix For: 1.0.0
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return ResultInvoke(DoSafeAdd, a, b);
> }
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result
Ben Kietzman created ARROW-7086: --- Summary: [C++] Provide a wrapper for invoking factories to produce a Result Key: ARROW-7086 URL: https://issues.apache.org/jira/browse/ARROW-7086 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.15.1 Reporter: Ben Kietzman Assignee: Ben Kietzman Fix For: 1.0.0 There is a proliferation of code like:
{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}
Ideally, this should be resolved by moving the implementation of SafeAdd into the Result-returning function, then using {{Result::Value}} in the Status-returning function. In cases where this is inconvenient, it'd be helpful to have an adapter for doing this more efficiently:
{code}
Result<int> SafeAdd(int a, int b) {
  return ResultInvoke(DoSafeAdd, a, b);
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-4631) [C++] Implement serial version of sort computational kernel
[ https://issues.apache.org/jira/browse/ARROW-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Artem Alekseev reassigned ARROW-4631: - Assignee: (was: Artem Alekseev) > [C++] Implement serial version of sort computational kernel > --- > > Key: ARROW-4631 > URL: https://issues.apache.org/jira/browse/ARROW-4631 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Affects Versions: 0.13.0 >Reporter: Areg Melik-Adamyan >Priority: Major > Labels: analytics > Fix For: 1.0.0 > > > Implement serial version of sort computational kernel. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7085) [C++][CSV] Add support for Extension type in CSV reader
Artem Alekseev created ARROW-7085: - Summary: [C++][CSV] Add support for Extension type in CSV reader Key: ARROW-7085 URL: https://issues.apache.org/jira/browse/ARROW-7085 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Artem Alekseev -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
[ https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969260#comment-16969260 ] Antoine Pitrou commented on ARROW-3410: ---
> Due to previous point, ScanTask should not hold memory until consumed
Hmm... to define the blocks in a CSV file, I have to read the CSV file entirely. So if memory isn't held, then each ScanTask will have to read the CSV file a second time. This may not be a big problem (though still suboptimal - memory copies) if the CSV file stays in the filesystem cache, but what about a huge CSV file? The only reasonable way to ingest a CSV file in parallel is to do the chunking while reading the file, AFAIK.
> ScanTask are expected to be bound to a single thread and shouldn't have nested parallelism.
Why is that? It shouldn't be a problem if using the global thread pool.
> [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, C++ - Dataset
> Reporter: Wes McKinney
> Priority: Major
> Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file do not fit into memory, this can be a problem. I propose to define a streaming {{RecordBatchReader}} interface so that the record batches produced by reading can be written out immediately to a stream on disk, to be memory mapped later
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
[ https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969243#comment-16969243 ] Francois Saint-Jacques commented on ARROW-3410: --- Ideally not a RecordBatch iterator. Looking at `file_parquet.cc` is your best bet.
# CSV reader's options should be in an instance of CSVFileFormat.
# Implement `CSVFileFormat::Inspect()`; this is needed to "peek" the Schema of a file. It should be possible to limit the number of rows parsed (in the constructor of CSVFileFormat) for the inspect call.
# Implement `CSVFileFormat::ScanFile()`. This returns a ScanTaskIterator. A ScanTask is a closure that yields an Iterator.
Some expected requirements by callers (Scanner::ToTable()) of ScanFile:
* ScanFile should be fast-ish. It is used to enumerate all ScanTasks before dispatching to the thread pool. It is run serially over all fragments in a DataSource (this could change).
* Due to the previous point, a ScanTask should not hold memory until consumed (in parquet, it only holds the row_group_id). In the case of CSV, it might be that the Blocks are referenced by (offset, length) instead of a shared_ptr.
* ScanTasks are expected to be bound to a single thread and shouldn't have nested parallelism.
* No inference should be done; the user _always_ passes an explicit schema at DataSource construction time.
* Ensure that column subset projection is properly done (see InferColumnProjection in parquet). This is probably the only optimization we can make for now; there's not much we can do about predicate pushdown.
The way I foresee it being implemented is the following:
* The CSV parser divides the file into blocks in ScanFile(); each block is bound to a ScanTask. As noted, this needs to be done in a fashion that does not hold memory.
* A ScanTask parses a block and yields one or more RecordBatches.
This is very similar to the current ThreadedReader, with some differences:
* Inversion of control: it yields tasks instead of dispatching them directly.
* The Block iterator must not block and must not hold buffers.
> [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, C++ - Dataset
> Reporter: Wes McKinney
> Priority: Major
> Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file do not fit into memory, this can be a problem. I propose to define a streaming {{RecordBatchReader}} interface so that the record batches produced by reading can be written out immediately to a stream on disk, to be memory mapped later
-- This message was sent by Atlassian Jira (v8.3.4#803005)
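The closure-based design discussed in that comment can be sketched as follows. Everything here is illustrative, not Arrow's actual dataset API: the names (`BlockRef`, `ScanTask`, `ScanFile`) are hypothetical, the "file" is an in-memory string, and raw lines stand in for RecordBatches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// In-memory stand-in for a CSV file on disk.
using CsvFile = std::string;

// A block is referenced by (offset, length) only, so enumerating scan tasks
// holds no row data; buffers are materialized when a task actually runs.
struct BlockRef {
  int64_t offset;
  int64_t length;
};

// A ScanTask is a closure that parses its block only when invoked; here it
// yields raw lines as a stand-in for RecordBatches.
using ScanTask = std::function<std::vector<std::string>()>;

// "ScanFile": place newline-aligned block boundaries up front and emit one
// lazy task per block. (As the discussion notes, placing the boundaries still
// requires a pass over the file.)
std::vector<ScanTask> ScanFile(const CsvFile& file, int64_t target_block_size) {
  std::vector<ScanTask> tasks;
  const int64_t size = static_cast<int64_t>(file.size());
  int64_t begin = 0;
  while (begin < size) {
    int64_t end = std::min(begin + target_block_size, size);
    // Extend to the next newline so no row straddles two blocks.
    while (end < size && file[end - 1] != '\n') ++end;
    BlockRef ref{begin, end - begin};
    tasks.push_back([&file, ref]() {
      std::vector<std::string> rows;  // parsing happens here, on demand
      int64_t pos = ref.offset;
      const int64_t stop = ref.offset + ref.length;
      while (pos < stop) {
        auto eol = file.find('\n', static_cast<size_t>(pos));
        int64_t row_end = (eol == std::string::npos)
                              ? stop
                              : std::min<int64_t>(eol, stop);
        rows.push_back(file.substr(pos, row_end - pos));
        pos = row_end + 1;
      }
      return rows;
    });
    begin = end;
  }
  return tasks;
}
```

Each task can then be dispatched to a worker thread and consumed independently; until a task runs, only the `(offset, length)` pair is held, which is the memory behavior the comment asks for.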
[jira] [Updated] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels
[ https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Francois Saint-Jacques updated ARROW-7083: -- Description: See discussion on [https://issues.apache.org/jira/browse/ARROW-7017] Requirements: 1. No hard runtime dependency on LLVM 2. Ability to run without LLVM static/shared libraries. Open questions: 1. What dependencies does this add to the build tool chain? was: See discussion on [https://issues.apache.org/jira/browse/ARROW-7017] Requirements: 1. No hard runtime dependency on LLVM 2. Ability to run without JIT. Open questions: 1. What dependencies does this add to the build tool chain? > [C++] Determine the feasibility and build a prototype to replace > compute/kernels with gandiva kernels > - > > Key: ARROW-7083 > URL: https://issues.apache.org/jira/browse/ARROW-7083 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, C++ - Compute, C++ - Gandiva >Reporter: Micah Kornfield >Priority: Major > > See discussion on [https://issues.apache.org/jira/browse/ARROW-7017] > > Requirements: > 1. No hard runtime dependency on LLVM > 2. Ability to run without LLVM static/shared libraries. > > Open questions: > 1. What dependencies does this add to the build tool chain? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969186#comment-16969186 ] Antoine Pitrou commented on ARROW-6820: --- Names might become significant in some contexts, for example if data is converted into other formats. Regardless, the inconsistency is a bit confusing. > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
[ https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969181#comment-16969181 ] Antoine Pitrou commented on ARROW-3410: --- [~fsaintjacques] What kind of API would Datasets need from a streaming CSV reader? A RecordBatch iterator? Something else?
> [C++][Dataset] Streaming CSV reader interface for memory-constrained environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, C++ - Dataset
> Reporter: Wes McKinney
> Priority: Major
> Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file do not fit into memory, this can be a problem. I propose to define a streaming {{RecordBatchReader}} interface so that the record batches produced by reading can be written out immediately to a stream on disk, to be memory mapped later
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7084) [C++] ArrayRangeEquals should check for full type equality?
Micah Kornfield created ARROW-7084: -- Summary: [C++] ArrayRangeEquals should check for full type equality? Key: ARROW-7084 URL: https://issues.apache.org/jira/browse/ARROW-7084 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Micah Kornfield It looks like ArrayRangeEquals in compare.cc only checks type IDs before comparing actual values. This is inconsistent with ArrayEquals, which checks for full type equality, and it also seems incorrect for cases like Decimal128. I presume this was an oversight when fixing ARROW-2567, but maybe it was intentional? [~uwe]? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969024#comment-16969024 ] Micah Kornfield commented on ARROW-6820: --- I'm not sure I understand the issue. The way I read the spec, naming is not enforced. See the bolded section:
/// The names of the child fields *may* be respectively "entry", "key", and "value", *but this is not enforced*
> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Documentation, Format
> Reporter: Antoine Pitrou
> Priority: Blocker
> Fix For: 1.0.0
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is specified as having a child field "pairs", itself with children "keys" and "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map type is specified as having a child field "entry", itself with children "key" and "value".
> In the C++ implementation, a map type has a child field "entries", itself with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", itself with children "key" and "value" (by default).
-- This message was sent by Atlassian Jira (v8.3.4#803005)