[jira] [Assigned] (ARROW-5882) [C++][Gandiva] Throw error if divisor is 0 in integer mod functions

2019-11-07 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla reassigned ARROW-5882:
---

Assignee: Projjal Chanda  (was: Prudhvi Porandla)

> [C++][Gandiva] Throw error if divisor is 0 in integer mod functions 
> 
>
> Key: ARROW-5882
> URL: https://issues.apache.org/jira/browse/ARROW-5882
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Prudhvi Porandla
>Assignee: Projjal Chanda
>Priority: Minor
>
> mod_int64_int32, mod_int64_int64 should throw an error when divisor is 0
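
For illustration, a minimal sketch of the guard being requested. The function names mirror the ones above, but the error message is assumed, and the real Gandiva precompiled functions report errors through their execution context rather than by throwing:

{code:cpp}
#include <cstdint>
#include <stdexcept>

// Sketch only: the real kernels signal the error via Gandiva's execution
// context instead of a C++ exception.
int64_t mod_int64_int32(int64_t dividend, int32_t divisor) {
  if (divisor == 0) {
    throw std::runtime_error("divide by zero error");  // assumed message
  }
  return dividend % divisor;
}

int64_t mod_int64_int64(int64_t dividend, int64_t divisor) {
  if (divisor == 0) {
    throw std::runtime_error("divide by zero error");
  }
  return dividend % divisor;
}
{code}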



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7099) [C++] Disambiguate function calls in csv parser test

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7099:
--
Labels: pull-request-available  (was: )

> [C++] Disambiguate function calls in csv parser test
> 
>
> Key: ARROW-7099
> URL: https://issues.apache.org/jira/browse/ARROW-7099
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>  Labels: pull-request-available
>
> cpp/src/arrow/csv/parser_test.cc has calls to overloaded functions which 
> cannot be disambiguated. see https://github.com/apache/arrow/pull/5727
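
As a generic illustration of the kind of ambiguity involved and of how an explicit cast resolves it (the overload set below is hypothetical; the actual helpers in parser_test.cc differ, see the pull request above):

{code:cpp}
#include <cstdint>
#include <iostream>

// Hypothetical overload set, not the actual helpers in parser_test.cc.
void CheckSize(uint32_t n) { std::cout << "u32: " << n << "\n"; }
void CheckSize(uint64_t n) { std::cout << "u64: " << n << "\n"; }

int main() {
  int rows = 42;
  // CheckSize(rows);  // ambiguous: int converts equally well to both overloads
  CheckSize(static_cast<uint64_t>(rows));  // an explicit cast disambiguates
  return 0;
}
{code}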



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7099) [C++] Disambiguate function calls in csv parser test

2019-11-07 Thread Prudhvi Porandla (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prudhvi Porandla updated ARROW-7099:

Description: cpp/src/arrow/csv/parser_test.cc has calls to overloaded 
functions which cannot be disambiguated; see 
https://github.com/apache/arrow/pull/5727


> [C++] Disambiguate function calls in csv parser test
> 
>
> Key: ARROW-7099
> URL: https://issues.apache.org/jira/browse/ARROW-7099
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Prudhvi Porandla
>Assignee: Prudhvi Porandla
>Priority: Minor
>
> cpp/src/arrow/csv/parser_test.cc has calls to overloaded functions which 
> cannot be disambiguated. see https://github.com/apache/arrow/pull/5727



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7099) [C++] Disambiguate function calls in csv parser test

2019-11-07 Thread Prudhvi Porandla (Jira)
Prudhvi Porandla created ARROW-7099:
---

 Summary: [C++] Disambiguate function calls in csv parser test
 Key: ARROW-7099
 URL: https://issues.apache.org/jira/browse/ARROW-7099
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7098) [Java] Improve the performance of comparing two memory blocks

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7098:
--
Labels: pull-request-available  (was: )

> [Java] Improve the performance of comparing two memory blocks
> -
>
> Key: ARROW-7098
> URL: https://issues.apache.org/jira/browse/ARROW-7098
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>
> We often use the 8-4-1 paradigm to compare two blocks of memory:
> 1. First compare by 8-byte blocks in a loop
> 2. Then compare by 4-byte blocks in a loop
> 3. Last compare by 1-byte blocks in a loop
> It can be proved that the second loop runs at most once, so we can replace 
> the loop with an if statement, which saves a comparison and two jump 
> operations. 
> According to the discussion in 
> https://github.com/apache/arrow/pull/5508#discussion_r343973982, loops can be 
> expensive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7098) [Java] Improve the performance of comparing two memory blocks

2019-11-07 Thread Liya Fan (Jira)
Liya Fan created ARROW-7098:
---

 Summary: [Java] Improve the performance of comparing two memory 
blocks
 Key: ARROW-7098
 URL: https://issues.apache.org/jira/browse/ARROW-7098
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We often use the 8-4-1 paradigm to compare two blocks of memory:
1. First compare by 8-byte blocks in a loop
2. Then compare by 4-byte blocks in a loop
3. Last compare by 1-byte blocks in a loop

It can be proved that the second loop runs at most once, so we can replace the 
loop with an if statement, which saves a comparison and two jump 
operations. 

According to the discussion in 
https://github.com/apache/arrow/pull/5508#discussion_r343973982, loops can be 
expensive. 
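
As a language-agnostic illustration of the change (the actual code lives in Arrow's Java memory utilities; this C++ sketch only mirrors the idea), the 4-byte loop collapses into a single if because fewer than 8 bytes can remain after the 8-byte loop:

{code:cpp}
#include <cstddef>
#include <cstdint>
#include <cstring>

bool BlocksEqual(const uint8_t* a, const uint8_t* b, size_t length) {
  size_t i = 0;
  // 1. Compare 8 bytes at a time.
  for (; i + 8 <= length; i += 8) {
    uint64_t x, y;
    std::memcpy(&x, a + i, 8);
    std::memcpy(&y, b + i, 8);
    if (x != y) return false;
  }
  // 2. At most 7 bytes remain, so at most one 4-byte compare is possible:
  //    an if statement replaces the former loop.
  if (i + 4 <= length) {
    uint32_t x, y;
    std::memcpy(&x, a + i, 4);
    std::memcpy(&y, b + i, 4);
    if (x != y) return false;
    i += 4;
  }
  // 3. Compare the remaining 0-3 bytes one at a time.
  for (; i < length; ++i) {
    if (a[i] != b[i]) return false;
  }
  return true;
}
{code}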



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6911) [Java] Provide composite comparator

2019-11-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6911.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5678
[https://github.com/apache/arrow/pull/5678]

> [Java] Provide composite comparator
> ---
>
> Key: ARROW-6911
> URL: https://issues.apache.org/jira/browse/ARROW-6911
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A composite comparator is a sub-class of VectorValueComparator that contains 
> an array of inner comparators, with each comparator corresponding to one 
> column for comparison. It can be used to support sort/comparison operations 
> for VectorSchemaRoot/StructVector.
> The composite comparator works like this: it first uses the first internal 
> comparator (for the primary sort key) to compare vector values. If it gets a 
> non-zero value, we just return it; otherwise, we use the second comparator to 
> break the tie, and so on, until a non-zero value is produced by some internal 
> comparator, or all internal comparators have been used. 
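
A minimal sketch of that tie-breaking logic (illustrative C++; the actual class is a Java VectorValueComparator subclass and the names below are assumed):

{code:cpp}
#include <cstdint>
#include <functional>
#include <vector>

// An inner comparator compares the values at two row indices of one column.
using Comparator = std::function<int(int64_t left_index, int64_t right_index)>;

int CompositeCompare(const std::vector<Comparator>& comparators,
                     int64_t left_index, int64_t right_index) {
  for (const auto& compare : comparators) {
    // Each subsequent sort key is consulted only to break ties.
    int result = compare(left_index, right_index);
    if (result != 0) return result;
  }
  return 0;  // equal under every sort key
}
{code}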



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7020) [Java] Fix the bugs when calculating vector hash code

2019-11-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-7020.

Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5752
[https://github.com/apache/arrow/pull/5752]

> [Java] Fix the bugs when calculating vector hash code
> -
>
> Key: ARROW-7020
> URL: https://issues.apache.org/jira/browse/ARROW-7020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> When calculating the hash code for a value in the vector, the validity bit 
> must be taken into account.
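
A small sketch of the point (illustrative C++ with assumed types; the fix itself is in the Java vector classes): a null slot gets a fixed hash instead of hashing its undefined data bytes.

{code:cpp}
#include <cstdint>
#include <functional>

struct Int32Column {
  const int32_t* values;
  const bool* validity;  // true = slot is non-null
};

int32_t HashSlot(const Int32Column& col, int64_t index) {
  if (!col.validity[index]) {
    return 0;  // arbitrary fixed hash for null slots
  }
  return static_cast<int32_t>(std::hash<int32_t>{}(col.values[index]));
}
{code}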



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7097) [Rust][CI] Builds failing due to rust nightly

2019-11-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7097:
---

 Summary: [Rust][CI] Builds failing due to rust nightly
 Key: ARROW-7097
 URL: https://issues.apache.org/jira/browse/ARROW-7097
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Wes McKinney
 Fix For: 1.0.0


see e.g. https://github.com/apache/arrow/runs/293573608 on master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969762#comment-16969762
 ] 

Wes McKinney commented on ARROW-7083:
-

Note that no query engine development has been done, so that design document is 
simply a proposal until actual work happens. 

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute, C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7096) [C++] Add options structs for concatenation-with-promotion and schema unification

2019-11-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7096:
---

 Summary: [C++] Add options structs for 
concatenation-with-promotion and schema unification
 Key: ARROW-7096
 URL: https://issues.apache.org/jira/browse/ARROW-7096
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


Follow up to ARROW-6625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7091) [C++] Move all factories to type_fwd.h

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969750#comment-16969750
 ] 

Wes McKinney commented on ARROW-7091:
-

+1

> [C++] Move all factories to type_fwd.h
> --
>
> Key: ARROW-7091
> URL: https://issues.apache.org/jira/browse/ARROW-7091
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 1.0.0
>
>
> There's no particular reason why parameter-less factories are in 
> {{type_fwd.h}}, but the others in their respective implementation headers. By 
> putting more factories in {{type_fwd.h}}, we may be able to avoid importing 
> the heavier headers in some places.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7088) [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678

2019-11-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-7088.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Duplicate

Closing in favor of PARQUET-1688

> [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678
> ---
>
> Key: ARROW-7088
> URL: https://issues.apache.org/jira/browse/ARROW-7088
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Blocker
>
> See 
> https://travis-ci.org/ursa-labs/crossbow/builds/608629511?utm_source=github_status_medium=notification
> {code}
> /usr/bin/ccache /opt/rh/devtoolset-2/root/usr/bin/c++  -DARROW_JEMALLOC 
> -DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD 
> -DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H 
> -DPARQUET_EXPORTING -DPARQUET_USE_BOOST_REGEX -Isrc -I/arrow/cpp/src 
> -I/arrow/cpp/src/generated -isystem /arrow/cpp/thirdparty/flatbuffers/include 
> -isystem /arrow_boost_dist/include -isystem /usr/local/include -isystem 
> jemalloc_ep-prefix/src -isystem /arrow/cpp/thirdparty/hadoop/include -O3 
> -DNDEBUG  -Wall -Wno-attributes -msse4.2  -O3 -DNDEBUG -fPIC   -std=gnu++11 
> -MD -MT src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -MF 
> src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o.d -o 
> src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -c 
> /arrow/cpp/src/parquet/stream_reader.cc
> In file included from /arrow/cpp/src/parquet/stream_reader.h:31:0,
>  from /arrow/cpp/src/parquet/stream_reader.cc:18:
> /arrow/cpp/src/parquet/stream_writer.h:67:17: error: function 
> ‘parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)’ defaulted on its 
> first declaration with an exception-specification that differs from the 
> implicit declaration ‘parquet::StreamWriter& 
> parquet::StreamWriter::operator=(parquet::StreamWriter&&)’
>StreamWriter& operator=(StreamWriter&&) noexcept = default;
>  ^
> In file included from /arrow/cpp/src/parquet/stream_reader.cc:18:0:
> /arrow/cpp/src/parquet/stream_reader.h:61:17: error: function 
> ‘parquet::StreamReader& 
> parquet::StreamReader::operator=(parquet::StreamReader&&)’ defaulted on its 
> first declaration with an exception-specification that differs from the 
> implicit declaration ‘parquet::StreamReader& 
> parquet::StreamReader::operator=(parquet::StreamReader&&)’
>StreamReader& operator=(StreamReader&&) noexcept = default;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-11-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969714#comment-16969714
 ] 

Micah Kornfield commented on ARROW-4890:


Yes.  I believe it is 2GB per shard currently.

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in Arrow project as the traceback seems to suggest this is an 
> issue in Arrow.
>  Continuation from the conversation on the 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on increases. Here is a 
> code snippet with which I can reproduce this.
>  Note: My actual dataset is much larger and has many more unique IDs and is a 
> valid use case where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969712#comment-16969712
 ] 

Micah Kornfield commented on ARROW-1644:


The code isn't really usable since it is based on the old repo and a lot 
of changes have been made (and it had a performance regression). I haven't had 
time to work on this, but I still hope to get some bandwidth in the next month or 
so. But if there are motivated parties, I'm happy to remove my name from the 
assignment.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow from Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969709#comment-16969709
 ] 

Micah Kornfield commented on ARROW-7083:


[~yuanzhou] that is what this Jira is about. Currently all the kernels are 
100% C++ and don't use Gandiva. The question is how feasible it is to reuse 
Gandiva kernels in a non-JIT environment. It would be nice not to duplicate 
code, but in some contexts JIT isn't an option.

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute, C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-07 Thread Yuan Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969683#comment-16969683
 ] 

Yuan Zhou commented on ARROW-7083:
--

Hi [~emkornfi...@gmail.com]

For the coming AQE, which kernels will Arrow use? Is it 100% C++ kernels, 
or a combination of C++ and Gandiva kernels?

The design draft seems to combine these two kinds of kernels: 
https://docs.google.com/document/d/10RoUZmiMQRi_J1FcPeVAUAMJ6d_ZuiEbaM2Y33sNPu4/edit#heading=h.2k6k5a4y9b8y

Cheers, -yuan

> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute, C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7095) [R] Better handling of unsupported filter expression in dplyr methods

2019-11-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7095:
--

 Summary: [R] Better handling of unsupported filter expression in 
dplyr methods
 Key: ARROW-7095
 URL: https://issues.apache.org/jira/browse/ARROW-7095
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-6340. Consider erroring instead of calling `collect()` on a 
Dataset and filtering in R. Or see if there's a safer way to defer evaluation 
that may allow less data to be pulled down to R for filtering afterwards. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7094) [R] Change FileSystem access in Datasets to shared_ptr

2019-11-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7094:
--

 Summary: [R] Change FileSystem access in Datasets to shared_ptr
 Key: ARROW-7094
 URL: https://issues.apache.org/jira/browse/ARROW-7094
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-6340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7094) [R] Change FileSystem access in Datasets to shared_ptr

2019-11-07 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7094:
---
Component/s: C++ - Dataset

> [R] Change FileSystem access in Datasets to shared_ptr
> --
>
> Key: ARROW-7094
> URL: https://issues.apache.org/jira/browse/ARROW-7094
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Dataset, R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> Followup to ARROW-6340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7093) [R] Support creating ScalarExpressions for more data types

2019-11-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7093:
--

 Summary: [R] Support creating ScalarExpressions for more data types
 Key: ARROW-7093
 URL: https://issues.apache.org/jira/browse/ARROW-7093
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
 Fix For: 1.0.0


ARROW-6340 was limited to integer/double/logical.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7092) [R] Add vignette for dplyr and datasets

2019-11-07 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-7092:
--

 Summary: [R] Add vignette for dplyr and datasets
 Key: ARROW-7092
 URL: https://issues.apache.org/jira/browse/ARROW-7092
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 1.0.0


Followup to ARROW-6340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-7082) [Packaging][deb] Add apache-arrow-archive-keyring

2019-11-07 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-7082.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5786
[https://github.com/apache/arrow/pull/5786]

> [Packaging][deb] Add apache-arrow-archive-keyring
> -
>
> Key: ARROW-7082
> URL: https://issues.apache.org/jira/browse/ARROW-7082
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6792) [R] Explore roxygen2 R6 class documentation

2019-11-07 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969593#comment-16969593
 ] 

Neal Richardson commented on ARROW-6792:


I took a look at this in the course of writing documentation for ARROW-6340. 
Some observations:

* It's all or nothing. If we use the new roxygen at all, we have to update all 
of our existing docs. In that PR I added `r6 = FALSE` to the RoxygenNote to 
keep the old behavior for now.
* The first disqualifying feature I noticed is that the new R6 stuff 
doesn't like how we documented several classes in the same file. It just 
repeats "Super classes" and "Methods" sections down the page. See 
https://github.com/r-lib/roxygen2/issues/961.
* Bad crossreferences (reported https://github.com/r-lib/pkgdown/issues/1177)

> [R] Explore roxygen2 R6 class documentation
> ---
>
> Key: ARROW-6792
> URL: https://issues.apache.org/jira/browse/ARROW-6792
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 1.0.0
>
>
> roxygen2 version 7.0 adds support for documenting R6 classes, rather than the 
> ad hoc approach we've had to take without it: 
> [https://github.com/r-lib/roxygen2/blob/master/vignettes/rd.Rmd#L203]
> Try it out and see how we like it, and consider refactoring the docs to use 
> it everywhere.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7084) [C++] ArrayRangeEquals should check for full type equality?

2019-11-07 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969576#comment-16969576
 ] 

Uwe Korn commented on ARROW-7084:
-

It was an oversight when fixing ARROW-2567; we should also fix 
ArrayRangeEquals.

> [C++]  ArrayRangeEquals should check for full type equality?
> 
>
> Key: ARROW-7084
> URL: https://issues.apache.org/jira/browse/ARROW-7084
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> It looks like ArrayRangeEquals in compare.cc only checks type IDs before 
> comparing actual values.  This is inconsistent with ArrayEquals, which 
> checks for type equality, and also seems incorrect for cases like Decimal128. 
>  
> I presume this was an oversight when fixing ARROW-2567 but maybe it was 
> intentional?
> [~uwe]?
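
To illustrate why comparing only type ids is insufficient, a self-contained sketch with made-up types (not Arrow's compare.cc): two decimal types can share an id yet differ in their parameters, so equal ids do not imply equal types.

{code:cpp}
#include <iostream>

struct DecimalType {
  int type_id;    // hypothetical id shared by every decimal type
  int precision;
  int scale;
  bool Equals(const DecimalType& other) const {
    return precision == other.precision && scale == other.scale;
  }
};

int main() {
  DecimalType a{23, 10, 2};
  DecimalType b{23, 10, 3};
  std::cout << (a.type_id == b.type_id) << "\n";  // 1: ids match
  std::cout << a.Equals(b) << "\n";               // 0: the full types differ
  return 0;
}
{code}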



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7062) [C++] Parquet file parse error messages should include the file name

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7062:
--
Labels: dataset parquet pull-request-available  (was: dataset parquet)

> [C++] Parquet file parse error messages should include the file name
> 
>
> Key: ARROW-7062
> URL: https://issues.apache.org/jira/browse/ARROW-7062
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Dataset
>Reporter: Neal Richardson
>Priority: Major
>  Labels: dataset, parquet, pull-request-available
> Fix For: 1.0.0
>
>
> ARROW-7061 was harder to diagnose than it should have been because the error 
> message was opaque and didn't tell me where to look.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7074) [C++] ASSERT_OK_AND_ASSIGN crashes when failing

2019-11-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7074:
--
Labels: pull-request-available  (was: )

> [C++] ASSERT_OK_AND_ASSIGN crashes when failing
> ---
>
> Key: ARROW-7074
> URL: https://issues.apache.org/jira/browse/ARROW-7074
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Developer Tools
>Affects Versions: 0.15.1
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Instead of simply failing the test, the {{ASSERT_OK_AND_ASSIGN}} macro 
> crashes when the operation fails, e.g.:
> {code}
> Value of: _st.ok()
>   Actual: false
> Expected: true
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> F1106 12:53:32.882110  4698 result.cc:28] ValueOrDie called on an error:  XXX
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7091) [C++] Move all factories to type_fwd.h

2019-11-07 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7091:
-

 Summary: [C++] Move all factories to type_fwd.h
 Key: ARROW-7091
 URL: https://issues.apache.org/jira/browse/ARROW-7091
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Antoine Pitrou
 Fix For: 1.0.0


There's no particular reason why parameter-less factories are in 
{{type_fwd.h}} but the others are in their respective implementation headers. By 
putting more factories in {{type_fwd.h}}, we may be able to avoid importing the 
heavier headers in some places.
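
A sketch of the intended benefit, assuming an Arrow C++ build: a translation unit that only needs to construct a type can include the lightweight forward header, because the parameter-less factories are already declared there.

{code:cpp}
#include <memory>
#include <arrow/type_fwd.h>  // no need for the heavier arrow/type.h here

std::shared_ptr<arrow::DataType> MakeInt32Type() {
  return arrow::int32();  // parameter-less factory declared in type_fwd.h
}
{code}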



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7090) [C++] AssertFieldEqual (and friends) doesn't show metadata on failure

2019-11-07 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-7090:
-

 Summary: [C++] AssertFieldEqual (and friends) doesn't show 
metadata on failure
 Key: ARROW-7090
 URL: https://issues.apache.org/jira/browse/ARROW-7090
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


If two fields only differ by metadata, the error message isn't very informative:
{code}
../src/arrow/testing/gtest_util.cc:147: Failure
Failed
left field: ints: int8 not null
right field: ints: int8 not null

{code}

Perhaps {{DataType::ToString}}, {{Field::ToString}} and {{Schema::ToString}} 
could get an optional flag to display metadata?
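
One possible shape for that flag (a hypothetical signature, not the current Arrow C++ API):

{code:cpp}
#include <string>

class Field {
 public:
  // When show_metadata is true, append the key/value metadata so that two
  // fields differing only in metadata print differently.
  std::string ToString(bool show_metadata = false) const;
};
{code}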



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-11-07 Thread William Young (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969465#comment-16969465
 ] 

William Young commented on ARROW-1644:
--

Are there plans to merge this code? I have a use-case.

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 1.0.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow from Oct 4, 2017, I got 
> the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I somehow get the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be 
> able to load the nested parquet in pyarrow. 
> Any insight about this? 
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-3408) [C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns

2019-11-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3408.
-
Resolution: Fixed

Issue resolved by pull request 5785
[https://github.com/apache/arrow/pull/5785]

> [C++] Add option to CSV reader to dictionary encode individual columns or all 
> string / binary columns
> -
>
> Key: ARROW-3408
> URL: https://issues.apache.org/jira/browse/ARROW-3408
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, C++ - Dataset
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, dataset, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> For many datasets, dictionary encoding everything can result in drastically 
> lower memory usage and subsequently better performance in doing analytics.
> One difficulty of dictionary encoding in multithreaded conversions is that 
> ideally you end up with one dictionary at the end. So you have two options:
> * Implement a concurrent hashing scheme -- for low-cardinality dictionaries, 
> the overhead associated with mutex contention will not be meaningful; for 
> high-cardinality ones it can be more of a problem
> * Hash each chunk separately, then normalize at the end
> My guess is that a crude concurrent hash table with a mutex to protect 
> mutations and resizes is going to outperform the latter
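
For reference, a sketch of the second option, hashing each chunk separately and normalizing at the end (names and types are illustrative, not the Arrow CSV reader internals): every chunk keeps a local dictionary and a final pass remaps its codes into one unified dictionary.

{code:cpp}
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct EncodedChunk {
  std::vector<std::string> dictionary;  // chunk-local dictionary
  std::vector<int32_t> codes;           // indices into `dictionary`
};

void NormalizeDictionaries(std::vector<EncodedChunk>* chunks,
                           std::vector<std::string>* unified) {
  std::unordered_map<std::string, int32_t> index;
  for (auto& chunk : *chunks) {
    // Map each chunk-local code to a code in the unified dictionary.
    std::vector<int32_t> remap(chunk.dictionary.size());
    for (size_t i = 0; i < chunk.dictionary.size(); ++i) {
      auto it = index.find(chunk.dictionary[i]);
      if (it == index.end()) {
        it = index.emplace(chunk.dictionary[i],
                           static_cast<int32_t>(unified->size())).first;
        unified->push_back(chunk.dictionary[i]);
      }
      remap[i] = it->second;
    }
    for (int32_t& code : chunk.codes) code = remap[code];
  }
}
{code}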



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969407#comment-16969407
 ] 

François Blanchard commented on ARROW-7087:
---

I will

> [Python] Table Metadata disappear when we write a partitioned dataset
> -
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
> Fix For: 1.0.0
>
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is some example code: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pd
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from collumns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print table.schema.metadata
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pd.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print load_table.schema.metadata
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7087:

Fix Version/s: 1.0.0

> [Python] Table Metadata disappear when we write a partitioned dataset
> -
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
> Fix For: 1.0.0
>
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is some example code: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pd
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from collumns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print table.schema.metadata
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pd.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print load_table.schema.metadata
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969404#comment-16969404
 ] 

Wes McKinney commented on ARROW-7087:
-

I would guess this relates to the table splitting logic dropping the metadata. 
Please feel free to submit a PR to fix

> [Python] Table Metadata disappear when we write a partitioned dataset
> -
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is some example code: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pd
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from collumns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print table.schema.metadata
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pd.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print load_table.schema.metadata
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7087) [Python] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-7087:

Summary: [Python] Table Metadata disappear when we write a partitioned 
dataset  (was: [Pyarrow] Table Metadata disappear when we write a partitioned 
dataset)

> [Python] Table Metadata disappear when we write a partitioned dataset
> -
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is some example code: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pd
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from collumns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print table.schema.metadata
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pd.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print load_table.schema.metadata
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7085) [C++][CSV] Add support for Extension type in csv reader

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969403#comment-16969403
 ] 

Wes McKinney commented on ARROW-7085:
-

[~fexolm] could you clarify what you need -- a custom ColumnBuilder? [~apitrou] 
should be able to advise you about this

> [C++][CSV] Add support for Extension type in csv reader
> ---
>
> Key: ARROW-7085
> URL: https://issues.apache.org/jira/browse/ARROW-7085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Artem Alekseev
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969395#comment-16969395
 ] 

Wes McKinney commented on ARROW-6820:
-

I think we should have suggested names, but requiring certain names seems 
fraught. Since Map data might come from external sources (Spark, Parquet), I 
don't think it would be appropriate to overwrite the field names that might be 
used already in those sources. 

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7017) [C++] Refactor AddKernel to support other operations and types

2019-11-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969394#comment-16969394
 ] 

Wes McKinney commented on ARROW-7017:
-

Thanks, yes let's discuss more there. It seems like some investigation is 
indeed required. 

I think having LLVM as a build-time dependency is more palatable than as a 
runtime dependency in some applications. 

> [C++] Refactor AddKernel to support other operations and types
> --
>
> Key: ARROW-7017
> URL: https://issues.apache.org/jira/browse/ARROW-7017
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: analytics
>
> * Should avoid using builders (and/or NULLs) since the output shape is known 
> at compute time.
>  * Should be refactored to support other operations, e.g. Subtraction, 
> Multiplication.
>  * Should have an overflow/underflow detection mode.
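
A rough sketch of that direction (illustrative only, not Arrow's AddKernel): one templated kernel body shared across operations, writing directly into a preallocated output instead of a builder, with optional overflow detection via compiler intrinsics (GCC/Clang builtins assumed here).

{code:cpp}
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Add {
  static bool Apply(int64_t a, int64_t b, int64_t* out) {
    return !__builtin_add_overflow(a, b, out);  // false on overflow
  }
};

struct Subtract {
  static bool Apply(int64_t a, int64_t b, int64_t* out) {
    return !__builtin_sub_overflow(a, b, out);
  }
};

template <typename Op>
void BinaryKernel(const std::vector<int64_t>& left,
                  const std::vector<int64_t>& right,
                  std::vector<int64_t>* out, bool check_overflow) {
  out->resize(left.size());  // output shape is known up front, no builder
  for (size_t i = 0; i < left.size(); ++i) {
    bool ok = Op::Apply(left[i], right[i], &(*out)[i]);
    if (!ok && check_overflow) {
      throw std::overflow_error("integer overflow");
    }
  }
}
{code}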



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7089) [C++] In CMake output, list each enabled thirdparty toolchain dependency and the reason for its being enabled

2019-11-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7089:
---

 Summary: [C++] In CMake output, list each enabled thirdparty 
toolchain dependency and the reason for its being enabled
 Key: ARROW-7089
 URL: https://issues.apache.org/jira/browse/ARROW-7089
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


For example, for gtest it would say that it's enabled because 
ARROW_BUILD_TESTS=ON



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Blanchard updated ARROW-7087:
--
Attachment: (was: Capture d’écran 2019-11-07 à 16.46.37.png)

> [Pyarrow] Table Metadata disappear when we write a partitioned dataset
> --
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is some example code: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pd
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from collumns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 
> 'columnB'], metadata={'data': 'test'})
> print table.schema.metadata
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pd.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print load_table.schema.metadata
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
> >> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
> >> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": 
> >> []}')])
> """{code}
>  
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7088) [C++][Python] gcc 4.8 / wheel builds failing after PARQUET-1678

2019-11-07 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7088:
---

 Summary: [C++][Python] gcc 4.8 / wheel builds failing after 
PARQUET-1678
 Key: ARROW-7088
 URL: https://issues.apache.org/jira/browse/ARROW-7088
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Wes McKinney
 Fix For: 1.0.0


See 
https://travis-ci.org/ursa-labs/crossbow/builds/608629511?utm_source=github_status_medium=notification

{code}
/usr/bin/ccache /opt/rh/devtoolset-2/root/usr/bin/c++  -DARROW_JEMALLOC 
-DARROW_JEMALLOC_INCLUDE_DIR="" -DARROW_USE_GLOG -DARROW_USE_SIMD 
-DARROW_WITH_ZSTD -DHAVE_INTTYPES_H -DHAVE_NETDB_H -DHAVE_NETINET_IN_H 
-DPARQUET_EXPORTING -DPARQUET_USE_BOOST_REGEX -Isrc -I/arrow/cpp/src 
-I/arrow/cpp/src/generated -isystem /arrow/cpp/thirdparty/flatbuffers/include 
-isystem /arrow_boost_dist/include -isystem /usr/local/include -isystem 
jemalloc_ep-prefix/src -isystem /arrow/cpp/thirdparty/hadoop/include -O3 
-DNDEBUG  -Wall -Wno-attributes -msse4.2  -O3 -DNDEBUG -fPIC   -std=gnu++11 -MD 
-MT src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -MF 
src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o.d -o 
src/parquet/CMakeFiles/parquet_objlib.dir/stream_reader.cc.o -c 
/arrow/cpp/src/parquet/stream_reader.cc
In file included from /arrow/cpp/src/parquet/stream_reader.h:31:0,
 from /arrow/cpp/src/parquet/stream_reader.cc:18:
/arrow/cpp/src/parquet/stream_writer.h:67:17: error: function 
‘parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)’ defaulted on its 
first declaration with an exception-specification that differs from the 
implicit declaration ‘parquet::StreamWriter& 
parquet::StreamWriter::operator=(parquet::StreamWriter&&)’
   StreamWriter& operator=(StreamWriter&&) noexcept = default;
 ^
In file included from /arrow/cpp/src/parquet/stream_reader.cc:18:0:
/arrow/cpp/src/parquet/stream_reader.h:61:17: error: function 
‘parquet::StreamReader& 
parquet::StreamReader::operator=(parquet::StreamReader&&)’ defaulted on its 
first declaration with an exception-specification that differs from the 
implicit declaration ‘parquet::StreamReader& 
parquet::StreamReader::operator=(parquet::StreamReader&&)’
   StreamReader& operator=(StreamReader&&) noexcept = default;
{code}
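
A minimal reduction of the kind of declaration gcc 4.8 rejects here (hypothetical classes, not the actual parquet::StreamWriter): the move assignment is defaulted and declared noexcept, but a member's move assignment may throw, so the implicit exception-specification differs from the declared one.

{code:cpp}
struct MayThrowOnMove {
  MayThrowOnMove& operator=(MayThrowOnMove&&) noexcept(false) { return *this; }
};

struct Writer {
  MayThrowOnMove member;
  // gcc 4.8: "defaulted on its first declaration with an exception-specification
  // that differs from the implicit declaration"; newer compilers accept it.
  Writer& operator=(Writer&&) noexcept = default;
};
{code}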



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Blanchard updated ARROW-7087:
--
Description: 
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata, the metadata is replaced by 
pandas metadata. This happens only if *partition_cols* is defined.

To be more explicit, here is some example code: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], 
metadata={'data': 'test'})
print table.schema.metadata
"""
Metadata is set as expected

>> OrderedDict([('data', 'test')])
"""

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print load_table.schema.metadata
"""
Metadata with the key `data` is missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
"""{code}
 
  
  

  was:
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata, the metadata is replaced by 
pandas metadata. This happens only if *partition_cols* is defined.

To be more explicit, here is some example code: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pd

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from collumns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], 
metadata={'data': 'test'})
print table.schema.metadata
"""
Metadata is set as expected

>> OrderedDict([('data', 'test')])
"""

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pd.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print load_table.schema.metadata
"""
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
"""{code}
 
  
  


> [Pyarrow] Table Metadata disappear when we write a partitioned dataset
> --
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is an example: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
>                              metadata={'data': 'test'})
> print(table.schema.metadata)
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> """
> Metadata with the key `data` is missing
> >> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
> >> "pyarrow"}, "pandas_version": "0.22.0", 

[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Blanchard updated ARROW-7087:
--
Description: 
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata, the metadata is replaced by 
pandas metadata. This happens only if *partition_cols* is defined.

 

To be more explicit, here is an example: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
                             metadata={'data': 'test'})
print(table.schema.metadata)
"""
Metadata is set as expected

>> OrderedDict([('data', 'test')])
"""

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
"""
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
"""{code}
 
  
  

  was:
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata then metadata are replaced by 
pandas metadata. This happens only if we defined *partition_cols*.

 

To be more explicit here is an example code: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pd

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from collumns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], 
metadata={'data': 'test'})
print table.schema.metadata
'''
Metadata is set as expected

>> OrderedDict([('data', 'test')])
'''

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pd.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print load_table.schema.metadata
'''
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
'''{code}
 
  
 


> [Pyarrow] Table Metadata disappear when we write a partitioned dataset
> --
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
> Attachments: Capture d’écran 2019-11-07 à 16.46.37.png
>
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is an example: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
>                              metadata={'data': 'test'})
> print(table.schema.metadata)
> """
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> """
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> """
> Metadata with the key `data` are missing
> >> OrderedDict([('pandas', '{"creator": {"version": 

[jira] [Updated] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Blanchard updated ARROW-7087:
--
Description: 
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata, the metadata is replaced by 
pandas metadata. This happens only if *partition_cols* is defined.

 

To be more explicit, here is an example: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
                             metadata={'data': 'test'})
print(table.schema.metadata)
'''
Metadata is set as expected

>> OrderedDict([('data', 'test')])
'''

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
'''
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
'''{code}
 
  
 

  was:
There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata then metadata are replaced by 
pandas metadata. This happens only if we defined *partition_cols*.

 

To be more explicit here is an example code: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pd

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from collumns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'], 
metadata={'data': 'test'})
print table.schema.metadata
``` 
Metadata is set as expected

>> OrderedDict([('data', 'test')])
```

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pd.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print load_table.schema.metadata
```
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
```{code}
 
 


> [Pyarrow] Table Metadata disappear when we write a partitioned dataset
> --
>
> Key: ARROW-7087
> URL: https://issues.apache.org/jira/browse/ARROW-7087
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: François Blanchard
>Priority: Major
> Attachments: Capture d’écran 2019-11-07 à 16.46.37.png
>
>
> There is an unexpected behavior with the method 
> *[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
>  in *pyarrow/parquet.py*
> When we write a table that contains metadata, the metadata is replaced by 
> pandas metadata. This happens only if *partition_cols* is defined.
>  
> To be more explicit, here is an example: 
> {code:python}
> from pyarrow.parquet import write_to_dataset
> import pyarrow as pa
> import pyarrow.parquet as pq
> columnA = pa.array(['a', 'b', 'c'], type=pa.string())
> columnB = pa.array([1, 1, 2], type=pa.int32())
> # Build table from columns
> table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
>                              metadata={'data': 'test'})
> print(table.schema.metadata)
> '''
> Metadata is set as expected
> >> OrderedDict([('data', 'test')])
> '''
> # Write table in parquet format partitioned per columnB
> write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])
> # Load data from parquet files
> ds = pq.ParquetDataset('/path/to/test')
> load_table = pq.read_table(ds.pieces[0].path)
> print(load_table.schema.metadata)
> '''
> Metadata with the key `data` are missing
> >> OrderedDict([('pandas', '{"creator": {"version": 

[jira] [Created] (ARROW-7087) [Pyarrow] Table Metadata disappear when we write a partitioned dataset

2019-11-07 Thread Jira
François Blanchard created ARROW-7087:
-

 Summary: [Pyarrow] Table Metadata disappear when we write a 
partitioned dataset
 Key: ARROW-7087
 URL: https://issues.apache.org/jira/browse/ARROW-7087
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: François Blanchard
 Attachments: Capture d’écran 2019-11-07 à 16.46.37.png

There is an unexpected behavior with the method 
*[write_to_dataset|https://github.com/apache/arrow/blob/10a3b716a5ca227c8d97e6f6b27976df14678263/python/pyarrow/parquet.py#L1373]*
 in *pyarrow/parquet.py*

When we write a table that contains metadata, the metadata is replaced by 
pandas metadata. This happens only if *partition_cols* is defined.

 

To be more explicit, here is an example: 
{code:python}
from pyarrow.parquet import write_to_dataset
import pyarrow as pa
import pyarrow.parquet as pq

columnA = pa.array(['a', 'b', 'c'], type=pa.string())
columnB = pa.array([1, 1, 2], type=pa.int32())

# Build table from columns
table = pa.Table.from_arrays([columnA, columnB], names=['columnA', 'columnB'],
                             metadata={'data': 'test'})
print(table.schema.metadata)
``` 
Metadata is set as expected

>> OrderedDict([('data', 'test')])
```

# Write table in parquet format partitioned per columnB
write_to_dataset(table, '/path/to/test', partition_cols=['columnB'])

# Load data from parquet files
ds = pq.ParquetDataset('/path/to/test')
load_table = pq.read_table(ds.pieces[0].path)
print(load_table.schema.metadata)
```
Metadata with the key `data` are missing


>> OrderedDict([('pandas', '{"creator": {"version": "0.14.1", "library": 
>> "pyarrow"}, "pandas_version": "0.22.0", "index_columns": [], "columns": 
>> [{"metadata": null, "field_name": "columnA", "name": "columnA", 
>> "numpy_type": "object", "pandas_type": "unicode"}], "column_indexes": []}')])
```{code}
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7086:

Description: 
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(SafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning overload then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(SafeAdd, a, b);
}
{code}

This will probably have to be a macro; otherwise the return type can be 
inferred but only when the function is not overloaded
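As an illustration only, here is a minimal sketch of such an adapter, assuming Arrow's public {{Result}} and {{ARROW_RETURN_NOT_OK}} helpers; the name {{ResultInvoke}}, the explicit template parameter, and the {{DoSafeAdd}} overload are assumptions for this sketch, not the design settled in this ticket:

{code}
// Hypothetical sketch only: adapt a Status-returning "out parameter" factory
// so that it yields an arrow::Result<T>. The explicit template parameter T
// sidesteps the inference problem noted above for overloaded functions.
#include <utility>

#include "arrow/result.h"
#include "arrow/status.h"

template <typename T, typename Fn, typename... Args>
arrow::Result<T> ResultInvoke(Fn&& fn, Args&&... args) {
  T out;
  // On error the Status is forwarded; Result<T> is constructible from Status.
  ARROW_RETURN_NOT_OK(std::forward<Fn>(fn)(std::forward<Args>(args)..., &out));
  return out;
}

// Possible usage, assuming a Status DoSafeAdd(int, int, int*) overload exists:
// arrow::Result<int> SafeAdd(int a, int b) {
//   return ResultInvoke<int>(DoSafeAdd, a, b);
// }
{code}

A macro wrapper could instead deduce T from the factory's out-parameter, which is exactly where the overload caveat above applies.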

  was:
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}

This will probably have to be a macro; otherwise the return type can be 
inferred but only when the function is not overloaded


> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(SafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into 
> the Result returning overload then using {{Result::Value}} in the Status 
> returning function. In cases where this is inconvenient, it'd be helpful to 
> have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(SafeAdd, a, b);
> }
> {code}
> This will probably have to be a macro; otherwise the return type can be 
> inferred but only when the function is not overloaded



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7086:

Description: 
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}

This will probably have to be a macro; otherwise the return type can be 
inferred but only when the function is not overloaded

  was:
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}


> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into 
> the Result returning function then using {{Result::Value}} in the Status 
> returning function. In cases where this is inconvenient, it'd be helpful to 
> have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(DoSafeAdd, a, b);
> }
> {code}
> This will probably have to be a macro; otherwise the return type can be 
> inferred but only when the function is not overloaded



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-7086:

Description: 
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return RESULT_INVOKE(DoSafeAdd, a, b);
}
{code}

  was:
There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return ResultInvoke(DoSafeAdd, a, b);
}
{code}


> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into 
> the Result returning function then using {{Result::Value}} in the Status 
> returning function. In cases where this is inconvenient, it'd be helpful to 
> have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return RESULT_INVOKE(DoSafeAdd, a, b);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969288#comment-16969288
 ] 

Ben Kietzman commented on ARROW-7086:
-

[~emkornfield]

> [C++] Provide a wrapper for invoking factories to produce a Result
> --
>
> Key: ARROW-7086
> URL: https://issues.apache.org/jira/browse/ARROW-7086
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> There is a proliferation of code like:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   int out;
>   RETURN_NOT_OK(DoSafeAdd(a, b, &out));
>   return out;
> }
> {code}
> Ideally, this should be resolved by moving the implementation of SafeAdd into 
> the Result returning function then using {{Result::Value}} in the Status 
> returning function. In cases where this is inconvenient, it'd be helpful to 
> have an adapter for doing this more efficiently:
> {code}
> Result<int> SafeAdd(int a, int b) {
>   return ResultInvoke(DoSafeAdd, a, b);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7086) [C++] Provide a wrapper for invoking factories to produce a Result

2019-11-07 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-7086:
---

 Summary: [C++] Provide a wrapper for invoking factories to produce 
a Result
 Key: ARROW-7086
 URL: https://issues.apache.org/jira/browse/ARROW-7086
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.15.1
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 1.0.0


There is a proliferation of code like:

{code}
Result<int> SafeAdd(int a, int b) {
  int out;
  RETURN_NOT_OK(DoSafeAdd(a, b, &out));
  return out;
}
{code}

Ideally, this should be resolved by moving the implementation of SafeAdd into 
the Result returning function then using {{Result::Value}} in the Status 
returning function. In cases where this is inconvenient, it'd be helpful to 
have an adapter for doing this more efficiently:

{code}
Result<int> SafeAdd(int a, int b) {
  return ResultInvoke(DoSafeAdd, a, b);
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4631) [C++] Implement serial version of sort computational kernel

2019-11-07 Thread Artem Alekseev (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Alekseev reassigned ARROW-4631:
-

Assignee: (was: Artem Alekseev)

> [C++] Implement serial version of sort computational kernel
> ---
>
> Key: ARROW-4631
> URL: https://issues.apache.org/jira/browse/ARROW-4631
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Areg Melik-Adamyan
>Priority: Major
>  Labels: analytics
> Fix For: 1.0.0
>
>
> Implement serial version of sort computational kernel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7085) [C++][CSV] Add support for Extension type in CSV reader

2019-11-07 Thread Artem Alekseev (Jira)
Artem Alekseev created ARROW-7085:
-

 Summary: [C++][CSV] Add support for Extension type in CSV reader
 Key: ARROW-7085
 URL: https://issues.apache.org/jira/browse/ARROW-7085
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Artem Alekseev






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments

2019-11-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969260#comment-16969260
 ] 

Antoine Pitrou commented on ARROW-3410:
---

> Due to previous point, ScanTask should not hold memory until consumed

Hmm... to define the blocks in a CSV file, I have to read the CSV file 
entirely. So if memory isn't held, then each ScanTask will have to read the CSV 
file a second time. This may not be a big problem (but still suboptimal - 
memory copies) if the CSV file stays in the filesystem cache, but what about a 
huge CSV file?

The only reasonable way to ingest a CSV file in parallel is to do the chunking 
while reading the file, AFAIK.

> ScanTask are expected to be bound to a single thread and shouldn't have 
> nested parallelism.

Why that? It shouldn't be a problem if using the global thread pool.


> [C++][Dataset] Streaming CSV reader interface for memory-constrained 
> environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, C++ - Dataset
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file 
> do not fit into memory, this can be a problem. I propose to define a 
> streaming {{RecordBatchReader}} interface so that the record batches produced 
> by reading can be written out immediately to a stream on disk, to be memory 
> mapped later



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments

2019-11-07 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969243#comment-16969243
 ] 

Francois Saint-Jacques commented on ARROW-3410:
---

Ideally not a RecordBatch iterator. Looking at `file_parquet.cc` is your best 
bet.

# CSV reader's options should be in an instance of CSVFileFormat.
# Implement `CSVFileFormat::Inspect()`, this is needed to "peek" the Schema of 
a file. It should be possible to limit the number of rows parsed (in the 
constructor of CSVFileFormat) for the inspect call.
# Implement `CSVFileFormat::ScanFile()`. This returns a ScanTaskIterator. A 
ScanTask is a closure that yields an iterator of RecordBatches.

Some expected requirements by callers (Scanner::ToTable()) of ScanFile:
* ScanFile should be fast-ish. It is used to enumerate all ScanTasks before 
dispatching to the thread pool. It is run serially over all fragments in a 
DataSource (this could change).
* Due to previous point, ScanTask should not hold memory until consumed (in 
parquet, it only holds the row_group_id). In the case of CSV, it might be that 
the Blocks are referenced by (offset, length) instead of a shared_ptr.
* ScanTask are expected to be bound to a single thread and shouldn't have 
nested parallelism.
* No inference should be done; the user _always_ passes an explicit schema at 
DataSource construction time. 
* Ensure that column subset projection is properly done, (see 
InferColumnProjection in parquet). This is probably the only optimization we 
can make for now, there's nothing much we can do about predicate pushdown.

The way I foresee it being implemented is the following:
* The CSV parser divides the file into blocks in ScanFile(); each block is bound 
to a ScanTask. As noted, this needs to be done in a fashion that does not hold 
memory.
* A ScanTask parses a block and yields one or more RecordBatches (see the sketch 
below).

This is very similar to the current ThreadedReader with some differences:
* Inversion of control, it yields tasks instead of dispatching them directly.
* The Block iterator must not be blocking and not hold buffers.
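To make the inversion of control concrete, here is a minimal, self-contained sketch; the names ({{CsvBlock}}, {{ScanFile}}, {{ParseBlock}}) and the stub types are placeholders for illustration, not the dataset API. Enumeration only records {{(offset, length)}} blocks and holds no buffers; parsing happens inside each task when it is executed on a worker thread.

{code}
// Illustration only (placeholder names, stand-in types): lightweight tasks
// that defer all reading and parsing until they are executed.
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct RecordBatchStub {};  // stand-in for arrow::RecordBatch

// A ScanTask is a deferred unit of work: nothing is read until it is invoked.
using ScanTask = std::function<std::vector<std::shared_ptr<RecordBatchStub>>()>;

struct CsvBlock {
  int64_t offset;
  int64_t length;
};

// Stand-in for the real parser: parse one block of the file into batches.
std::vector<std::shared_ptr<RecordBatchStub>> ParseBlock(const std::string& path,
                                                         CsvBlock block) {
  (void)path;
  (void)block;
  return {std::make_shared<RecordBatchStub>()};
}

// ScanFile() only enumerates tasks; each task holds just (path, offset, length).
std::vector<ScanTask> ScanFile(const std::string& path,
                               const std::vector<CsvBlock>& blocks) {
  std::vector<ScanTask> tasks;
  for (const CsvBlock& block : blocks) {
    tasks.push_back([path, block]() { return ParseBlock(path, block); });
  }
  return tasks;
}
{code}

As the reply above points out, the open question is how those block boundaries can be determined without reading the whole file first.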

> [C++][Dataset] Streaming CSV reader interface for memory-constrained 
> environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, C++ - Dataset
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file 
> do not fit into memory, this can be a problem. I propose to define a 
> streaming {{RecordBatchReader}} interface so that the record batches produced 
> by reading can be written out immediately to a stream on disk, to be memory 
> mapped later



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7083) [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels

2019-11-07 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-7083:
--
Description: 
See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]

 

Requirements:

1.  No hard runtime dependency on LLVM

2.  Ability to run without LLVM static/shared libraries.

 

Open questions:

1.  What dependencies does this add to the build tool chain?

  was:
See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]

 

Requirements:

1.  No hard runtime dependency on LLVM

2.  Ability to run without JIT.

 

Open questions:

1.  What dependencies does this add to the build tool chain?


> [C++] Determine the feasibility and build a prototype to replace 
> compute/kernels with gandiva kernels
> -
>
> Key: ARROW-7083
> URL: https://issues.apache.org/jira/browse/ARROW-7083
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, C++ - Compute, C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> See discussion on [https://issues.apache.org/jira/browse/ARROW-7017]
>  
> Requirements:
> 1.  No hard runtime dependency on LLVM
> 2.  Ability to run without LLVM static/shared libraries.
>  
> Open questions:
> 1.  What dependencies does this add to the build tool chain?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969186#comment-16969186
 ] 

Antoine Pitrou commented on ARROW-6820:
---

Names might become significant in some contexts, for example if data is 
converted into other formats. Regardless, the inconsistency is a bit confusing.


> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3410) [C++][Dataset] Streaming CSV reader interface for memory-constrained environments

2019-11-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969181#comment-16969181
 ] 

Antoine Pitrou commented on ARROW-3410:
---

[~fsaintjacques] What kind of API would Datasets need from a streaming CSV 
reader? A RecordBatch iterator? Something else?

> [C++][Dataset] Streaming CSV reader interface for memory-constrained 
> environments
> --
>
> Key: ARROW-3410
> URL: https://issues.apache.org/jira/browse/ARROW-3410
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, C++ - Dataset
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataset
>
> CSV reads are currently all-or-nothing. If the results of parsing a CSV file 
> do not fit into memory, this can be a problem. I propose to define a 
> streaming {{RecordBatchReader}} interface so that the record batches produced 
> by reading can be written out immediately to a stream on disk, to be memory 
> mapped later



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7084) [C++] ArrayRangeEquals should check for full type equality?

2019-11-07 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-7084:
--

 Summary: [C++]  ArrayRangeEquals should check for full type 
equality?
 Key: ARROW-7084
 URL: https://issues.apache.org/jira/browse/ARROW-7084
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield


It looks like ArrayRangeEquals in compare.cc only checks type IDs before 
comparing actual values. This is inconsistent with ArrayEquals, which checks 
for type equality, and also seems incorrect for cases like Decimal128. 

 

I presume this was an oversight when fixing ARROW-2567 but maybe it was 
intentional?

[~uwe]?
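As an assumed illustration (not from this ticket) of why an id-only check is too weak: two Decimal128 types share the same type id but differ in precision and scale, so full type equality is needed before comparing ranges.

{code}
// Assumed example: same Type id, different logical types.
#include <iostream>

#include "arrow/type.h"

int main() {
  auto t1 = arrow::decimal(/*precision=*/10, /*scale=*/2);
  auto t2 = arrow::decimal(/*precision=*/20, /*scale=*/4);
  std::cout << (t1->id() == t2->id()) << std::endl;  // 1: same Type id
  std::cout << t1->Equals(*t2) << std::endl;         // 0: not equal types
  return 0;
}
{code}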



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969024#comment-16969024
 ] 

Micah Kornfield commented on ARROW-6820:


/// The names of the child fields *may* be respectively "entry", "key", and
/// "value", *but this is not enforced*

I'm not sure I understand the issue.  The way I read the spec, naming is not 
enforced.  See bolded section.

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)