[jira] [Updated] (ARROW-6317) [JS] Implement changes to ensure flatbuffer alignment

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6317:
--
Labels: pull-request-available  (was: )

> [JS] Implement changes to ensure flatbuffer alignment
> -
>
> Key: ARROW-6317
> URL: https://issues.apache.org/jira/browse/ARROW-6317
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: JavaScript
>Reporter: Micah Kornfield
>Assignee: Paul Taylor
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> See description in parent bug on requirements.
> [~bhulette] or [~paul.e.taylor], do you think one of you would be able to pick 
> this up for 0.15.0?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[GitHub] [arrow-testing] emkornfield commented on issue #9: ARROW-6318: add generated files from integration test to testing

2019-08-28 Thread GitBox
emkornfield commented on issue #9: ARROW-6318: add generated files from 
integration test to testing 
URL: https://github.com/apache/arrow-testing/pull/9#issuecomment-526026090
 
 
   @wesm anything else you would like to see done?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (ARROW-6372) [Rust][Datafusion] Casting from Unsigned to Signed Integers not supported

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6372:
--
Labels: beginner pull-request-available  (was: beginner)

> [Rust][Datafusion] Casting from Unsigned to Signed Integers not supported
> --
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6372) [Rust][Datafusion] Casting from Unsigned to Signed Integers not supported

2019-08-28 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan updated ARROW-6372:
---
Summary: [Rust][Datafusion] Casting from Unsigned to Signed Integers not 
supported  (was: [Rust][Datafusion] Predicate push down optimization can break 
query plan)

> [Rust][Datafusion] Casting from Unsigned to Signed Integers not supported
> --
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Paddy Horan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918234#comment-16918234
 ] 

Paddy Horan commented on ARROW-6372:


Oh, that's interesting.  I was trying to give a minimal example of an issue I am 
having in another project, and I believe the example highlights a different 
issue.

I'll take this one on; I thought it was more complex.

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Paddy Horan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paddy Horan reassigned ARROW-6372:
--

Assignee: Paddy Horan

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6348) [R] arrow::read_csv_arrow namespace error when package not loaded

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6348:
--
Labels: pull-request-available  (was: )

> [R] arrow::read_csv_arrow namespace error when package not loaded
> -
>
> Key: ARROW-6348
> URL: https://issues.apache.org/jira/browse/ARROW-6348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: hugh marera
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> {quote}I am not sure if the arrow::read_csv_arrow() error below is a bug or a 
> feature?
>  
> data("iris")
>  write.csv(iris, "iris.csv")
>  test <- arrow::read_csv_arrow("iris.csv")
>  Error in read_delim_arrow(file = "iris.csv", delim = ",") :
>  could not find function "read_delim_arrow"
>  test <- arrow::read_delim_arrow("iris.csv")
>  sessionInfo()
>  R version 3.6.1 (2019-07-05)
>  Platform: x86_64-apple-darwin18.6.0 (64-bit)
>  Running under: macOS Mojave 10.14.6
> {quote}
> Matrix products: default
>  BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
> locale:
>  [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
>  [1] stats graphics grDevices utils datasets
>  [6] methods base
> loaded via a namespace (and not attached):
>  [1] tidyselect_0.2.5 bit_1.1-14 compiler_3.6.1
>  [4] magrittr_1.5 assertthat_0.2.1 R6_2.4.0
>  [7] tools_3.6.1 fs_1.3.1 glue_1.3.1
>  [10] Rcpp_1.0.2 bit64_0.9-7 arrow_0.14.1.1
>  [13] rlang_0.4.0 purrr_0.3.2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6348) [R] arrow::read_csv_arrow namespace error when package not loaded

2019-08-28 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6348:
--

Assignee: Neal Richardson

> [R] arrow::read_csv_arrow namespace error when package not loaded
> -
>
> Key: ARROW-6348
> URL: https://issues.apache.org/jira/browse/ARROW-6348
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: hugh marera
>Assignee: Neal Richardson
>Priority: Minor
> Fix For: 0.15.0
>
>
> {quote}I am not sure if the arrow::read_csv_arrow() error below is a bug or a 
> feature?
>  
> data("iris")
>  write.csv(iris, "iris.csv")
>  test <- arrow::read_csv_arrow("iris.csv")
>  Error in read_delim_arrow(file = "iris.csv", delim = ",") :
>  could not find function "read_delim_arrow"
>  test <- arrow::read_delim_arrow("iris.csv")
>  sessionInfo()
>  R version 3.6.1 (2019-07-05)
>  Platform: x86_64-apple-darwin18.6.0 (64-bit)
>  Running under: macOS Mojave 10.14.6
> {quote}
> Matrix products: default
>  BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib
> locale:
>  [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
>  [1] stats graphics grDevices utils datasets
>  [6] methods base
> loaded via a namespace (and not attached):
>  [1] tidyselect_0.2.5 bit_1.1-14 compiler_3.6.1
>  [4] magrittr_1.5 assertthat_0.2.1 R6_2.4.0
>  [7] tools_3.6.1 fs_1.3.1 glue_1.3.1
>  [10] Rcpp_1.0.2 bit64_0.9-7 arrow_0.14.1.1
>  [13] rlang_0.4.0 purrr_0.3.2



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6379) [C++] Do not append any buffers when serializing NullType for IPC

2019-08-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918208#comment-16918208
 ] 

Wes McKinney commented on ARROW-6379:
-

We should discuss this on the mailing list. It would cause old IPC messages 
containing null types to be unreadable, but we never had integration tests 
anyway

> [C++] Do not append any buffers when serializing NullType for IPC
> -
>
> Key: ARROW-6379
> URL: https://issues.apache.org/jira/browse/ARROW-6379
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Currently we send a length-0 buffer. It would be better to not include any 
> buffers. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6381) [C++] BufferOutputStream::Write is slow for many small writes

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6381:
--
Labels: pull-request-available  (was: )

> [C++] BufferOutputStream::Write is slow for many small writes
> -
>
> Key: ARROW-6381
> URL: https://issues.apache.org/jira/browse/ARROW-6381
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> {{Write}} calls into {{BufferOutputStream::Reserve}} which does a surprising 
> amount of work. I suggest streamlining the implementation and adding 
> benchmarks for the many-small-writes case



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5995) [Python] pyarrow: hdfs: support file checksum

2019-08-28 Thread Max Risuhin (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918152#comment-16918152
 ] 

Max Risuhin commented on ARROW-5995:


The Arrow codebase supports HDFS access through two different drivers: libhdfs3 
and libhdfs, the official C-based library distributed with Hadoop.

Since further support of libhdfs3 is not planned, the official libhdfs is the 
only option.

The bad news is that libhdfs has no C API to retrieve a checksum, even though 
the [libhdfs C API is intended to be a subset of the Hadoop FileSystem 
APIs|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html#The_APIs].

The relevant C API can be inspected 
[here|https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfs/include/hdfs/hdfs.h].
Unfortunately, I can't see any checksum-related field in the returned data 
structures, nor a dedicated API function. (It should look somewhat like 
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
 )

[~efiop] It seems the missing getFileChecksum API function is the main reason 
this functionality is not available through Arrow.

A straightforward but time-consuming solution would be to extend libhdfs with 
getFileChecksum.

Another possibility might be to calculate a checksum from the available API 
calls (open, read, etc.), but that does not sound like an efficient approach.
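
To make that last option concrete, here is a rough sketch of a client-side 
digest using pyarrow's legacy hdfs API (my own illustration, not code from this 
issue; note it computes a plain MD5 over the file bytes, which will not match 
the HDFS-native MD5-of-MD5-of-CRC32C checksum):

{code:python}
import hashlib

import pyarrow as pa


def hdfs_md5(path, host="default", port=0, chunk_size=1 << 20):
    """Stream a file through libhdfs and digest it client-side."""
    fs = pa.hdfs.connect(host, port)
    digest = hashlib.md5()
    with fs.open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
{code}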

 

> [Python] pyarrow: hdfs: support file checksum
> -
>
> Key: ARROW-5995
> URL: https://issues.apache.org/jira/browse/ARROW-5995
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Ruslan Kuprieiev
>Priority: Minor
>
> I was not able to find how to retrieve the checksum (`getFileChecksum` or 
> `hadoop fs/dfs -checksum`) for a file on HDFS. Judging by how it is 
> implemented in the Hadoop CLI [1], it looks like we will also need to 
> implement it manually in pyarrow. Please correct me if I'm missing something. 
> Is this feature desirable? Or was there a good reason why it wasn't 
> implemented already?
>  [1] 
> [https://github.com/hanborq/hadoop/blob/hadoop-hdh3u2.1/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java#L719]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6381) [C++] BufferOutputStream::Write is slow for many small writes

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6381:
---

Assignee: Wes McKinney

> [C++] BufferOutputStream::Write is slow for many small writes
> -
>
> Key: ARROW-6381
> URL: https://issues.apache.org/jira/browse/ARROW-6381
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
> Fix For: 0.15.0
>
>
> {{Write}} calls into {{BufferOutputStream::Reserve}} which does a surprising 
> amount of work. I suggest streamlining the implementation and adding 
> benchmarks for the many-small-writes case



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6381) [C++] BufferOutputStream::Write is slow for many small writes

2019-08-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6381:
---

 Summary: [C++] BufferOutputStream::Write is slow for many small 
writes
 Key: ARROW-6381
 URL: https://issues.apache.org/jira/browse/ARROW-6381
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


{{Write}} calls into {{BufferOutputStream::Reserve}} which does a surprising 
amount of work. I suggest streamlining the implementation and adding benchmarks 
for the many-small-writes case
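
For a sense of the pattern being targeted, a minimal sketch through the pyarrow 
wrapper of the same C++ stream (the loop count is arbitrary; the real 
benchmarks belong in C++):

{code:python}
import pyarrow as pa

stream = pa.BufferOutputStream()
for i in range(1_000_000):
    # One tiny write per iteration; each call passes through Reserve().
    stream.write(b"12345678")
buf = stream.getvalue()  # finalize into a pyarrow.Buffer
{code}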



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5176) [Python] Automate formatting of python files

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5176:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> Python code in ways that flake8 and our other linters approve of. Adding it 
> to the project will allow more reliably formatted Python code and fill a 
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5176) [Python] Automate formatting of python files

2019-08-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918145#comment-16918145
 ] 

Wes McKinney commented on ARROW-5176:
-

This isn't urgent to address for 0.15.0

> [Python] Automate formatting of python files
> 
>
> Key: ARROW-5176
> URL: https://issues.apache.org/jira/browse/ARROW-5176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Benjamin Kietzman
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> [Black](https://github.com/ambv/black) is a tool for automatically formatting 
> Python code in ways that flake8 and our other linters approve of. Adding it 
> to the project will allow more reliably formatted Python code and fill a 
> similar role to {{clang-format}} for C++ and {{cmake-format}} for CMake.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-4359) [Python] Column metadata is not saved or loaded in parquet

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4359:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [Python] Column metadata is not saved or loaded in parquet
> --
>
> Key: ARROW-4359
> URL: https://issues.apache.org/jira/browse/ARROW-4359
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Seb Fru
>Priority: Major
>  Labels: parquet
> Fix For: 1.0.0
>
>
> Hi all,
> a while ago I posted this issue: ARROW-3866
> While working with PyArrow I encountered another potential bug related to 
> column metadata: if I create a table containing columns with metadata, 
> everything is fine. But after I save the table to parquet and load it back as 
> a table using pq.read_table, the column metadata is gone.
>  
> As of now I cannot say yet whether the metadata is not saved correctly or 
> not loaded correctly, as I have no idea how to verify it. Unfortunately I 
> also don't have the time to try a lot, but I wanted to let you know anyway. 
>  
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> path = 'test.parquet'  # any writable path
> field0 = pa.field('field1', pa.int64(), metadata=dict(a="A", b="B"))
> field1 = pa.field('field2', pa.int64(), nullable=False)
> columns = [
> pa.column(field0, pa.array([1, 2])),
> pa.column(field1, pa.array([3, 4]))
> ]
> table = pa.Table.from_arrays(columns)
> pq.write_table(table, path)
> tab2 = pq.read_table(path)
> tab2.column(0).field.metadata
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-3750) [R] Pass various wrapped Arrow objects created in Python into R with zero copy via reticulate

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3750:

Fix Version/s: (was: 0.15.0)
   1.0.0

> [R] Pass various wrapped Arrow objects created in Python into R with zero 
> copy via reticulate
> -
>
> Key: ARROW-3750
> URL: https://issues.apache.org/jira/browse/ARROW-3750
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> A user may wish to use some functionality available only in pyarrow using 
> reticulate; it would be useful to be able to construct an R wrapper object to 
> the C++ object inside the corresponding Python type, e.g. {{pyarrow.Table}}. 
> This probably will require some new functions to return the memory address of 
> the shared_ptr/unique_ptr inside the Cython types so that a function on the R 
> side can copy the smart pointer and create the corresponding R wrapper type
> cc [~pitrou]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2041.
---
Resolution: Won't Fix

I looked at this. The breakdown of each serialized NumPy array is as follows:

* 400 bytes of data
* 176 bytes of metadata
* 64 bytes of padding

That is 640 bytes per array in total, so only about 62% of each serialized 
array is payload. There are a few things that can be done, and you're free to 
open some JIRA issues:

* Create more compact metadata for Tensors -- Protocol Buffers could be a good 
option that's smaller than the Flatbuffers table currently produced. 
* Reduce the padding requirement to 8 bytes instead of 64 bytes. 
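
A quick sketch to reproduce the measurement (the array count of 10,000 is my 
assumption for illustration; {{pa.serialize}} is the legacy API under 
discussion):

{code:python}
import pickle

import numpy as np
import pyarrow as pa

arrays = [np.arange(100, dtype=np.int32) for _ in range(10000)]
arrow_size = pa.serialize(arrays).to_buffer().size
pickle_size = len(pickle.dumps(arrays, pickle.HIGHEST_PROTOCOL))
# Roughly 640 bytes per array on the Arrow side vs ~420 with pickle,
# consistent with the 6.2 MB vs 4.2 MB observation in the report.
print(arrow_size, pickle_size)
{code}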

> [Python] pyarrow.serialize has high overhead for list of NumPy arrays
> -
>
> Key: ARROW-2041
> URL: https://issues.apache.org/jira/browse/ARROW-2041
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Richard Shin
>Priority: Minor
>  Labels: Performance
> Fix For: 0.15.0
>
>
> {{Python 2.7.12 (default, Nov 20 2017, 18:23:56)}}
> {{[GCC 5.4.0 20160609] on linux2}}
> {{Type "help", "copyright", "credits" or "license" for more information.}}
> {{>>> import pyarrow as pa, numpy as np}}
> {{>>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]}}
> {{>>> with open('test.pyarrow', 'w') as f:}}
> {{... f.write(pa.serialize(arrays).to_buffer().to_pybytes())}}
> {{...}}
> {{>>> import cPickle as pickle}}
> {{>>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)}}
> test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2006.
---
Resolution: Not A Problem

I took a look at this. The original issue report regarding the large size of 
{{pyarrow.serialize(1)}} has to do with the complexity of the record batch that 
is being internally serialized -- the metadata itself is somewhat large. It's 
not caused by excess buffer padding. To make this smaller, we'd have to 
redesign the serialization format.
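
A quick way to observe this (sketch using the legacy {{pa.serialize}} API; 
exact byte counts vary by version):

{code:python}
import pyarrow as pa

# Even a lone integer is wrapped in a record batch on serialization, so
# the metadata dwarfs the 8-byte payload.
buf = pa.serialize(1).to_buffer()
print(buf.size)  # on the order of a few hundred bytes
{code}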

> [C++] Add option to trim excess padding when writing IPC messages
> -
>
> Key: ARROW-2006
> URL: https://issues.apache.org/jira/browse/ARROW-2006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This will help with situations like 
> [https://github.com/apache/arrow/issues/1467] where we don't really need the 
> extra padding bytes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2006:
---

Assignee: Wes McKinney

> [C++] Add option to trim excess padding when writing IPC messages
> -
>
> Key: ARROW-2006
> URL: https://issues.apache.org/jira/browse/ARROW-2006
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This will help with situations like 
> [https://github.com/apache/arrow/issues/1467] where we don't really need the 
> extra padding bytes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-921) [Developer] Add scripts to facilitate integration testing between revisions of the codebase

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-921:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [Developer] Add scripts to facilitate integration testing between revisions 
> of the codebase
> ---
>
> Key: ARROW-921
> URL: https://issues.apache.org/jira/browse/ARROW-921
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently we are integration testing between implementations on the master 
> branch and in pull requests. It would be good to have the ability to 
> integration test between arbitrary code revisions (e.g. between master and 
> released versions on only the C++ or Java implementations). This will create 
> more transparency around whether or not changes break binary forward 
> compatibility. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6342) [Python] Add pyarrow.record_batch factory function with same basic API / semantics as pyarrow.table

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6342.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/05bc63c3b76a5fc865434f596c63ff0f26388f69

> [Python] Add pyarrow.record_batch factory function with same basic API / 
> semantics as pyarrow.table
> ---
>
> Key: ARROW-6342
> URL: https://issues.apache.org/jira/browse/ARROW-6342
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> I find writing {{pa.RecordBatch.from_arrays}} to be a bit tedious, especially 
> now that we have {{pa.table(data, names=names)}}. This would be a usability 
> improvement. 
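
For reference, the resulting factory mirrors {{pa.table}}; a minimal usage 
sketch against released pyarrow:

{code:python}
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2]), pa.array(["a", "b"])],
                        names=["x", "y"])
assert batch.schema.names == ["x", "y"]
{code}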



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6263) [Python] RecordBatch.from_arrays does not check array types against a passed schema

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6263.
-
Resolution: Fixed

Issue resolved by pull request 5189
[https://github.com/apache/arrow/pull/5189]

> [Python] RecordBatch.from_arrays does not check array types against a passed 
> schema
> ---
>
> Key: ARROW-6263
> URL: https://issues.apache.org/jira/browse/ARROW-6263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Example came from ARROW-6038
> {code}
> In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
>   
> Out[4]: 
> In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)   
>   
> In [6]: rb
>   
> Out[6]: 
> In [7]: rb.schema 
>   
> Out[7]: col: string
> In [8]: rb[0] 
>   
> Out[8]: 
> 
> 0 nulls
> {code}
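
A minimal sketch of the same pitfall (behavior before this fix, on the 
affected versions):

{code:python}
import pyarrow as pa

schema = pa.schema([pa.field("col", pa.string())])
arr = pa.array([])  # an empty array is inferred as null type, not string
# Before the fix this succeeded without checking types against the schema:
rb = pa.RecordBatch.from_arrays([arr], schema)
print(rb.schema)          # declares col: string
print(rb.column(0).type)  # but the column is null-typed
{code}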



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6376) [Developer] PR merge script has "master" target ref hard-coded

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6376.
-
Resolution: Fixed

Issue resolved by pull request 5220
[https://github.com/apache/arrow/pull/5220]

> [Developer] PR merge script has "master" target ref hard-coded
> --
>
> Key: ARROW-6376
> URL: https://issues.apache.org/jira/browse/ARROW-6376
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If the target ref of a PR is something other than master, we should merge PRs 
> into that branch



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-5522) [Packaging][Documentation] Comments out of date in python/manylinux1/build_arrow.sh

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5522.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5215
[https://github.com/apache/arrow/pull/5215]

> [Packaging][Documentation] Comments out of date in 
> python/manylinux1/build_arrow.sh
> ---
>
> Key: ARROW-5522
> URL: https://issues.apache.org/jira/browse/ARROW-5522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available, wheel
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The script has this comment:
> {code:java}
> # Usage:
> #   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
> {code}
> However, I get:
> {code}
> Unable to find image 'arrow-base-x86_64:latest' locally
> docker: Error response from daemon: pull access denied for arrow-base-x86_64, 
> repository does not exist or may require 'docker login'.
> See 'docker run --help'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4510) [Format] copy content from IPC.rst to new document.

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4510.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/67d46c7149115ea1ab094ab80f1e1ff4add48be9

> [Format] copy content from IPC.rst to new document.
> ---
>
> Key: ARROW-4510
> URL: https://issues.apache.org/jira/browse/ARROW-4510
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4509) [Format] Copy content from Metadata.rst to new document.

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4509.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/67d46c7149115ea1ab094ab80f1e1ff4add48be9

> [Format] Copy content from Metadata.rst to new document.
> 
>
> Key: ARROW-4509
> URL: https://issues.apache.org/jira/browse/ARROW-4509
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-1789) [Format] Consolidate specification documents and improve clarity for new implementation authors

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1789.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/67d46c7149115ea1ab094ab80f1e1ff4add48be9

> [Format] Consolidate specification documents and improve clarity for new 
> implementation authors
> ---
>
> Key: ARROW-1789
> URL: https://issues.apache.org/jira/browse/ARROW-1789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1296
> I believe the specification documents Layout.md, Metadata.md, and IPC.md 
> would benefit from being consolidated into a single Markdown document that 
> would be sufficient (along with the Flatbuffers schemas) to create a complete 
> Arrow implementation capable of reading and writing the binary format



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4507) [Format] Create outline and introduction for new document.

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4507.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/67d46c7149115ea1ab094ab80f1e1ff4add48be9.
 I think this could be improved further.

> [Format] Create outline and introduction for new document.
> --
>
> Key: ARROW-4507
> URL: https://issues.apache.org/jira/browse/ARROW-4507
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This will ensure the document has a good flow, other subtasks on the parent 
> will handle moving content from each of the documents.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4508) [Format] Copy content from Layout.rst to new document.

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4508.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/67d46c7149115ea1ab094ab80f1e1ff4add48be9

> [Format] Copy content from Layout.rst to new document.
> --
>
> Key: ARROW-4508
> URL: https://issues.apache.org/jira/browse/ARROW-4508
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-4511) [Format] remove individual documents in favor of new document once all content is moved

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4511.
-
Fix Version/s: (was: 1.0.0)
   0.15.0
   Resolution: Fixed

Issue resolved by pull request 5202
[https://github.com/apache/arrow/pull/5202]

> [Format] remove individual documents in favor of new document once all 
> content is moved
> ---
>
> Key: ARROW-4511
> URL: https://issues.apache.org/jira/browse/ARROW-4511
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>
> We might want to leave the documents in place and provide links to the new 
> consolidated document in case others are linking to published content.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6354) [C++] Building without Parquet fails

2019-08-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-6354.
---
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5218
[https://github.com/apache/arrow/pull/5218]

> [C++] Building without Parquet fails
> 
>
> Key: ARROW-6354
> URL: https://issues.apache.org/jira/browse/ARROW-6354
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Seems like this is a recent regression:
> {code}
> [214/300] Building CXX object 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o
> FAILED: 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> /usr/bin/ccache /usr/bin/g++-7  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_USE_SIMD 
> -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DAWS_COMMON_USE_IMPORT_EXPORT 
> -DAWS_EVENT_STREAM_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 
> -DAWS_SDK_VERSION_MINOR=7 -DAWS_SDK_VERSION_PATCH=160 -Isrc -I../src -isystem 
> /home/antoine/miniconda3/envs/pyarrow/include -isystem 
> double-conversion_ep/src/double-conversion_ep/include -isystem 
> ../thirdparty/hadoop/include -Wno-noexcept-type  -fdiagnostics-color=always 
> -ggdb -O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable 
> -Werror -msse4.2  -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 
> -fno-omit-frame-pointer -g -fPIE   -pthread -std=gnu++11 -MD -MT 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> -MF 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o.d
>  -o 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> -c ../src/arrow/dataset/dataset_test.cc
> In file included from ../src/parquet/arrow/writer.h:25:0,
>  from ../src/arrow/dataset/test_util.h:27,
>  from ../src/arrow/dataset/dataset_test.cc:20:
> ../src/parquet/properties.h:30:10: fatal error: parquet/parquet_version.h: 
> No such file or directory
>  #include "parquet/parquet_version.h"
>   ^~~
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5618) [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers integer overflow in some cases

2019-08-28 Thread TP Boudreau (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918032#comment-16918032
 ] 

TP Boudreau commented on ARROW-5618:


Ugh, sorry, forgot all about this PR and the comment from earlier this month 
slipped through the cracks.

I'll revisit this issue sometime this week.



> [C++] [Parquet] Using deprecated Int96 storage for timestamps triggers 
> integer overflow in some cases
> -
>
> Key: ARROW-5618
> URL: https://issues.apache.org/jira/browse/ARROW-5618
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: TP Boudreau
>Assignee: TP Boudreau
>Priority: Minor
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When storing Arrow timestamps in Parquet files using the Int96 storage 
> format, certain combinations of array lengths and validity bitmasks cause an 
> integer overflow error on read.  It's not immediately clear whether the 
> Arrow/Parquet writer is storing zeroes when it should be storing positive 
> values or the reader is attempting to calculate a nanoseconds value 
> inappropriately from zeroed inputs (perhaps missing the null bit flag).  Also 
> not immediately clear why only certain length columns seem to be affected.
> Probably the quickest way to reproduce this undefined behavior is to alter 
> the existing unit test UseDeprecatedInt96 (in file 
> .../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling 
> its column lengths (repeating the same values), followed by 'make unittest' 
> using clang-7 with sanitizers enabled.  (Here's a patch applicable to current 
> master that changes the test as described: [1]; I used the following cmake 
> command to build my environment: [2].)  You should get a log something like 
> [3].  If requested, I'll see if I can put together a stand-alone minimal test 
> case that induces the behavior.
> The quick-hack at [4] will prevent integer overflows, but this is only 
> included to confirm the proximate cause of the bug: the Julian days field of 
> the Int96 appears to be zero, when a strictly positive number is expected.
> I've assigned the issue to myself and I'll start looking into the root cause 
> of this.
> [1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
> [2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
> [3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
> [4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f
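
For background, a sketch of the Int96 layout involved (my own illustration, 
not code from the issue): 8 bytes of nanoseconds-within-day followed by a 
4-byte Julian day. With the Julian day zeroed, the epoch arithmetic lands near 
-2.1e20, far outside int64 range, which is consistent with the overflow the 
sanitizers report:

{code:python}
import struct

JULIAN_UNIX_EPOCH = 2440588     # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 10**9


def int96_to_unix_nanos(raw12: bytes) -> int:
    # Little-endian: int64 nanoseconds-of-day, then uint32 Julian day.
    nanos_of_day, julian_day = struct.unpack("<qI", raw12)
    # julian_day == 0 makes this about -2.1e20; fine for Python ints,
    # undefined behavior for the C++ int64 arithmetic.
    return (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
{code}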



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6380) Method pyarrow.parquet.read_table has memory spikes from version 0.14

2019-08-28 Thread Renan Alves Fonseca (Jira)
Renan Alves Fonseca created ARROW-6380:
--

 Summary: Method pyarrow.parquet.read_table has memory spikes from 
version 0.14
 Key: ARROW-6380
 URL: https://issues.apache.org/jira/browse/ARROW-6380
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1, 0.14.0
 Environment: ubuntu 18, 16GB ram, 4 cpus
Reporter: Renan Alves Fonseca
 Fix For: 0.13.0


Method pyarrow.parquet.read_table is very slow and causes RAM spikes from 
version 0.14.0 onward.

Reading a 40 MB parquet file takes less than 1 second in versions 0.11, 0.12 
and 0.13, whereas it takes from 6 to 30 seconds in versions 0.14.x.

This performance impact is easily measured. However, there is another problem 
that I could only detect on the htop screen. While opening a 40 MB parquet 
file, the process occupies almost 16 GB for some milliseconds. The resulting 
pyarrow table takes around 300 MB in the Python process (measured using 
memory-profiler). This does not happen in version 0.13 or earlier.
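
A minimal way to observe the reported spike with memory-profiler (sketch; the 
file path is a placeholder):

{code:python}
from memory_profiler import memory_usage
import pyarrow.parquet as pq

# Sample RSS every 10 ms while read_table runs; the max exposes the
# transient allocation spike.
samples = memory_usage((pq.read_table, ("data.parquet",)), interval=0.01)
print("peak RSS: %.0f MiB" % max(samples))
{code}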



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-08-28 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917991#comment-16917991
 ] 

Francois Saint-Jacques commented on ARROW-6378:
---

We're thinking of yanking PartitionDataSource and making `GetKey()/SetKey()` 
part of the DataSource interface directly. See 
[https://github.com/apache/arrow/pull/5221]

> [C++][Dataset] Implement TreeDataSource
> ---
>
> Key: ARROW-6378
> URL: https://issues.apache.org/jira/browse/ARROW-6378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6379) [C++] Do not append any buffers when serializing NullType for IPC

2019-08-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6379:
---

 Summary: [C++] Do not append any buffers when serializing NullType 
for IPC
 Key: ARROW-6379
 URL: https://issues.apache.org/jira/browse/ARROW-6379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


Currently we send a length-0 buffer. It would be better to not include any 
buffers. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6244) [C++] Implement Partition DataSource

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6244:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++] Implement Partition DataSource
> 
>
> Key: ARROW-6244
> URL: https://issues.apache.org/jira/browse/ARROW-6244
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
>
> This is a DataSource that also has partition metadata. The end goal is to 
> support filtering with a DataSelector/Filter expression. The initial 
> implementation should not deal with PartitionScheme yet.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments

2019-08-28 Thread Omer Ozarslan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917965#comment-16917965
 ] 

Omer Ozarslan commented on ARROW-6371:
--

Thanks. I replied to this thread over email yesterday, but I guess the response 
didn't get through for some reason.

I submitted the PR.

> [Doc] Row to columnar conversion example mentions arrow::Column in comments
> ---
>
> Key: ARROW-6371
> URL: https://issues.apache.org/jira/browse/ARROW-6371
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> {code:cpp}
> // The final representation should be an `arrow::Table` which in turn is made 
> up of
> // an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is 
> again a
> // named collection of one or more `arrow::Array` instances. As the first 
> step, we
> // will iterate over the data and build up the arrays incrementally.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages

2019-08-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-5101:
-

Assignee: Krisztian Szucs

> [Packaging] Avoid bundling static libraries in Windows conda packages
> -
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Packaging
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: conda
> Fix For: 0.14.1
>
>
> We're currently bundling static libraries in Windows conda packages. 
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine  4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine  4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine 76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine 32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine  4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine 491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine  4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are 
> reasonably small.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages

2019-08-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5101:
--
Fix Version/s: (was: 0.15.0)
   0.14.1

> [Packaging] Avoid bundling static libraries in Windows conda packages
> -
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Packaging
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: conda
> Fix For: 0.14.1
>
>
> We're currently bundling static libraries in Windows conda packages. 
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine  4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine  4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine 76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine 32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine  4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine 491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine  4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are 
> reasonably small.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages

2019-08-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5101.
---
Resolution: Fixed

Fixed by https://github.com/conda-forge/arrow-cpp-feedstock/pull/79

> [Packaging] Avoid bundling static libraries in Windows conda packages
> -
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Packaging
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: conda
> Fix For: 0.14.1
>
>
> We're currently bundling static libraries in Windows conda packages. 
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine  4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine  4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine 76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine 32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine  4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine 491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine  4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are 
> reasonably small.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5101) [Packaging] Avoid bundling static libraries in Windows conda packages

2019-08-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917950#comment-16917950
 ] 

Antoine Pitrou commented on ARROW-5101:
---

I confirm that the `arrow-cpp-0.14.1` size is very reasonable on Windows (10.6 
MB download, ~34 MB on disk).

> [Packaging] Avoid bundling static libraries in Windows conda packages
> -
>
> Key: ARROW-5101
> URL: https://issues.apache.org/jira/browse/ARROW-5101
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Packaging
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: conda
> Fix For: 0.15.0
>
>
> We're currently bundling static libraries in Windows conda packages. 
> Unfortunately, it causes these to be quite large:
> {code:bash}
> $ ls -la ./Library/lib
> total 507808
> drwxrwxr-x 4 antoine antoine  4096 avril  3 10:28 .
> drwxrwxr-x 5 antoine antoine  4096 avril  3 10:28 ..
> -rw-rw-r-- 1 antoine antoine   1507048 avril  1 20:58 arrow.lib
> -rw-rw-r-- 1 antoine antoine 76184 avril  1 20:59 arrow_python.lib
> -rw-rw-r-- 1 antoine antoine  61323846 avril  1 21:00 arrow_python_static.lib
> -rw-rw-r-- 1 antoine antoine 32809 avril  1 21:02 arrow_static.lib
> drwxrwxr-x 3 antoine antoine  4096 avril  3 10:28 cmake
> -rw-rw-r-- 1 antoine antoine 491292 avril  1 21:02 parquet.lib
> -rw-rw-r-- 1 antoine antoine 128473780 avril  1 21:03 parquet_static.lib
> drwxrwxr-x 2 antoine antoine  4096 avril  3 10:27 pkgconfig
> {code}
> (see files in https://anaconda.org/conda-forge/arrow-cpp/files )
> We should probably only ship dynamic libraries under Windows, as those are 
> reasonably small.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-08-28 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-6378:
--
Component/s: C++
 Labels: dataset  (was: )

> [C++][Dataset] Implement TreeDataSource
> ---
>
> Key: ARROW-6378
> URL: https://issues.apache.org/jira/browse/ARROW-6378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-08-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917944#comment-16917944
 ] 

Wes McKinney commented on ARROW-6378:
-

Should this just be {{PartitionDataSource}}?

> [C++][Dataset] Implement TreeDataSource
> ---
>
> Key: ARROW-6378
> URL: https://issues.apache.org/jira/browse/ARROW-6378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-08-28 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques reassigned ARROW-6378:
-

Assignee: Francois Saint-Jacques

> [C++][Dataset] Implement TreeDataSource
> ---
>
> Key: ARROW-6378
> URL: https://issues.apache.org/jira/browse/ARROW-6378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset
>
> The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-08-28 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6378:
-

 Summary: [C++][Dataset] Implement TreeDataSource
 Key: ARROW-6378
 URL: https://issues.apache.org/jira/browse/ARROW-6378
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

2019-08-28 Thread Omer Ozarslan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917936#comment-16917936
 ] 

Omer Ozarslan commented on ARROW-6377:
--

On a side note, this _might_ have a better performance due to use of compile 
time knowledge, but it eventually comes down to benchmark.

> [C++] Extending STL API to support row-wise conversion
> --
>
> Key: ARROW-6377
> URL: https://issues.apache.org/jira/browse/ARROW-6377
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> Using array builders is currently the recommended way in the documentation 
> for converting row-wise data to Arrow tables. However, array builders have a 
> low-level interface in order to support various use cases in the library. 
> They require additional boilerplate due to type erasure, although some of 
> this boilerplate could be avoided at compile time if the schema is already 
> known and fixed (also discussed in ARROW-4067).
> In another part of the library, the STL API provides a nice abstraction over 
> builders by inferring data types and builders from the values provided, 
> reducing the boilerplate significantly. It currently handles automatic 
> conversion of tuples with a limited set of native types: numeric types, 
> string and vector (+ nullable variations of these in case ARROW-6326 is 
> merged). It also allows passing references in tuple values (implemented 
> recently in ARROW-6284).
> As a more concrete example, this is the code that can be used to convert the 
> {{row_data}} provided in the examples:
>   
> {code:cpp}
> arrow::Status VectorToColumnarTableSTL(const std::vector<data_row>& rows,
>                                        std::shared_ptr<arrow::Table>* table) {
>   auto rng = rows | ranges::views::transform([](const data_row& row) {
>     return std::tuple<int64_t, double, const std::vector<double>&>(
>         row.id, row.cost, row.cost_components);
>   });
>   return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>                                          {"id", "cost", "cost_components"},
>                                          table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using 
> builders directly.
> There is no direct support by the library for other types (binary, struct, 
> union etc., or converting iterable objects other than vectors to lists). 
> Users are given a way to specialize for their own data structures. One 
> limitation of implicit inference is that it is hard (or even impossible) to 
> infer the exact type to use in some cases. For example, should a 
> {{std::string_view}} value be inferred as string, binary, large binary or 
> list? This ambiguity can be avoided by providing some way for the user to 
> explicitly state the correct type for storing a column. For example, a user 
> can return a so-called {{BinaryCell}} class to return binary values.
> Proposed changes:
>  * Implementing cell "adapters": cells are non-owning references for each 
> type. It is the user's responsibility to keep the pointed-to values alive. 
> (Can scalars be used in this context?)
>  ** BinaryCell
>  ** StringCell
>  ** ListCell (for adapting any Range)
>  ** StructCell
>  ** ...
>  * Primitive types don't need such adapters since their values are trivial to 
> cast (e.g. just use int8_t(value) to use Int8Type).
>  * Adding benchmarks comparing with builder performance. There is likely to 
> be some performance penalty due to hindering compiler optimizations. Yet, 
> this is acceptable in exchange for more concise code, IMHO. For fine-grained 
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If the 
> above changes sound reasonable, I will go ahead and start implementing the 
> other cells and submit them.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

2019-08-28 Thread Omer Ozarslan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917936#comment-16917936
 ] 

Omer Ozarslan edited comment on ARROW-6377 at 8/28/19 5:08 PM:
---

On a side note, this _might_ have better performance due to the use of 
compile-time knowledge, but it eventually comes down to benchmarking.


was (Author: ozars):
On a side note, this _might_ have better performance due to the use of 
compile-time knowledge, but it eventually comes down to benchmark.

> [C++] Extending STL API to support row-wise conversion
> --
>
> Key: ARROW-6377
> URL: https://issues.apache.org/jira/browse/ARROW-6377
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> Using array builders is currently the recommended way in the documentation for 
> converting row-wise data to Arrow tables. However, array builders have a 
> low-level interface designed to support various use cases in the library. They 
> require additional boilerplate due to type erasure, although some of this 
> boilerplate could be avoided at compile time if the schema is already known 
> and fixed (also discussed in ARROW-4067).
> In another part of the library, the STL API provides a nice abstraction over 
> builders by inferring data types and builders from the values provided, 
> reducing the boilerplate significantly. It currently handles automatic 
> conversion of tuples with a limited set of native types: numeric types, string 
> and vector (+ nullable variations of these in case ARROW-6326 is merged). It 
> also allows passing references in tuple values (implemented recently in 
> ARROW-6284).
> As a more concrete example, this is the code that can be used to convert the 
> {{row_data}} provided in the examples:
> {code:cpp}
> arrow::Status VectorToColumnarTableSTL(const std::vector<data_row>& rows,
>                                        std::shared_ptr<arrow::Table>* table) {
>   auto rng = rows | ranges::views::transform([](const data_row& row) {
>     return std::tuple<int64_t, double, const std::vector<double>&>(
>         row.id, row.cost, row.cost_components);
>   });
>   return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
>                                          {"id", "cost", "cost_components"},
>                                          table);
> }
> {code}
> So, it allows more concise code for consumers of the API compared to using 
> builders directly.
> There is no direct support by the library for other types (binary, struct, 
> union etc., or converting iterable objects other than vectors to lists). 
> Users are given a way to specialize for their own data structures. One 
> limitation of implicit inference is that it is hard (or even impossible) to 
> infer the exact type to use in some cases. For example, should a 
> {{std::string_view}} value be inferred as string, binary, large binary or 
> list? This ambiguity can be avoided by providing some way for the user to 
> explicitly state the correct type for storing a column. For example, a user 
> can return a so-called {{BinaryCell}} class to return binary values.
> Proposed changes:
>  * Implementing cell "adapters": cells are non-owning references for each 
> type. It is the user's responsibility to keep the pointed-to values alive. 
> (Can scalars be used in this context?)
>  ** BinaryCell
>  ** StringCell
>  ** ListCell (for adapting any Range)
>  ** StructCell
>  ** ...
>  * Primitive types don't need such adapters since their values are trivial to 
> cast (e.g. just use int8_t(value) to use Int8Type).
>  * Adding benchmarks comparing with builder performance. There is likely to 
> be some performance penalty due to hindering compiler optimizations. Yet, 
> this is acceptable in exchange for more concise code, IMHO. For fine-grained 
> control over performance, it will still be possible to use builders directly.
> I have implemented something similar to BinaryCell for my use case. If the 
> above changes sound reasonable, I will go ahead and start implementing the 
> other cells and submit them.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6376) [Developer] PR merge script has "master" target ref hard-coded

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6376:
--
Labels: pull-request-available  (was: )

> [Developer] PR merge script has "master" target ref hard-coded
> --
>
> Key: ARROW-6376
> URL: https://issues.apache.org/jira/browse/ARROW-6376
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> If the target ref of a PR is something other than master, we should merge PRs 
> into that branch



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6377) [C++] Extending STL API to support row-wise conversion

2019-08-28 Thread Omer Ozarslan (Jira)
Omer Ozarslan created ARROW-6377:


 Summary: [C++] Extending STL API to support row-wise conversion
 Key: ARROW-6377
 URL: https://issues.apache.org/jira/browse/ARROW-6377
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Omer Ozarslan


Using array builders is currently the recommended way in the documentation for 
converting row-wise data to Arrow tables. However, array builders have a 
low-level interface designed to support various use cases in the library. They 
require additional boilerplate due to type erasure, although some of this 
boilerplate could be avoided at compile time if the schema is already known and 
fixed (also discussed in ARROW-4067).

In another part of the library, the STL API provides a nice abstraction over 
builders by inferring data types and builders from the values provided, 
reducing the boilerplate significantly. It currently handles automatic 
conversion of tuples with a limited set of native types: numeric types, string 
and vector (+ nullable variations of these in case ARROW-6326 is merged). It 
also allows passing references in tuple values (implemented recently in 
ARROW-6284).

As a more concrete example, this is the code that can be used to convert the 
{{row_data}} provided in the examples:
{code:cpp}
arrow::Status VectorToColumnarTableSTL(const std::vector<data_row>& rows,
                                       std::shared_ptr<arrow::Table>* table) {
  auto rng = rows | ranges::views::transform([](const data_row& row) {
    return std::tuple<int64_t, double, const std::vector<double>&>(
        row.id, row.cost, row.cost_components);
  });
  return arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rng,
                                         {"id", "cost", "cost_components"},
                                         table);
}
{code}
So, it allows more concise code for consumers of the API compared to using 
builders directly.

There is no direct support by the library for other types (binary, struct, 
union etc., or converting iterable objects other than vectors to lists). Users 
are given a way to specialize for their own data structures. One limitation of 
implicit inference is that it is hard (or even impossible) to infer the exact 
type to use in some cases. For example, should a {{std::string_view}} value be 
inferred as string, binary, large binary or list? This ambiguity can be avoided 
by providing some way for the user to explicitly state the correct type for 
storing a column. For example, a user can return a so-called {{BinaryCell}} 
class to return binary values.

Proposed changes:
 * Implementing cell "adapters": cells are non-owning references for each type. 
It is the user's responsibility to keep the pointed-to values alive. (Can 
scalars be used in this context?)
 ** BinaryCell
 ** StringCell
 ** ListCell (for adapting any Range)
 ** StructCell
 ** ...
 * Primitive types don't need such adapters since their values are trivial to 
cast (e.g. just use int8_t(value) to use Int8Type).
 * Adding benchmarks comparing with builder performance. There is likely to be 
some performance penalty due to hindering compiler optimizations. Yet, this is 
acceptable in exchange for more concise code, IMHO. For fine-grained control 
over performance, it will still be possible to use builders directly.

I have implemented something similar to BinaryCell for my use case. If the 
above changes sound reasonable, I will go ahead and start implementing the 
other cells and submit them. (A minimal sketch of one such adapter is given 
below.)
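
For illustration only, here is a minimal sketch of what a non-owning 
{{BinaryCell}} adapter and its trait specialization could look like. The 
standalone {{ConversionTraits}} primary template below is a hypothetical 
stand-in for the one in {{arrow/stl.h}}, whose exact interface may differ; this 
is a sketch of the idea, not the actual Arrow API:

{code:cpp}
#include <cstdint>

#include "arrow/builder.h"
#include "arrow/status.h"

// Hypothetical stand-in for the primary template in arrow/stl.h.
template <typename T>
struct ConversionTraits;

// Non-owning view of a binary value: the caller must keep the pointed-to
// bytes alive until the table has been built.
struct BinaryCell {
  const uint8_t* data;
  int32_t length;
};

// Sketch of a specialization mapping BinaryCell to a binary column.
template <>
struct ConversionTraits<BinaryCell> {
  using BuilderType = arrow::BinaryBuilder;

  static arrow::Status AppendRow(arrow::BinaryBuilder& builder,
                                 const BinaryCell& cell) {
    // Appends one binary value; a null could be signalled by a separate
    // sentinel (e.g. data == nullptr) calling builder.AppendNull() instead.
    return builder.Append(cell.data, cell.length);
  }
};
{code}

A row tuple would then carry a {{BinaryCell}} in the position of the binary 
column, which resolves the string/binary/large-binary ambiguity explicitly 
rather than by inference.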

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6314) [C++] Implement changes to ensure flatbuffer alignment.

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6314.
-
Resolution: Fixed

Issue resolved by pull request 5211
[https://github.com/apache/arrow/pull/5211]

> [C++] Implement changes to ensure flatbuffer alignment.
> ---
>
> Key: ARROW-6314
> URL: https://issues.apache.org/jira/browse/ARROW-6314
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6376) [Developer] PR merge script has "master" target ref hard-coded

2019-08-28 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6376:
---

 Summary: [Developer] PR merge script has "master" target ref 
hard-coded
 Key: ARROW-6376
 URL: https://issues.apache.org/jira/browse/ARROW-6376
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Wes McKinney
 Fix For: 0.15.0


If the target ref of a PR is something other than master, we should merge PRs 
into that branch



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6376) [Developer] PR merge script has "master" target ref hard-coded

2019-08-28 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6376:
---

Assignee: Wes McKinney

> [Developer] PR merge script has "master" target ref hard-coded
> --
>
> Key: ARROW-6376
> URL: https://issues.apache.org/jira/browse/ARROW-6376
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> If the target ref of a PR is something other than master, we should merge PRs 
> into that branch



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-28 Thread Sascha Hofmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917896#comment-16917896
 ] 

Sascha Hofmann commented on ARROW-6370:
---

Indeed! Converting it to int32() in python solved the issue of added 0s.

> [JS] Table.from adds 0 on int columns
> -
>
> Key: ARROW-6370
> URL: https://issues.apache.org/jira/browse/ARROW-6370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Priority: Major
>
> I am generating an arrow table in pyarrow and sending it via gRPC like this:
> {code:java}
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> writer.close()
> yield ds.Response(
> status=200,
> loading=False,
> response=[sink.getvalue().to_pybytes()]   
> )
> {code}
> On the JavaScript end, I parse it like this:
> {code:java}
>  Table.from(response.getResponseList()[0])
> {code}
> That works, but when I look at the actual table, the int columns have a 0 for 
> every other row. String columns seem to be parsed just fine. 
> The Python byte array created from to_pybytes() has the same length as the 
> one received in JavaScript. I am also able to recreate the original table 
> from the byte array in Python. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Omer Ozarslan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917884#comment-16917884
 ] 

Omer Ozarslan commented on ARROW-6375:
--

[~pitrou] Sure, I will. I'm also opening another issue about extending the STL 
API for row-wise conversion in general.

> [C++] Extend ConversionTraits to allow efficiently appending list values in 
> STL API
> ---
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> I was trying to benchmark the performance of using array builders vs. the STL 
> API for converting some row data to Arrow tables. I realized it is around 
> 1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
> doing so with the builder API. It appears this is primarily due to appending 
> rows via the {{...::Append}} method by iterating over 
> {{ConversionTraits<T>::AppendRow}} for each value.
> Calling {{...::AppendValues}} would make it more efficient; however, 
> {{ConversionTraits}} doesn't offer a way to append more than one cell 
> ({{AppendRow}} takes a builder and a single cell as its parameters).
> Would it be possible to extend the conversion traits with an optional method 
> {{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
> to efficiently append multiple cells at once? In the example above this 
> function would be called with {{std::vector<T>::data()}} and 
> {{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
> specialization, the current behavior (i.e. iterating over {{AppendRow}}) can 
> be used as the default.
> [This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
>  is the particular part of the code that would be replaced in practice. 
> Instead of directly calling AppendRow in a for loop, a public helper function 
> (e.g. {{stl::AppendRows}}) could be provided which implements the above logic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917871#comment-16917871
 ] 

Antoine Pitrou commented on ARROW-6375:
---

[~ozars] What you're proposing sounds OK in principle. Would you like to try 
submitting a PR?

> [C++] Extend ConversionTraits to allow efficiently appending list values in 
> STL API
> ---
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> I was trying to benchmark the performance of using array builders vs. the STL 
> API for converting some row data to Arrow tables. I realized it is around 
> 1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
> doing so with the builder API. It appears this is primarily due to appending 
> rows via the {{...::Append}} method by iterating over 
> {{ConversionTraits<T>::AppendRow}} for each value.
> Calling {{...::AppendValues}} would make it more efficient; however, 
> {{ConversionTraits}} doesn't offer a way to append more than one cell 
> ({{AppendRow}} takes a builder and a single cell as its parameters).
> Would it be possible to extend the conversion traits with an optional method 
> {{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
> to efficiently append multiple cells at once? In the example above this 
> function would be called with {{std::vector<T>::data()}} and 
> {{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
> specialization, the current behavior (i.e. iterating over {{AppendRow}}) can 
> be used as the default.
> [This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
>  is the particular part of the code that would be replaced in practice. 
> Instead of directly calling AppendRow in a for loop, a public helper function 
> (e.g. {{stl::AppendRows}}) could be provided which implements the above logic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-28 Thread Sascha Hofmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917864#comment-16917864
 ] 

Sascha Hofmann commented on ARROW-6370:
---

Cool, thank you! Using the latest, so v10.16.3.

> [JS] Table.from adds 0 on int columns
> -
>
> Key: ARROW-6370
> URL: https://issues.apache.org/jira/browse/ARROW-6370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Priority: Major
>
> I am generating an arrow table in pyarrow and sending it via gRPC like this:
> {code:java}
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> writer.close()
> yield ds.Response(
> status=200,
> loading=False,
> response=[sink.getvalue().to_pybytes()]   
> )
> {code}
> On the JavaScript end, I parse it like this:
> {code:java}
>  Table.from(response.getResponseList()[0])
> {code}
> That works, but when I look at the actual table, the int columns have a 0 for 
> every other row. String columns seem to be parsed just fine. 
> The Python byte array created from to_pybytes() has the same length as the 
> one received in JavaScript. I am also able to recreate the original table 
> from the byte array in Python. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6371) [Doc] Row to columnar conversion example mentions arrow::Column in comments

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6371:
--
Labels: pull-request-available  (was: )

> [Doc] Row to columnar conversion example mentions arrow::Column in comments
> ---
>
> Key: ARROW-6371
> URL: https://issues.apache.org/jira/browse/ARROW-6371
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Omer Ozarslan
>Priority: Minor
>  Labels: pull-request-available
>
> https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
> {code:cpp}
> // The final representation should be an `arrow::Table` which in turn is made 
> up of
> // an `arrow::Schema` and a list of `arrow::Column`. An `arrow::Column` is 
> again a
> // named collection of one or more `arrow::Array` instances. As the first 
> step, we
> // will iterate over the data and build up the arrays incrementally.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-28 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917857#comment-16917857
 ] 

Brian Hulette commented on ARROW-6370:
--

Yeah I suspect converting to int32 will solve your problem. But this is still a 
bug so I'll see if I can reproduce it :)
What version of node are you using?

> [JS] Table.from adds 0 on int columns
> -
>
> Key: ARROW-6370
> URL: https://issues.apache.org/jira/browse/ARROW-6370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Priority: Major
>
> I am generating an arrow table in pyarrow and sending it via gRPC like this:
> {code:java}
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> writer.close()
> yield ds.Response(
> status=200,
> loading=False,
> response=[sink.getvalue().to_pybytes()]   
> )
> {code}
> On the JavaScript end, I parse it like this:
> {code:java}
>  Table.from(response.getResponseList()[0])
> {code}
> That works, but when I look at the actual table, the int columns have a 0 for 
> every other row. String columns seem to be parsed just fine. 
> The Python byte array created from to_pybytes() has the same length as the 
> one received in JavaScript. I am also able to recreate the original table 
> from the byte array in Python. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Omer Ozarslan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6375:
-
Description: 
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
doing so with the builder API. It appears this is primarily due to appending 
rows via the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional method 
{{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
to efficiently append multiple cells at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.

  was:
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
doing so with the builder API. It appears this is primarily due to appending 
rows via the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional method 
{{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.


> [C++] Extend ConversionTraits to allow efficiently appending list values in 
> STL API
> ---
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> I was trying to benchmark the performance of using array builders vs. the STL 
> API for converting some row data to Arrow tables. I realized it is around 
> 1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
> doing so with the builder API. It appears this is primarily due to appending 
> rows via the {{...::Append}} method by iterating over 
> {{ConversionTraits<T>::AppendRow}} for each value.
> Calling {{...::AppendValues}} would make it more efficient; however, 
> {{ConversionTraits}} doesn't offer a way to append more than one cell 
> ({{AppendRow}} takes a builder and a single cell as its parameters).
> Would it be possible to extend the conversion traits with an optional method 
> {{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
> to efficiently append multiple cells at once? In the example above this 
> function would be called with {{std::vector<T>::data()}} and 
> {{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
> specialization, the current behavior (i.e. iterating over {{AppendRow}}) can 
> be used as the default.
> [This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
>  is the particular part of the code that would be replaced in practice. 
> Instead of directly calling AppendRow in a for loop, a public helper function 
> (e.g. {{stl::AppendRows}}) could be provided which implements the above logic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Omer Ozarslan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6375:
-
Description: 
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
doing so with the builder API. It appears this is primarily due to appending 
rows via the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional method 
{{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.

  was:
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
doing so with the builder API. It appears this is primarily due to appending 
rows via the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional metho\{{d 
}}{{AppendRows(Builder, Cell*, size_t)}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.


> [C++] Extend ConversionTraits to allow efficiently appending list values in 
> STL API
> ---
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> I was trying to benchmark the performance of using array builders vs. the STL 
> API for converting some row data to Arrow tables. I realized it is around 
> 1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
> doing so with the builder API. It appears this is primarily due to appending 
> rows via the {{...::Append}} method by iterating over 
> {{ConversionTraits<T>::AppendRow}} for each value.
> Calling {{...::AppendValues}} would make it more efficient; however, 
> {{ConversionTraits}} doesn't offer a way to append more than one cell 
> ({{AppendRow}} takes a builder and a single cell as its parameters).
> Would it be possible to extend the conversion traits with an optional method 
> {{AppendRows(Builder, Cell*, size_t),}} which allows a template specialization 
> to efficiently append multiple values at once? In the example above this 
> function would be called with {{std::vector<T>::data()}} and 
> {{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
> specialization, the current behavior (i.e. iterating over {{AppendRow}}) can 
> be used as the default.
> [This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
>  is the particular part of the code that would be replaced in practice. 
> Instead of directly calling AppendRow in a for loop, a public helper function 
> (e.g. {{stl::AppendRows}}) could be provided which implements the above logic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Omer Ozarslan (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omer Ozarslan updated ARROW-6375:
-
Description: 
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
doing so with the builder API. It appears this is primarily due to appending 
rows via the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional metho\{{d 
}}{{AppendRows(Builder, Cell*, size_t)}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.

  was:
I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
with the builder API. It appears this is primarily due to appending rows via 
the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional metho{{d 
}}{{AppendRows(Builder, Cell*, size_t)}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic.


> [C++] Extend ConversionTraits to allow efficiently appending list values in 
> STL API
> ---
>
> Key: ARROW-6375
> URL: https://issues.apache.org/jira/browse/ARROW-6375
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Omer Ozarslan
>Priority: Major
>
> I was trying to benchmark the performance of using array builders vs. the STL 
> API for converting some row data to Arrow tables. I realized it is around 
> 1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
> doing so with the builder API. It appears this is primarily due to appending 
> rows via the {{...::Append}} method by iterating over 
> {{ConversionTraits<T>::AppendRow}} for each value.
> Calling {{...::AppendValues}} would make it more efficient; however, 
> {{ConversionTraits}} doesn't offer a way to append more than one cell 
> ({{AppendRow}} takes a builder and a single cell as its parameters).
> Would it be possible to extend the conversion traits with an optional metho\{{d 
> }}{{AppendRows(Builder, Cell*, size_t)}} which allows a template specialization 
> to efficiently append multiple values at once? In the example above this 
> function would be called with {{std::vector<T>::data()}} and 
> {{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
> specialization, the current behavior (i.e. iterating over {{AppendRow}}) can 
> be used as the default.
> [This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
>  is the particular part of the code that would be replaced in practice. 
> Instead of directly calling AppendRow in a for loop, a public helper function 
> (e.g. {{stl::AppendRows}}) could be provided which implements the above logic.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6375) [C++] Extend ConversionTraits to allow efficiently appending list values in STL API

2019-08-28 Thread Omer Ozarslan (Jira)
Omer Ozarslan created ARROW-6375:


 Summary: [C++] Extend ConversionTraits to allow efficiently 
appending list values in STL API
 Key: ARROW-6375
 URL: https://issues.apache.org/jira/browse/ARROW-6375
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Omer Ozarslan


I was trying to benchmark the performance of using array builders vs. the STL 
API for converting some row data to Arrow tables. I realized it is around 
1.5-1.8 times slower to convert {{std::vector}} values with the STL API than 
with the builder API. It appears this is primarily due to appending rows via 
the {{...::Append}} method by iterating over 
{{ConversionTraits<T>::AppendRow}} for each value.

Calling {{...::AppendValues}} would make it more efficient; however, 
{{ConversionTraits}} doesn't offer a way to append more than one cell 
({{AppendRow}} takes a builder and a single cell as its parameters).

Would it be possible to extend the conversion traits with an optional metho{{d 
}}{{AppendRows(Builder, Cell*, size_t)}} which allows a template specialization 
to efficiently append multiple values at once? In the example above this 
function would be called with {{std::vector<T>::data()}} and 
{{std::vector<T>::size()}} if provided. If such a method isn't provided by the 
specialization, the current behavior (i.e. iterating over {{AppendRow}}) can be 
used as the default.

[This|https://github.com/apache/arrow/blob/e29732be86958e563801c55d3fcd8dc3fe4e9801/cpp/src/arrow/stl.h#L97-L100]
 is the particular part of the code that would be replaced in practice. Instead 
of directly calling AppendRow in a for loop, a public helper function (e.g. 
{{stl::AppendRows}}) could be provided which implements the above logic. (A 
minimal sketch of this fallback logic follows below.)
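
For illustration only, a minimal sketch of how such an optional bulk-append 
hook could be detected and dispatched (assuming C++17; {{HasAppendRows}} and 
the {{Traits}}/{{Builder}}/{{Cell}} parameters are hypothetical names, not the 
actual Arrow API):

{code:cpp}
#include <cstddef>
#include <type_traits>
#include <utility>

#include "arrow/status.h"

// Detection idiom: does this Traits specialization provide a bulk
// AppendRows(builder, cells, n) overload?
template <typename Traits, typename Builder, typename Cell, typename = void>
struct HasAppendRows : std::false_type {};

template <typename Traits, typename Builder, typename Cell>
struct HasAppendRows<
    Traits, Builder, Cell,
    std::void_t<decltype(Traits::AppendRows(std::declval<Builder&>(),
                                            std::declval<const Cell*>(),
                                            std::size_t{}))>>
    : std::true_type {};

// Public helper: take the efficient bulk path when the specialization
// provides one, otherwise fall back to appending one cell at a time.
template <typename Traits, typename Builder, typename Cell>
arrow::Status AppendRows(Builder& builder, const Cell* cells, std::size_t n) {
  if constexpr (HasAppendRows<Traits, Builder, Cell>::value) {
    return Traits::AppendRows(builder, cells, n);
  } else {
    for (std::size_t i = 0; i < n; ++i) {
      ARROW_RETURN_NOT_OK(Traits::AppendRow(builder, cells[i]));
    }
    return arrow::Status::OK();
  }
}
{code}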



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-28 Thread Sascha Hofmann (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917852#comment-16917852
 ] 

Sascha Hofmann commented on ARROW-6370:
---

Yes, the column is int64.

For creating the RecordBatch in python: I am reading a pyarrow table from a 
parquet file, which itself was created from a csv. I tested this on different 
CSVs with the same behaviour. 

I assume the above issue is causing our problem. We are using Arrow in an 
Electron app (so Node.js) with a Python backend server. The bytes are sent via 
gRPC.

 

I will try to convert the int columns to int32 in python and see what's 
happening.

 

> [JS] Table.from adds 0 on int columns
> -
>
> Key: ARROW-6370
> URL: https://issues.apache.org/jira/browse/ARROW-6370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Priority: Major
>
> I am generating an arrow table in pyarrow and sending it via gRPC like this:
> {code:java}
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> writer.close()
> yield ds.Response(
> status=200,
> loading=False,
> response=[sink.getvalue().to_pybytes()]   
> )
> {code}
> On the JavaScript end, I parse it like this:
> {code:java}
>  Table.from(response.getResponseList()[0])
> {code}
> That works, but when I look at the actual table, the int columns have a 0 for 
> every other row. String columns seem to be parsed just fine. 
> The Python byte array created from to_pybytes() has the same length as the 
> one received in JavaScript. I am also able to recreate the original table 
> from the byte array in Python. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6354) [C++] Building without Parquet fails

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6354:
--
Labels: pull-request-available  (was: )

> [C++] Building without Parquet fails
> 
>
> Key: ARROW-6354
> URL: https://issues.apache.org/jira/browse/ARROW-6354
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>
> Seems like this is a recent regression:
> {code}
> [214/300] Building CXX object 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o
> FAILED: 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> /usr/bin/ccache /usr/bin/g++-7  -DARROW_EXTRA_ERROR_CONTEXT -DARROW_USE_SIMD 
> -DARROW_WITH_BROTLI -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_SNAPPY 
> -DARROW_WITH_ZLIB -DARROW_WITH_ZSTD -DAWS_COMMON_USE_IMPORT_EXPORT 
> -DAWS_EVENT_STREAM_USE_IMPORT_EXPORT -DAWS_SDK_VERSION_MAJOR=1 
> -DAWS_SDK_VERSION_MINOR=7 -DAWS_SDK_VERSION_PATCH=160 -Isrc -I../src -isystem 
> /home/antoine/miniconda3/envs/pyarrow/include -isystem 
> double-conversion_ep/src/double-conversion_ep/include -isystem 
> ../thirdparty/hadoop/include -Wno-noexcept-type  -fdiagnostics-color=always 
> -ggdb -O0  -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable 
> -Werror -msse4.2  -D_GLIBCXX_USE_CXX11_ABI=1 -D_GLIBCXX_USE_CXX11_ABI=1 
> -fno-omit-frame-pointer -g -fPIE   -pthread -std=gnu++11 -MD -MT 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> -MF 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o.d
>  -o 
> src/arrow/dataset/CMakeFiles/arrow-dataset-dataset-test.dir/dataset_test.cc.o 
> -c ../src/arrow/dataset/dataset_test.cc
> In file included from ../src/parquet/arrow/writer.h:25:0,
>  from ../src/arrow/dataset/test_util.h:27,
>  from ../src/arrow/dataset/dataset_test.cc:20:
> ../src/parquet/properties.h:30:10: fatal error: parquet/parquet_version.h: 
> Aucun fichier ou dossier de ce type
>  #include "parquet/parquet_version.h"
>   ^~~
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6370) [JS] Table.from adds 0 on int columns

2019-08-28 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917848#comment-16917848
 ] 

Brian Hulette commented on ARROW-6370:
--

What is the type of the int column, int64? int64s behave a little weirdly in 
JS. If running on a platform with BigInt, calls to Int64Array.get _should_ 
return an instance of it; otherwise they will return a two-element slice of an 
Int32Array holding the high and low 32-bit words.

Could you provide a little more detail on how you're generating the record 
batches, and maybe how you're observing the ints?
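
As a side illustration of why that can look like "a 0 for every other row": on 
a little-endian machine, small int64 values read through a 32-bit view 
interleave each value with a zero high word. A minimal standalone C++ sketch 
(not Arrow code):

{code:cpp}
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // Three small int64 values, laid out as in an Arrow Int64 data buffer.
  const int64_t values[] = {1, 2, 3};

  // Read the same bytes as 32-bit words -- what an Int32Array view over the
  // buffer sees on a little-endian machine: low word, high word, low, high...
  int32_t words[6];
  std::memcpy(words, values, sizeof(values));

  for (int32_t w : words) std::printf("%d ", w);  // prints: 1 0 2 0 3 0
  std::printf("\n");
  return 0;
}
{code}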

> [JS] Table.from adds 0 on int columns
> -
>
> Key: ARROW-6370
> URL: https://issues.apache.org/jira/browse/ARROW-6370
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.14.1
>Reporter: Sascha Hofmann
>Priority: Major
>
> I am generating an arrow table in pyarrow and sending it via gRPC like this:
> {code:java}
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> writer.write_batch(batch)
> writer.close()
> yield ds.Response(
> status=200,
> loading=False,
> response=[sink.getvalue().to_pybytes()]   
> )
> {code}
> On the JavaScript end, I parse it like this:
> {code:java}
>  Table.from(response.getResponseList()[0])
> {code}
> That works, but when I look at the actual table, the int columns have a 0 for 
> every other row. String columns seem to be parsed just fine. 
> The Python byte array created from to_pybytes() has the same length as the 
> one received in JavaScript. I am also able to recreate the original table 
> from the byte array in Python. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2019-08-28 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917837#comment-16917837
 ] 

Wes McKinney commented on ARROW-3571:
-

It would be great if you could move everything to Sphinx.

> [Wiki] Release management guide does not explain how to set up Crossbow or 
> where to find instructions
> -
>
> Key: ARROW-3571
> URL: https://issues.apache.org/jira/browse/ARROW-3571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> If you follow the guide, at one point it says "Launch a Crossbow build" but 
> provides no link to the setup instructions for this



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5645) [Python] Support inferring nested list types and converting from ndarray with ndim > 1

2019-08-28 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917827#comment-16917827
 ] 

Simeon H.K. Fitch commented on ARROW-5645:
--

This GitHub issue describes the desired end state:

https://github.com/apache/arrow/issues/4802

This feature is important for users of PySpark who want to construct tensors 
and feed them to ML libraries such as Keras via `pandas_udf`s.

> [Python] Support inferring nested list types and converting from ndarray with 
> ndim > 1
> --
>
> Key: ARROW-5645
> URL: https://issues.apache.org/jira/browse/ARROW-5645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> Follow up work to ARROW-4350



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6292) [C++] Add an option to build with mimalloc

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6292:
--
Labels: pull-request-available  (was: )

> [C++] Add an option to build with mimalloc
> --
>
> Key: ARROW-6292
> URL: https://issues.apache.org/jira/browse/ARROW-6292
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> It's a new allocator, Apache-licensed, by Microsoft. It claims very good 
> performance and is cross-platform (works on Windows and Unix).
> https://github.com/microsoft/mimalloc/
> There's a detailed set of APIs including aligned allocation and 
> zero-initialized allocation. However, zero-initialized reallocation doesn't 
> seem to be provided.
> https://microsoft.github.io/mimalloc/group__malloc.html#details
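
For reference, a minimal standalone sketch of the entry points mentioned above 
(plain mimalloc usage, not Arrow code; compile and link against mimalloc):

{code:cpp}
#include <mimalloc.h>

#include <cstdio>

int main() {
  // Zero-initialized allocation (calloc-style).
  void* a = mi_zalloc(256);

  // Aligned allocation, e.g. on a 64-byte cache-line boundary.
  void* b = mi_malloc_aligned(1024, 64);

  // Plain reallocation; the issue above notes that a *zero-initialized*
  // reallocation does not seem to be provided.
  b = mi_realloc(b, 2048);

  std::printf("a=%p b=%p\n", a, b);
  mi_free(a);
  mi_free(b);
  return 0;
}
{code}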



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-6292) [C++] Add an option to build with mimalloc

2019-08-28 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-6292:
-

Assignee: Antoine Pitrou

> [C++] Add an option to build with mimalloc
> --
>
> Key: ARROW-6292
> URL: https://issues.apache.org/jira/browse/ARROW-6292
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> It's a new allocator, Apache-licensed, by Microsoft. It claims very good 
> performance and is cross-platform (works on Windows and Unix).
> https://github.com/microsoft/mimalloc/
> There's a detailed set of APIs including aligned allocation and 
> zero-initialized allocation. However, zero-initialized reallocation doesn't 
> seem to be provided.
> https://microsoft.github.io/mimalloc/group__malloc.html#details



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917804#comment-16917804
 ] 

Andy Grove commented on ARROW-6372:
---

Another note on this: here we have one expression that references a column and 
one that is a literal. In this case the type of the column should take 
precedence, and the literal value (Int64) should be cast to the type of the 
column (UInt32). The literal is Int64 simply because that is the type the SQL 
parser uses for any integer literal.

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2019-08-28 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917803#comment-16917803
 ] 

Krisztian Szucs commented on ARROW-3571:


It does actually link to the README, but I will update it to point to the 
documentation page.

> [Wiki] Release management guide does not explain how to set up Crossbow or 
> where to find instructions
> -
>
> Key: ARROW-3571
> URL: https://issues.apache.org/jira/browse/ARROW-3571
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Wiki
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>
> If you follow the guide, at one point it says "Launch a Crossbow build" but 
> provides no link to the setup instructions for this



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917788#comment-16917788
 ] 

Andy Grove commented on ARROW-6372:
---

If nobody picks this up before the weekend, I will resolve it then. Really, we 
need to make get_supertype and can_coerce_from consistent and better 
implemented (they're pretty hacky right now and only support a subset of data 
types).

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-6372:
--
Labels: beginner  (was: )

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Major
>  Labels: beginner
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5646) [Crossbow][Documentation] Move the user guide to the Sphinx documentation

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5646:
--
Labels: pull-request-available  (was: )

> [Crossbow][Documentation] Move the user guide to the Sphinx documentation
> -
>
> Key: ARROW-5646
> URL: https://issues.apache.org/jira/browse/ARROW-5646
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Documentation
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> Move crossbow's already existing README to 
> docs/source/developers/crossbow.rst.
> Also answer how to run specific docker tasks with crossbow (like the 
> docker-cpp integration test).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5646) [Crossbow] Move the user guide to Sphinx

2019-08-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-5646:
---
Summary: [Crossbow] Move the user guide to Sphinx  (was: [Crossbow] User 
guide)

> [Crossbow] Move the user guide to Sphinx
> 
>
> Key: ARROW-5646
> URL: https://issues.apache.org/jira/browse/ARROW-5646
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Documentation
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>
> Move crossbow's already existing README to 
> docs/source/developers/crossbow.rst.
> Also answer how to run specific docker tasks with crossbow (like the 
> docker-cpp integration test).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5646) [Crossbow][Documentation] Move the user guide to the Sphinx documentation

2019-08-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-5646:
---
Summary: [Crossbow][Documentation] Move the user guide to the Sphinx 
documentation  (was: [Crossbow] Move the user guide to Sphinx)

> [Crossbow][Documentation] Move the user guide to the Sphinx documentation
> -
>
> Key: ARROW-5646
> URL: https://issues.apache.org/jira/browse/ARROW-5646
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Documentation
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>
> Move crossbow's already existing README to 
> docs/source/developers/crossbow.rst.
> Also answer how to run specific docker tasks with crossbow (like the 
> docker-cpp integration test).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6372) [Rust][Datafusion] Predicate push down optimization can break query plan

2019-08-28 Thread Andy Grove (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917782#comment-16917782
 ] 

Andy Grove commented on ARROW-6372:
---

Thanks Paddy. Here is a unit test to reproduce this (can be added to 
{{type_coercion.rs}}):
{code:java}
#[test]
fn test_add_u32_i64() {
    binary_cast_test(
        DataType::UInt32,
        DataType::Int64,
        "CAST(#0 AS Int64) Plus #1",
    );
    binary_cast_test(
        DataType::Int64,
        DataType::UInt32,
        "#0 Plus CAST(#1 AS Int64)",
    );
}
{code}
The issue is that the {{can_coerce_from}} function in 
{{datafusion/src/logicalplan.rs}} does not support automatic coercion between 
signed and unsigned types, and is inconsistent with the logic in the 
{{get_supertype}} method in {{datafusion/src/optimizer/utils.rs}}. 

This is also somewhat related to 
https://issues.apache.org/jira/browse/ARROW-4957
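
For illustration, here is a minimal Rust sketch of the missing rule, using a
simplified stand-in for the real DataType enum (all names below are
hypothetical, not the actual DataFusion code):
{code}
// Hypothetical stand-in for arrow's DataType; only the integer variants
// needed for this example.
#[derive(PartialEq)]
#[allow(dead_code)]
enum DataType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32,
}

// Sketch of the rule can_coerce_from would need: widening an unsigned
// integer into a strictly larger signed integer is lossless, so it should
// be permitted, mirroring what get_supertype already allows.
fn can_coerce_from(target: &DataType, from: &DataType) -> bool {
    use DataType::*;
    match (target, from) {
        (Int64, UInt8) | (Int64, UInt16) | (Int64, UInt32) => true,
        (Int64, Int8) | (Int64, Int16) | (Int64, Int32) => true,
        _ => target == from,
    }
}

fn main() {
    // UInt32 values always fit in an Int64, so the cast inserted by type
    // coercion (CAST(#0 AS Int64)) is safe.
    assert!(can_coerce_from(&DataType::Int64, &DataType::UInt32));
}
{code}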

> [Rust][Datafusion] Predicate push down optimization can break query plan
> -
>
> Key: ARROW-6372
> URL: https://issues.apache.org/jira/browse/ARROW-6372
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Affects Versions: 0.14.1
>Reporter: Paddy Horan
>Priority: Major
> Fix For: 0.15.0
>
>
> The following code reproduces the issue:
> [https://gist.github.com/paddyhoran/598db6cbb790fc5497320613e54a02c6]
> If you disable the predicate push down optimization it works fine.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6351) [Ruby] Improve Arrow#values performance

2019-08-28 Thread Yosuke Shiro (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro resolved ARROW-6351.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5194
[https://github.com/apache/arrow/pull/5194]

> [Ruby] Improve Arrow#values performance
> ---
>
> Key: ARROW-6351
> URL: https://issues.apache.org/jira/browse/ARROW-6351
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5508) [C++] Create reusable Iterator interface

2019-08-28 Thread Benjamin Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917771#comment-16917771
 ] 

Benjamin Kietzman commented on ARROW-5508:
--

I think the existing interface is sufficient for our current use cases.

WRT easier iteration, we currently have {{Iterator::Visit}}:

{code}
ASSERT_OK(iter.Visit([&](T v) {
  doSomethingWith(v);
  return Status::OK();
}));
{code}

> [C++] Create reusable Iterator interface 
> 
>
> Key: ARROW-5508
> URL: https://issues.apache.org/jira/browse/ARROW-5508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5522) [Packaging][Documentation] Comments out of date in python/manylinux1/build_arrow.sh

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5522:
--
Labels: pull-request-available wheel  (was: wheel)

> [Packaging][Documentation] Comments out of date in 
> python/manylinux1/build_arrow.sh
> ---
>
> Key: ARROW-5522
> URL: https://issues.apache.org/jira/browse/ARROW-5522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available, wheel
>
> The script has this comment:
> {code:java}
> # Usage:
> #   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
> {code}
> However, I get:
> {code}
> Unable to find image 'arrow-base-x86_64:latest' locally
> docker: Error response from daemon: pull access denied for arrow-base-x86_64, 
> repository does not exist or may require 'docker login'.
> See 'docker run --help'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5522) [Packaging][Documentation] Comments out of date in python/manylinux1/build_arrow.sh

2019-08-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-5522:
---
Component/s: Documentation

> [Packaging][Documentation] Comments out of date in 
> python/manylinux1/build_arrow.sh
> ---
>
> Key: ARROW-5522
> URL: https://issues.apache.org/jira/browse/ARROW-5522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: wheel
>
> The script has this comment:
> {code:java}
> # Usage:
> #   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
> {code}
> However, I get:
> {code}
> Unable to find image 'arrow-base-x86_64:latest' locally
> docker: Error response from daemon: pull access denied for arrow-base-x86_64, 
> repository does not exist or may require 'docker login'.
> See 'docker run --help'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-5522) [Packaging][Documentation] Comments out of date in python/manylinux1/build_arrow.sh

2019-08-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-5522:
---
Summary: [Packaging][Documentation] Comments out of date in 
python/manylinux1/build_arrow.sh  (was: [Packaging] Comments out of date in 
python/manylinux1/build_arrow.sh)

> [Packaging][Documentation] Comments out of date in 
> python/manylinux1/build_arrow.sh
> ---
>
> Key: ARROW-5522
> URL: https://issues.apache.org/jira/browse/ARROW-5522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: wheel
>
> The script has this comment:
> {code:java}
> # Usage:
> #   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
> {code}
> However, I get:
> {code}
> Unable to find image 'arrow-base-x86_64:latest' locally
> docker: Error response from daemon: pull access denied for arrow-base-x86_64, 
> repository does not exist or may require 'docker login'.
> See 'docker run --help'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (ARROW-5522) [Packaging] Comments out of date in python/manylinux1/build_arrow.sh

2019-08-28 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-5522:
--

Assignee: Krisztian Szucs

> [Packaging] Comments out of date in python/manylinux1/build_arrow.sh
> 
>
> Key: ARROW-5522
> URL: https://issues.apache.org/jira/browse/ARROW-5522
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: wheel
>
> The script has this comment:
> {code:java}
> # Usage:
> #   docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh
> {code}
> However, I get:
> {code}
> Unable to find image 'arrow-base-x86_64:latest' locally
> docker: Error response from daemon: pull access denied for arrow-base-x86_64, 
> repository does not exist or may require 'docker login'.
> See 'docker run --help'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6356) [Java] Avro adapter implement Enum type and nested Record type

2019-08-28 Thread Ji Liu (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917667#comment-16917667
 ] 

Ji Liu commented on ARROW-6356:
---

[~emkornfi...@gmail.com] I have a question about the Enum type. As mentioned in 
another thread, should this be converted to a dictionary-encoded vector? And how 
could we implement that in a VectorSchemaRoot: if we put the encoded int vector 
into the VectorSchemaRoot, where should we put the Dictionary and how would we 
use it?

> [Java] Avro adapter implement Enum type and nested Record type
> --
>
> Key: ARROW-6356
> URL: https://issues.apache.org/jira/browse/ARROW-6356
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>
> Implement conversion of the avro {{Enum}} type.
> Convert nested avro {{Record}} type to Arrow {{StructVector}}.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6374) [Java] Refactor the code for TimeXXVectors

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6374:
--
Labels: pull-request-available  (was: )

> [Java] Refactor the code for TimeXXVectors
> --
>
> Key: ARROW-6374
> URL: https://issues.apache.org/jira/browse/ARROW-6374
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
>
> This is based on the discussion in 
> https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E
> The internals of TimeXXVectors are simply IntVector or BigIntVector. There is 
> duplicated code for setting/getting int/long.
> We want to refactor the code by:
>  # pushing get/set methods into the base class BaseFixedWidthVector, and making 
> them protected.
>  # making the APIs in TimeXXVectors reference the methods in the base class.
> Note that this issue does not just reduce redundant code; it also centralizes 
> the logic for getting/setting int/long, making it easy to maintain and change.
> If it looks good, later we will make other integer-based vectors rely on the 
> base class implementations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6374) [Java] Refactor the code for TimeXXVectors

2019-08-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-6374:
---

 Summary: [Java] Refactor the code for TimeXXVectors
 Key: ARROW-6374
 URL: https://issues.apache.org/jira/browse/ARROW-6374
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is based on the discussion in 
https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E

The internals of TimeXXVectors are simply IntVector or BigIntVector. There is 
duplicated code for setting/getting int/long.

We want to refactor the code by:
 # pushing get/set methods into the base class BaseFixedWidthVector, and making 
them protected.
 # making the APIs in TimeXXVectors reference the methods in the base class.

Note that this issue does not just reduce redundant code; it also centralizes the 
logic for getting/setting int/long, making it easy to maintain and change.

If it looks good, later we will make other integer-based vectors rely on the 
base class implementations.
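
A minimal sketch of the shape of the proposed refactoring, using simplified
stand-in class names (not the actual Arrow Java code):
{code:java}
// Stand-in for BaseFixedWidthVector: the long accessors live here once,
// as protected methods, instead of being duplicated in each time vector.
abstract class BaseFixedWidthVectorSketch {
    protected final long[] data = new long[16];

    protected long getLong(int index) {
        return data[index];
    }

    protected void setLong(int index, long value) {
        data[index] = value;
    }
}

// Stand-in for a TimeXXVector: its public API simply references the
// base-class implementation.
final class TimeVectorSketch extends BaseFixedWidthVectorSketch {
    public long get(int index) {
        return getLong(index);
    }

    public void set(int index, long value) {
        setLong(index, value);
    }
}
{code}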



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6352) [Java] Add implementation of DenseUnionVector.

2019-08-28 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917633#comment-16917633
 ] 

Liya Fan commented on ARROW-6352:
-

[~emkornfi...@gmail.com] Thanks for the information. Will take a closer look.

> [Java] Add implementation of DenseUnionVector.
> --
>
> Key: ARROW-6352
> URL: https://issues.apache.org/jira/browse/ARROW-6352
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Major
>
> Today only Sparse unions are supported.  We should have a dense union 
> implementation vector that conforms to the IPC protocol (the current sparse 
> union vector doesn't do this and there are other JIRAs covering making it 
> compatible).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6373) [C++] Make FixedWidthBinaryBuilder consistent with other primitive fixed width builders

2019-08-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6373:
--
Labels: pull-request-available  (was: )

> [C++] Make FixedWidthBinaryBuilder consistent with other primitive fixed 
> width builders
> ---
>
> Key: ARROW-6373
> URL: https://issues.apache.org/jira/browse/ARROW-6373
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6373) [C++] Make FixedWidthBinaryBuilder consistent with other primitive fixed width builders

2019-08-28 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6373:
--

 Summary: [C++] Make FixedWidthBinaryBuilder consistent with other 
primitive fixed width builders
 Key: ARROW-6373
 URL: https://issues.apache.org/jira/browse/ARROW-6373
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6352) [Java] Add implementation of DenseUnionVector.

2019-08-28 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917545#comment-16917545
 ] 

Micah Kornfield commented on ARROW-6352:


[~fan_li_ya] https://issues.apache.org/jira/browse/ARROW-1692?filter=-1 is the 
one for sparse.

> [Java] Add implementation of DenseUnionVector.
> --
>
> Key: ARROW-6352
> URL: https://issues.apache.org/jira/browse/ARROW-6352
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Major
>
> Today only Sparse unions are supported.  We should have a dense union 
> implementation vector that conforms to the IPC protocol (the current sparse 
> union vector doesn't do this and there are other JIRAs covering making it 
> compatible).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6352) [Java] Add implementation of DenseUnionVector.

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-6352:
---
Description: Today only Sparse unions are supported.  We should have a 
dense union implementation vector that conforms to the IPC protocol (the 
current sparse union vector doesn't do this and there are other JIRAs covering 
making it compatible).  (was: Today only Sparse unions are supported.  We 
should have a dense union implementation vector that conforms to the IPC 
protocol (the current spare union vector doesn't do this and there are other 
JIRAs covering making these compatible).)

> [Java] Add implementation of DenseUnionVector.
> --
>
> Key: ARROW-6352
> URL: https://issues.apache.org/jira/browse/ARROW-6352
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Major
>
> Today only Sparse unions are supported.  We should have a dense union 
> implementation vector that conforms to the IPC protocol (the current sparse 
> union vector doesn't do this and there are other JIRAs covering making it 
> compatible).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5508) [C++] Create reusable Iterator interface

2019-08-28 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917538#comment-16917538
 ] 

Micah Kornfield commented on ARROW-5508:


My only comment on this is that it might pay to sketch out what iteration looks 
like in each case.

Using StopIteration and Result I think you get something like:

{code}
while (RETURN_IF_NOT_STOPPED_ERROR(v, iter.next())) {
  doSomethingWith(v);
}
{code}

Not sure if the macros to make that nice are even viable.

"As for signaling completion without consuming a value, well, the problem is 
that not all iterators may support that. Is there a use case where this 
matters?"

I think in many cases this can be solved with a "peek"-style iterator?

> [C++] Create reusable Iterator interface 
> 
>
> Key: ARROW-5508
> URL: https://issues.apache.org/jira/browse/ARROW-5508
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> We have various iterator-like classes. I envision a reusable interface like
> {code}
> template <typename T>
> class Iterator {
>  public:
>   virtual ~Iterator() = default;
>   virtual Status Next(T* out) = 0;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5960) [C++] Boost dependencies are specified in wrong order

2019-08-28 Thread Ingo Müller (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917526#comment-16917526
 ] 

Ingo Müller commented on ARROW-5960:


I confirm that my problem is solved in master.

> [C++] Boost dependencies are specified in wrong order
> -
>
> Key: ARROW-5960
> URL: https://issues.apache.org/jira/browse/ARROW-5960
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Ingo Müller
>Assignee: Ingo Müller
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  The boost dependencies in cpp/CMakeLists.txt are specified in the wrong 
> order: the system library currently comes first, followed by the filesystem 
> library. They should be specified in the opposite order, as filesystem 
> depends on system.
> It seems to depend on the version of boost or how it is compiled whether this 
> problem becomes apparent. I am currently setting up the project like this:
> {code:java}
> CXX=clang++-7.0 CC=clang-7.0 \
>     cmake \
>     -DCMAKE_CXX_STANDARD=17 \
>     -DCMAKE_INSTALL_PREFIX=/tmp/arrow4/dist \
>     -DCMAKE_INSTALL_LIBDIR=lib \
>     -DARROW_WITH_RAPIDJSON=ON \
>     -DARROW_PARQUET=ON \
>     -DARROW_PYTHON=ON \
>     -DARROW_FLIGHT=OFF \
>     -DARROW_GANDIVA=OFF \
>     -DARROW_BUILD_UTILITIES=OFF \
>     -DARROW_CUDA=OFF \
>     -DARROW_ORC=OFF \
>     -DARROW_JNI=OFF \
>     -DARROW_TENSORFLOW=OFF \
>     -DARROW_HDFS=OFF \
>     -DARROW_BUILD_TESTS=OFF \
>     -DARROW_RPATH_ORIGIN=ON \
>     ..{code}
> After compiling, libarrow.so is missing symbols:
> {code:java}
> nm -C /dist/lib/libarrow.so | grep boost::system::system_c
>  U boost::system::system_category(){code}
> It seems like this is related to whether or not boost has been compiled with 
> {{BOOST_SYSTEM_NO_DEPRECATED}}. (according to [this 
> post|https://stackoverflow.com/a/30877725/651937], anyway). I have to say 
> that I don't understand why boost as BUNDLED should be compiled that way...
> If I apply the following patch, everything works as expected:
>  
> {code:java}
> diff -pur a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
> --- a/cpp/CMakeLists.txt   2019-06-29 00:26:37.0 +0200
> +++ b/cpp/CMakeLists.txt    2019-07-16 16:36:03.980153919 +0200
> @@ -642,8 +642,8 @@ if(ARROW_STATIC_LINK_LIBS)
>    add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS})
>  endif()
> -set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_SYSTEM_LIBRARY}
> -                                   ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
> +set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} ${BOOST_FILESYSTEM_LIBRARY}
> +                                   ${BOOST_SYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY})
>  list(APPEND ARROW_STATIC_LINK_LIBS ${BOOST_SYSTEM_LIBRARY} ${BOOST_FILESYSTEM_LIBRARY}
>                                     ${BOOST_REGEX_LIBRARY})
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6334) [Java] Improve the dictionary builder API to return the position of the value in the dictionary

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6334.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5177
[https://github.com/apache/arrow/pull/5177]

> [Java] Improve the dictionary builder API to return the position of the value 
> in the dictionary
> ---
>
> Key: ARROW-6334
> URL: https://issues.apache.org/jira/browse/ARROW-6334
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This is an improvement of the {{addValue}} method.
> Previously, the method returns a boolean, indicating if the value has been 
> successfully added to the dictionary.
> After the change, the method returns an integer, which is the position of the 
> value in the dictionary.
> The purpose of this change:
>  # the dictionary position contains more information, compared with a boolean 
> indicating if the value is added successfully.
>  # this information about the index in the dictionary can be useful, for 
> example, to collect statistics about the dictionary.
> With the dictionary position, the information about if a value has been added 
> can be easily determined.
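> For illustration, a minimal sketch of the changed contract, using a plain 
> map-backed dictionary (not the actual Arrow Java builder):
> {code:java}
> import java.util.LinkedHashMap;
> import java.util.Map;
> 
> final class DictionaryBuilderSketch<V> {
>     private final Map<V, Integer> positions = new LinkedHashMap<>();
> 
>     // Returns the value's position in the dictionary instead of a
>     // boolean; a new value is appended and receives the next index.
>     // Whether the value was newly added can still be derived by
>     // comparing the dictionary size before and after the call.
>     int addValue(V value) {
>         return positions.computeIfAbsent(value, v -> positions.size());
>     }
> }
> {code}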



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6306) [Java] Support stable sort by stable comparators

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6306.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5153
[https://github.com/apache/arrow/pull/5153]

> [Java] Support stable sort by stable comparators
> 
>
> Key: ARROW-6306
> URL: https://issues.apache.org/jira/browse/ARROW-6306
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Stable sort is desirable in many scenarios. It means equal elements preserve 
> their relative order after sorting.
> Stable sort algorithms exist. However, in practice, the fastest general-purpose 
> sort is quick sort, and quick sort is not stable.
> To get the best of both worlds, we support stable sort through stable 
> comparators. A stable comparator differs from an ordinary one in that it breaks 
> ties by comparing the value indices.
> With a stable comparator, the quick sort algorithm becomes stable, as in the 
> sketch below.
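> For illustration, a minimal sketch of the tie-breaking idea over value 
> indices (not the actual Arrow Java API):
> {code:java}
> import java.util.Comparator;
> 
> // Wraps an ordinary comparator over value indices; ties are broken by
> // the indices themselves, so equal elements keep their original relative
> // order and even an unstable sort such as quick sort behaves stably.
> final class StableComparatorSketch implements Comparator<Integer> {
>     private final Comparator<Integer> delegate;
> 
>     StableComparatorSketch(Comparator<Integer> delegate) {
>         this.delegate = delegate;
>     }
> 
>     @Override
>     public int compare(Integer leftIndex, Integer rightIndex) {
>         int result = delegate.compare(leftIndex, rightIndex);
>         return result != 0 ? result : Integer.compare(leftIndex, rightIndex);
>     }
> }
> {code}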



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6136) [FlightRPC][Java] Don't double-close response stream

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6136.

Resolution: Fixed

Issue resolved by pull request 5013
[https://github.com/apache/arrow/pull/5013]

> [FlightRPC][Java] Don't double-close response stream
> 
>
> Key: ARROW-6136
> URL: https://issues.apache.org/jira/browse/ARROW-6136
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 0.14.1
>Reporter: lidavidm
>Assignee: lidavidm
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> DoPut in Java double-closes the metadata response stream: if the service 
> implementation sends an error down that channel, the Flight implementation 
> will unconditionally try to complete the stream, violating the gRPC semantics 
> (either an error or a completion may be sent, never both).
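> For illustration, a minimal sketch of one way to guard a stream against a 
> double terminal signal, using the io.grpc StreamObserver interface but 
> hypothetical class names (not the actual Flight fix):
> {code:java}
> import io.grpc.stub.StreamObserver;
> import java.util.concurrent.atomic.AtomicBoolean;
> 
> // Forwards at most one terminal signal (onError or onCompleted) to the
> // underlying gRPC stream, matching gRPC's "error or completion, never
> // both" semantics.
> final class OnceObserver<T> implements StreamObserver<T> {
>     private final StreamObserver<T> delegate;
>     private final AtomicBoolean terminated = new AtomicBoolean(false);
> 
>     OnceObserver(StreamObserver<T> delegate) {
>         this.delegate = delegate;
>     }
> 
>     @Override
>     public void onNext(T value) {
>         delegate.onNext(value);
>     }
> 
>     @Override
>     public void onError(Throwable t) {
>         if (terminated.compareAndSet(false, true)) {
>             delegate.onError(t);
>         }
>     }
> 
>     @Override
>     public void onCompleted() {
>         if (terminated.compareAndSet(false, true)) {
>             delegate.onCompleted();
>         }
>     }
> }
> {code}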



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6297) [Java] Compare ArrowBufPointers by unsigned integers

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6297.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5135
[https://github.com/apache/arrow/pull/5135]

> [Java] Compare ArrowBufPointers by unsigned integers
> 
>
> Key: ARROW-6297
> URL: https://issues.apache.org/jira/browse/ARROW-6297
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, ArrowBufPointers are compared byte by byte in lexicographic order. 
> Another way is to compare by unsigned integers (longs, ints, and bytes). 
> The second way involves additional bit operations for each iteration. However, 
> it can compare 8 bytes at a time, so it is faster overall:
>  
> Compare by unsigned integers:
> ArrowBufPointerBenchmarks.compareBenchmark avgt 5 65.722 ± 0.381 ns/op
>  
> Compare byte-wise:
> ArrowBufPointerBenchmarks.compareBenchmark avgt 5 681.372 ± 0.604 ns/op
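> For illustration, a minimal sketch of the 8-bytes-at-a-time idea over 
> plain byte arrays (not the actual ArrowBufPointer implementation):
> {code:java}
> import java.nio.ByteBuffer;
> 
> final class UnsignedCompareSketch {
>     // Lexicographic comparison that reads big-endian longs and compares
>     // them as unsigned 64-bit integers, falling back to byte-wise
>     // comparison for the tail.
>     static int compare(byte[] left, byte[] right) {
>         int n = Math.min(left.length, right.length);
>         ByteBuffer lb = ByteBuffer.wrap(left);
>         ByteBuffer rb = ByteBuffer.wrap(right);
>         int i = 0;
>         for (; i + Long.BYTES <= n; i += Long.BYTES) {
>             int c = Long.compareUnsigned(lb.getLong(i), rb.getLong(i));
>             if (c != 0) {
>                 return c;
>             }
>         }
>         for (; i < n; i++) {
>             int c = Integer.compare(left[i] & 0xFF, right[i] & 0xFF);
>             if (c != 0) {
>                 return c;
>             }
>         }
>         return Integer.compare(left.length, right.length);
>     }
> }
> {code}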



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6304) [Java] Add description to each maven artifact

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6304.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5151
[https://github.com/apache/arrow/pull/5151]

> [Java] Add description to each maven artifact
> -
>
> Key: ARROW-6304
> URL: https://issues.apache.org/jira/browse/ARROW-6304
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Note experimental/contrib nature of package and a brief description.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6113) [Java] Support vector deduplicate function

2019-08-28 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-6113.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 4993
[https://github.com/apache/arrow/pull/4993]

> [Java] Support vector deduplicate function
> --
>
> Key: ARROW-6113
> URL: https://issues.apache.org/jira/browse/ARROW-6113
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Remove adjacent duplicated elements from a vector. This function can be 
> used, for example, to find distinct values or to compress the vector 
> data.
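> For illustration, a minimal sketch of adjacent deduplication over a plain 
> int array (not the actual Arrow Java implementation):
> {code:java}
> import java.util.Arrays;
> 
> final class DedupSketch {
>     // Keeps an element only when it differs from the previously kept
>     // element, so runs of equal adjacent values collapse to one value.
>     static int[] dedupAdjacent(int[] values) {
>         if (values.length == 0) {
>             return values;
>         }
>         int out = 1;
>         for (int i = 1; i < values.length; i++) {
>             if (values[i] != values[out - 1]) {
>                 values[out++] = values[i];
>             }
>         }
>         return Arrays.copyOf(values, out);
>     }
> }
> {code}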



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

