[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387263#comment-16387263
 ] 

ASF GitHub Bot commented on ARROW-2193:
---

wesm opened a new pull request #1711: WIP ARROW-2193: [C++] Do not depend on 
Boost libraries at runtime in plasma_store
URL: https://github.com/apache/arrow/pull/1711
 
 
   This is sort of a hack; I wasn't sure how to deal with this more 
generally. Unfortunately, this only gets rid of the boost_system and 
boost_filesystem runtime dependencies; boost_regex still shows up as a 
transitive dependency somehow.
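
   As a hedged aside (not part of the PR; the binary path below is an 
assumption for illustration), one way to confirm which Boost shared libraries 
a built plasma_store still links at runtime is to run ldd against it:

{code:python}
# Hedged sketch: list the Boost shared-library dependencies of a plasma_store
# binary. The path is a hypothetical build location -- adjust as needed.
import subprocess

PLASMA_STORE = "debug/plasma_store"  # hypothetical path to the built binary

# ldd prints the shared libraries the dynamic linker would load for the binary.
result = subprocess.run(["ldd", PLASMA_STORE],
                        stdout=subprocess.PIPE, universal_newlines=True)

boost_deps = [line.strip() for line in result.stdout.splitlines()
              if "libboost" in line]
print("\n".join(boost_deps) if boost_deps else "no Boost shared libraries linked")
{code}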


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-03-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2193:
--
Labels: pull-request-available  (was: )

> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2193:
---

Assignee: Wes McKinney

> [Plasma] plasma_store has runtime dependency on Boost shared libraries when 
> ARROW_BOOST_USE_SHARED=on
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2257) [C++] Add high-level option to toggle CXX11 ABI

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2257.
-
Resolution: Won't Fix
  Assignee: Wes McKinney

This is documented well enough in 
https://github.com/apache/arrow/blob/master/python/doc/source/development.rst#known-issues;
users who run into this problem should be directed there.

> [C++] Add high-level option to toggle CXX11 ABI
> ---
>
> Key: ARROW-2257
> URL: https://issues.apache.org/jira/browse/ARROW-2257
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Using gcc-4.8-based toolchain libraries from conda-forge I ran into the 
> following failure when building on Ubuntu 16.04 with clang-5.0
> {code}
> [48/48] Linking CXX executable debug/python-test
> FAILED: debug/python-test 
> : && /usr/bin/ccache /usr/bin/clang++-5.0  -ggdb -O0  -Weverything 
> -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-deprecated 
> -Wno-weak-vtables -Wno-padded -Wno-comma -Wno-unused-parameter 
> -Wno-unused-template -Wno-undef -Wno-shadow -Wno-switch-enum 
> -Wno-exit-time-destructors -Wno-global-constructors 
> -Wno-weak-template-vtables -Wno-undefined-reinterpret-cast 
> -Wno-implicit-fallthrough -Wno-unreachable-code-return -Wno-float-equal 
> -Wno-missing-prototypes -Wno-old-style-cast -Wno-covered-switch-default 
> -Wno-cast-align -Wno-vla-extension -Wno-shift-sign-overflow 
> -Wno-used-but-marked-unused -Wno-missing-variable-declarations 
> -Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-sign-conversion 
> -Wno-disabled-macro-expansion -Wno-gnu-folding-constant 
> -Wno-reserved-id-macro -Wno-range-loop-analysis -Wno-double-promotion 
> -Wno-undefined-func-template -Wno-zero-as-null-pointer-constant 
> -Wno-unknown-warning-option -Werror -std=c++11 -msse3 -maltivec -Werror 
> -D_GLIBCXX_USE_CXX11_ABI=0 -Qunused-arguments  -fsanitize=address 
> -DADDRESS_SANITIZER -fsanitize-coverage=trace-pc-guard -g  -rdynamic 
> src/arrow/python/CMakeFiles/python-test.dir/python-test.cc.o  -o 
> debug/python-test  
> -Wl,-rpath,/home/wesm/code/arrow/cpp/build/debug:/home/wesm/miniconda/envs/arrow-dev/lib:/home/wesm/cpp-toolchain/lib
>  debug/libarrow_python_test_main.a debug/libarrow_python.a 
> debug/libarrow.so.0.0.0 
> /home/wesm/miniconda/envs/arrow-dev/lib/libpython3.6m.so 
> /home/wesm/cpp-toolchain/lib/libgtest.a -lpthread -ldl 
> orc_ep-install/lib/liborc.a /home/wesm/cpp-toolchain/lib/libprotobuf.a 
> /home/wesm/cpp-toolchain/lib/libzstd.a /home/wesm/cpp-toolchain/lib/libz.a 
> /home/wesm/cpp-toolchain/lib/libsnappy.a 
> /home/wesm/cpp-toolchain/lib/liblz4.a 
> /home/wesm/cpp-toolchain/lib/libbrotlidec-static.a 
> /home/wesm/cpp-toolchain/lib/libbrotlienc-static.a 
> /home/wesm/cpp-toolchain/lib/libbrotlicommon-static.a -lpthread 
> -Wl,-rpath-link,/home/wesm/cpp-toolchain/lib && :
> debug/libarrow.so.0.0.0: undefined reference to 
> `orc::ParseError::ParseError(std::string const&)'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::io::CodedOutputStream::WriteStringWithSizeToArray(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned char*)'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::internal::WireFormatLite::WriteStringMaybeAliased(int, 
> std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::internal::fixed_address_empty_string[abi:cxx11]'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::internal::WireFormatLite::ReadBytes(google::protobuf::io::CodedInputStream*,
>  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::Message::GetTypeName[abi:cxx11]() const'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::Message::InitializationErrorString[abi:cxx11]() const'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::MessageLite::SerializeToString(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::internal::WireFormatLite::WriteString(int, 
> std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, google::protobuf::io::CodedOutputStream*)'
> debug/libarrow.so.0.0.0: undefined reference to 
> `google::protobuf::MessageFactory::InternalRegisterGeneratedFile(char const*, 
> void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))'
> debug/libarrow.so.0.0.0: undefined reference to 
> 

[jira] [Updated] (ARROW-2265) [Python] Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2265:
--
Labels: pull-request-available  (was: )

> [Python] Serializing subclasses of np.ndarray returns a np.ndarray.
> ---
>
> Key: ARROW-2265
> URL: https://issues.apache.org/jira/browse/ARROW-2265
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2265) [Python] Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387154#comment-16387154
 ] 

ASF GitHub Bot commented on ARROW-2265:
---

wesm closed pull request #1704: ARROW-2265: [Python] Use CheckExact when 
serializing lists and numpy arrays.
URL: https://github.com/apache/arrow/pull/1704
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/python_to_arrow.cc 
b/cpp/src/arrow/python/python_to_arrow.cc
index 6d4f64675..d781d9f0f 100644
--- a/cpp/src/arrow/python/python_to_arrow.cc
+++ b/cpp/src/arrow/python/python_to_arrow.cc
@@ -501,7 +501,7 @@ Status Append(PyObject* context, PyObject* elem, 
SequenceBuilder* builder,
   return Status::Invalid("Cannot writes bytes over 2GB");
 }
 RETURN_NOT_OK(builder->AppendString(data, static_cast<int32_t>(size)));
-  } else if (PyList_Check(elem)) {
+  } else if (PyList_CheckExact(elem)) {
 RETURN_NOT_OK(builder->AppendList(PyList_Size(elem)));
 sublists->push_back(elem);
   } else if (PyDict_CheckExact(elem)) {
@@ -515,7 +515,7 @@ Status Append(PyObject* context, PyObject* elem, 
SequenceBuilder* builder,
 subsets->push_back(elem);
   } else if (PyArray_IsScalar(elem, Generic)) {
 RETURN_NOT_OK(AppendScalar(elem, builder));
-  } else if (PyArray_Check(elem)) {
+  } else if (PyArray_CheckExact(elem)) {
 RETURN_NOT_OK(SerializeArray(context, 
reinterpret_cast<PyArrayObject*>(elem), builder,
  subdicts, blobs_out));
   } else if (elem == Py_None) {
diff --git a/python/pyarrow/tests/test_serialization.py 
b/python/pyarrow/tests/test_serialization.py
index 72315d2dc..c17408457 100644
--- a/python/pyarrow/tests/test_serialization.py
+++ b/python/pyarrow/tests/test_serialization.py
@@ -410,6 +410,33 @@ def deserialize_dummy_class(serialized_obj):
 pa.serialize(DummyClass(), context=context)
 
 
+def test_numpy_subclass_serialization():
+# Check that we can properly serialize subclasses of np.ndarray.
+class CustomNDArray(np.ndarray):
+def __new__(cls, input_array):
+array = np.asarray(input_array).view(cls)
+return array
+
+def serializer(obj):
+return {'numpy': obj.view(np.ndarray)}
+
+def deserializer(data):
+array = data['numpy'].view(CustomNDArray)
+return array
+
+context = pa.default_serialization_context()
+
+context.register_type(CustomNDArray, 'CustomNDArray',
+  custom_serializer=serializer,
+  custom_deserializer=deserializer)
+
+x = CustomNDArray(np.zeros(3))
+serialized = pa.serialize(x, context=context).to_buffer()
+new_x = pa.deserialize(serialized, context=context)
+assert type(new_x) == CustomNDArray
+assert np.alltrue(new_x.view(np.ndarray) == np.zeros(3))
+
+
 def test_buffer_serialization():
 
 class BufferClass(object):


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
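
The diff above swaps PyList_Check/PyArray_Check for the *_CheckExact variants so 
that subclasses no longer match the generic list/ndarray branches and instead 
fall through to any registered custom serializer. A hedged, purely illustrative 
sketch of the same distinction at the Python level (not part of the PR):

{code:python}
# Hedged illustration: the Python-level analogue of PyList_Check vs
# PyList_CheckExact. A subclass passes isinstance() but fails an exact type
# check, so it can be routed to a custom serializer instead.
class MyList(list):
    pass

x = MyList([1, 2, 3])

print(isinstance(x, list))   # True  -- PyList_Check-like behavior
print(type(x) is list)       # False -- PyList_CheckExact-like behavior
{code}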


> [Python] Serializing subclasses of np.ndarray returns a np.ndarray.
> ---
>
> Key: ARROW-2265
> URL: https://issues.apache.org/jira/browse/ARROW-2265
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2265) [Python] Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2265.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1704
[https://github.com/apache/arrow/pull/1704]

> [Python] Serializing subclasses of np.ndarray returns a np.ndarray.
> ---
>
> Key: ARROW-2265
> URL: https://issues.apache.org/jira/browse/ARROW-2265
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2268) Remove MD5 checksums from release process

2018-03-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2268:
---

 Summary: Remove MD5 checksums from release process
 Key: ARROW-2268
 URL: https://issues.apache.org/jira/browse/ARROW-2268
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Wes McKinney
 Fix For: 0.9.0


The ASF has changed its release policy for signatures and checksums to 
contraindicate the use of MD5 checksums: 
http://www.apache.org/dev/release-distribution#sigs-and-sums. We should remove 
this from our various release scripts prior to the 0.9.0 release
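
As a hedged sketch of what the replacement could look like (this is not the 
actual release tooling, and the artifact name is an assumption for 
illustration), SHA-256/SHA-512 sidecar files can be produced like this:

{code:python}
# Hedged sketch: write SHA-256/SHA-512 checksum files for a release artifact
# instead of MD5. The artifact name is hypothetical.
import hashlib

ARTIFACT = "apache-arrow-0.9.0.tar.gz"  # hypothetical artifact

def write_checksum(path, algorithm):
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # Same "<digest>  <filename>" layout that sha256sum/sha512sum emit
    with open(path + "." + algorithm, "w") as out:
        out.write("{}  {}\n".format(digest.hexdigest(), path))

for algorithm in ("sha256", "sha512"):
    write_checksum(ARTIFACT, algorithm)
{code}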



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1391) [Python] Benchmarks for python serialization

2018-03-05 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387094#comment-16387094
 ] 

Wes McKinney commented on ARROW-1391:
-

See some recent work on this 
https://github.com/apache/arrow/commit/0ada87531dca52d51d4f60d3148a9ba733d96a48.
 

> [Python] Benchmarks for python serialization
> 
>
> Key: ARROW-1391
> URL: https://issues.apache.org/jira/browse/ARROW-1391
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Philipp Moritz
>Priority: Minor
>
> It would be great to have a suite of relevant benchmarks for the Python 
> serialization code in ARROW-759. These could be used to guide profiling and 
> performance improvements.
> Relevant use cases include:
> - dictionaries of large numpy arrays that are used to represent weights of a 
> neural network
> - long lists of primitive types like ints, floats or strings
> - lists of user defined python objects
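
A hedged sketch of what such a benchmark could look like for the use cases 
listed above (this is not an existing Arrow benchmark suite, just a minimal 
timing loop over the public pyarrow.serialize/deserialize API; the payload 
sizes are arbitrary assumptions):

{code:python}
# Hedged sketch: time a serialize/deserialize round trip for each use case.
import time

import numpy as np
import pyarrow as pa

CASES = {
    "dict of large numpy arrays": {"w%d" % i: np.random.rand(1000, 1000)
                                   for i in range(4)},
    "long list of ints": list(range(1000000)),
    "long list of strings": ["item-%d" % i for i in range(100000)],
}

for name, payload in CASES.items():
    start = time.perf_counter()
    buf = pa.serialize(payload).to_buffer()
    pa.deserialize(buf)
    elapsed = time.perf_counter() - start
    print("%-28s %8.1f ms  (%d bytes)" % (name, elapsed * 1e3, buf.size))
{code}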



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2267) Rust bindings

2018-03-05 Thread Joshua Howard (JIRA)
Joshua Howard created ARROW-2267:


 Summary: Rust bindings
 Key: ARROW-2267
 URL: https://issues.apache.org/jira/browse/ARROW-2267
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust
Reporter: Joshua Howard


Provide Rust bindings for Arrow. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387042#comment-16387042
 ] 

ASF GitHub Bot commented on ARROW-2122:
---

adshieh commented on a change in pull request #1707: ARROW-2122: [Python] 
Pyarrow fails to serialize dataframe with timestamp.
URL: https://github.com/apache/arrow/pull/1707#discussion_r172375593
 
 

 ##
 File path: python/pyarrow/types.pxi
 ##
 @@ -847,6 +847,25 @@ cdef timeunit_to_string(TimeUnit unit):
 return 'ns'
 
 
+FIXED_OFFSET_PREFIX = '+'
 
 Review comment:
   @wesm suggestions?
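
   For context, a hedged sketch of the kind of conversion a fixed-offset prefix 
constant would support (the exact encoding chosen in the PR is not visible in 
this fragment, so the string format below is an assumption):

{code:python}
# Hedged sketch: round-trip a fixed-offset timezone through a string.
# pytz.FixedOffset takes the offset east of UTC in minutes.
import pytz

def tz_to_string(tz):
    offset_minutes = int(tz.utcoffset(None).total_seconds() // 60)
    return "%+05d" % offset_minutes          # e.g. FixedOffset(60) -> "+0060"

def string_to_tz(value):
    return pytz.FixedOffset(int(value))      # "+0060" -> FixedOffset(60)

tz = pytz.FixedOffset(60)                     # the +01:00 offset from the traceback
assert string_to_tz(tz_to_string(tz)).utcoffset(None) == tz.utcoffset(None)
{code}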


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387027#comment-16387027
 ] 

ASF GitHub Bot commented on ARROW-1643:
---

ehsantn commented on issue #1668: ARROW-1643: [Python] Accept hdfs:// prefixes 
in parquet.read_table and attempt to connect to HDFS
URL: https://github.com/apache/arrow/pull/1668#issuecomment-370615960
 
 
   Sorry @wesm, I was busy with unexpected tasks. Looking at it now; will 
hopefully finish tomorrow.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread Siddharth Teotia (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-2199.
-
Resolution: Fixed

> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2266) [CI] Improve runtime of integration tests in Travis CI

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2266:

Description: 
I was surprised to see that travis_script_integration.sh is taking over 25 
minutes to run (https://travis-ci.org/apache/arrow/jobs/349493491). My only 
real guess about what's going on is that JVM startup time on these hosts is 
super slow.

I can think of some things we could do to make things better:

* Add debugging output so we can see what's slow
* Write a Java integration test handler that validates multiple files at once
* Generate a single set of binary files for each producer rather than 
regenerating them each time (so Java would only need to produce binary files 
once instead of 3 times like now)

  was:
I was surprised to see that travis_script_integration.sh is taking over 25 
minutes to run. My only real guess about what's going on is that JVM startup 
time on these hosts is super slow.

I can think of some things we could do to make things better:

* Add debugging output so we can see what's slow
* Write a Java integration test handler that validates multiple files at once
* Generate a single set of binary files for each producer rather than 
regenerating them each time (so Java would only need to produce binary files 
once instead of 3 times like now)


> [CI] Improve runtime of integration tests in Travis CI
> --
>
> Key: ARROW-2266
> URL: https://issues.apache.org/jira/browse/ARROW-2266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Wes McKinney
>Priority: Major
>
> I was surprised to see that travis_script_integration.sh is taking over 25 
> minutes to run (https://travis-ci.org/apache/arrow/jobs/349493491). My only 
> real guess about what's going on is that JVM startup time on these hosts is 
> super slow.
> I can think of some things we could do to make things better:
> * Add debugging output so we can see what's slow
> * Write a Java integration test handler that validates multiple files at once
> * Generate a single set of binary files for each producer rather than 
> regenerating them each time (so Java would only need to produce binary files 
> once instead of 3 times like now)
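
On the first bullet above, a hedged sketch of a quick check (not existing CI 
code) to test the hypothesis that JVM startup time dominates:

{code:python}
# Hedged sketch: time a bare JVM start on the CI host.
import subprocess
import time

start = time.perf_counter()
subprocess.run(["java", "-version"],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print("JVM startup + exit: %.2f s" % (time.perf_counter() - start))
{code}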



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2266) [CI] Improve runtime of integration tests in Travis CI

2018-03-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2266:
---

 Summary: [CI] Improve runtime of integration tests in Travis CI
 Key: ARROW-2266
 URL: https://issues.apache.org/jira/browse/ARROW-2266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Integration
Reporter: Wes McKinney


I was surprised to see that travis_script_integration.sh is taking over 25 
minutes to run. My only real guess about what's going on is that JVM startup 
time on these hosts is super slow.

I can think of some things we could do to make things better:

* Add debugging output so we can see what's slow
* Write a Java integration test handler that validates multiple files at once
* Generate a single set of binary files for each producer rather than 
regenerating them each time (so Java would only need to produce binary files 
once instead of 3 times like now)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386914#comment-16386914
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia closed pull request #1646: ARROW-2199: [JAVA] Control the 
memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java 
b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
index 5411baf7b..2f70f7372 100644
--- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
+++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
@@ -134,6 +134,9 @@ private static String createErrorMsg(final BufferAllocator 
allocator, final int
* @return The closest power of two of that value.
*/
   public static int nextPowerOfTwo(int val) {
+if (val == 0 || val == 1) {
+  return val + 1;
+}
 int highestBit = Integer.highestOneBit(val);
 if (highestBit == val) {
   return val;
@@ -149,6 +152,9 @@ public static int nextPowerOfTwo(int val) {
* @return The closest power of two of that value.
*/
   public static long nextPowerOfTwo(long val) {
+if (val == 0 || val == 1) {
+  return val + 1;
+}
 long highestBit = Long.highestOneBit(val);
 if (highestBit == val) {
   return val;
diff --git a/java/vector/src/main/codegen/templates/UnionVector.java 
b/java/vector/src/main/codegen/templates/UnionVector.java
index 84450bee5..1cfa0666a 100644
--- a/java/vector/src/main/codegen/templates/UnionVector.java
+++ b/java/vector/src/main/codegen/templates/UnionVector.java
@@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+assert newAllocationSize >= 1;
 
 if (newAllocationSize > BaseValueVector.MAX_ALLOCATION_SIZE) {
   throw new OversizedAllocationException("Unable to expand the buffer");
diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java 
b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
index cbc56fe3d..4b47df8a4 100644
--- 
a/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
+++ 
b/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java
@@ -444,6 +444,7 @@ private ArrowBuf reallocBufferHelper(ArrowBuf buffer, final 
boolean dataBuffer)
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+assert newAllocationSize >= 1;
 
 if (newAllocationSize > MAX_ALLOCATION_SIZE) {
   throw new OversizedAllocationException("Unable to expand the buffer");
diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
 
b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
index c32d20f18..ecb3c780e 100644
--- 
a/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
+++ 
b/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
@@ -174,14 +174,14 @@ public void setInitialCapacity(int valueCount) {
* @param valueCount desired number of elements in the vector
* @param density average number of bytes per variable width element
*/
+  @Override
   public void setInitialCapacity(int valueCount, double density) {
-final long size = (long) (valueCount * density);
-if (size < 1) {
-  throw new IllegalArgumentException("With the provided density and value 
count, potential capacity of the data buffer is 0");
-}
+long size = Math.max((long)(valueCount * density), 1L);
+
 if (size > MAX_ALLOCATION_SIZE) {
   throw new OversizedAllocationException("Requested amount of memory is 
more than max allowed");
 }
+
 valueAllocationSizeInBytes = (int) size;
 validityAllocationSizeInBytes = getValidityBufferSizeFromCount(valueCount);
 /* to track the end offset of last data element in vector, we need
@@ -489,6 +489,7 @@ public void reallocDataBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+assert newAllocationSize >= 1;
 
 if (newAllocationSize > MAX_ALLOCATION_SIZE) {
   throw new OversizedAllocationException("Unable to expand the buffer");
@@ -541,6 +542,7 @@ private ArrowBuf reallocBufferHelper(ArrowBuf buffer, final 
boolean offsetBuffer
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = 

[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386913#comment-16386913
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172358637
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that supports density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
+ * null (no lists) or empty lists. This helps in tightly controlling
+ * the memory we provision for inner data vector.
+ *
+ * Similar analogy is applicable for VarCharVector where the capacity
+ * of the data buffer can be controlled using density multiplier
+ * instead of default multiplier of 8 (default size of average
+ * varchar length).
+ *
+ * Also from container vectors, we propagate the density down
+ * to the inner vectors so that they can use it appropriately.
+ */
+public interface DensityAwareVector {
 
 Review comment:
   I agree. This is not the only interface implemented on its own (without 
subclassing ValueVector). We have NullableVectorDefinitionSetter that provides 
the method setIndexDefined(index)
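
   As a hedged aside, the capacity rule that the javadoc and the diff describe 
reduces to a small formula; the sketch below is illustrative Python, not the 
Java implementation, and the maximum allocation size is an assumed value:

{code:python}
# Hedged sketch of the density-driven capacity rule described above.
MAX_ALLOCATION_SIZE = 1 << 31  # assumed cap for the sketch

def initial_data_capacity(value_count, density):
    # Never provision less than one byte, even for very sparse data.
    size = max(int(value_count * density), 1)
    if size > MAX_ALLOCATION_SIZE:
        raise MemoryError("Requested amount of memory is more than max allowed")
    return size

print(initial_data_capacity(10, 10))    # dense lists: 100
print(initial_data_capacity(10, 0.1))   # mostly null/empty lists: 1
{code}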


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386900#comment-16386900
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172356587
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that supports density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
+ * null (no lists) or empty lists. This helps in tightly controlling
+ * the memory we provision for inner data vector.
+ *
+ * Similar analogy is applicable for VarCharVector where the capacity
+ * of the data buffer can be controlled using density multiplier
+ * instead of default multiplier of 8 (default size of average
+ * varchar length).
+ *
+ * Also from container vectors, we propagate the density down
+ * to the inner vectors so that they can use it appropriately.
+ */
+public interface DensityAwareVector {
 
 Review comment:
   Ok. My feedback is that while the current implementation is simple, the 
interface doesn't feel very well designed - if the interface is called 
"DensityAwareVector", I would expect it to have "Vector"-like behavior rather 
than just a single function.
   
   I prefer well-designed interfaces, but I am ok with addressing this later as I 
don't see this being a blocker for this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386885#comment-16386885
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172354537
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that supports density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
+ * null (no lists) or empty lists. This helps in tightly controlling
+ * the memory we provision for inner data vector.
+ *
+ * Similar analogy is applicable for VarCharVector where the capacity
+ * of the data buffer can be controlled using density multiplier
+ * instead of default multiplier of 8 (default size of average
+ * varchar length).
+ *
+ * Also from container vectors, we propagate the density down
+ * to the inner vectors so that they can use it appropriately.
+ */
+public interface DensityAwareVector {
 
 Review comment:
   I would like to refrain from changing the vector hierarchy at this point for 
this small change. A standalone interface does the job. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386882#comment-16386882
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-370593461
 
 
   @siddharthteotia WDYT on 
https://github.com/apache/arrow/pull/1646#discussion_r172282525


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386875#comment-16386875
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172352552
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that supports density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Ok.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386855#comment-16386855
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

mitar commented on issue #1705: ARROW-2250: [Python] Do not create a subprocess 
for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705#issuecomment-370588616
 
 
   Yes. :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT and TERM signal to a parent plasma store 
> process (Python one) it terminates it without cleaning the child process. 
> This makes it hard to run plasma store in non-interactive mode. Inside shell 
> ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386848#comment-16386848
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

wesm commented on issue #1705: ARROW-2250: [Python] Do not create a subprocess 
for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705#issuecomment-370587212
 
 
   I see, this only impacts the `plasma_store` executable created by distutils 
(https://github.com/apache/arrow/blob/master/python/setup.py#L462). OK, makes 
sense, thank you


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT and TERM signal to a parent plasma store 
> process (Python one) it terminates it without cleaning the child process. 
> This makes it hard to run plasma store in non-interactive mode. Inside shell 
> ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386840#comment-16386840
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

mitar commented on issue #1705: ARROW-2250: [Python] Do not create a subprocess 
for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705#issuecomment-370586030
 
 
   See issue ARROW-2250. The issue is that you currently have two processes when 
you run `plasma_store`: an outer Python process which is not doing anything 
besides `wait()` on the child process. If that is the case, just replace the 
parent with the child, so instead of two separate processes you have one. This 
is what you/we want.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT and TERM signal to a parent plasma store 
> process (Python one) it terminates it without cleaning the child process. 
> This makes it hard to run plasma store in non-interactive mode. Inside shell 
> ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386826#comment-16386826
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

wesm commented on issue #1705: ARROW-2250: [Python] Do not create a subprocess 
for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705#issuecomment-370583795
 
 
   Could you explain the rationale for this change? My understanding is that 
the Plasma store is intended to run as a separate process; we would be remiss 
to be testing it operating in some other mode


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT and TERM signal to a parent plasma store 
> process (Python one) it terminates it without cleaning the child process. 
> This makes it hard to run plasma store in non-interactive mode. Inside shell 
> ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2264) Efficiently serialize numpy arrays with dtype of unicode fixed length string

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2264:

Component/s: Python

> Efficiently serialize numpy arrays with dtype of unicode fixed length string
> 
>
> Key: ARROW-2264
> URL: https://issues.apache.org/jira/browse/ARROW-2264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>
> Looking at the numpy array serialization code it seems that if I have a dtype 
> like "<U3", the serialization is not an efficient one.
> Example:
> {{>>> np.array(['aaa', 'bbb'])}}
> {{array(['aaa', 'bbb'], dtype='<U3')}}
> This should be able to work, no? It has fixed offsets and memory layout.
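
A hedged illustration of the "fixed offsets and memory layout" point above (this 
is plain numpy, not pyarrow serialization code): a "<U3" array stores each 
element as a fixed 12-byte item (4 bytes per UCS-4 code point), so the layout is 
indeed fixed-width.

{code:python}
import numpy as np

arr = np.array(['aaa', 'bbb'])
print(arr.dtype)           # <U3
print(arr.dtype.itemsize)  # 12
print(arr.strides)         # (12,)
{code}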



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2265) [Python] Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2265:

Summary: [Python] Serializing subclasses of np.ndarray returns a 
np.ndarray.  (was: Serializing subclasses of np.ndarray returns a np.ndarray.)

> [Python] Serializing subclasses of np.ndarray returns a np.ndarray.
> ---
>
> Key: ARROW-2265
> URL: https://issues.apache.org/jira/browse/ARROW-2265
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>
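For reference, a minimal sketch of the behaviour the summary describes (based only on the issue title; the MyArray class is made up for illustration):

{code}
import numpy as np
import pyarrow as pa

class MyArray(np.ndarray):
    """Trivial ndarray subclass, used only to illustrate the report."""
    pass

arr = np.arange(3).view(MyArray)
restored = pa.deserialize(pa.serialize(arr).to_buffer())
# Per this report, `restored` comes back as a plain numpy.ndarray,
# not a MyArray.
print(type(arr), type(restored))
{code}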




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread Mitar (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386724#comment-16386724
 ] 

Mitar commented on ARROW-2250:
--

I made the observation that the parent process is unnecessary and opened a pull 
request which simply replaces it with the plasma store executable. In this way 
all future signals are handled directly by the process running that executable.

This makes everything cleaner and means that no signal passing or cleanup is 
needed.
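
A minimal sketch of that approach (the executable path and the -s/-m flags are assumptions based on this thread, not necessarily what PR #1705 does):

{code}
import os

def run_plasma_store(plasma_store_executable, socket_name, memory_bytes):
    # Replace the current Python process with the plasma_store binary, so
    # INT/TERM signals go straight to the store and there is no child
    # process left to clean up.
    os.execv(plasma_store_executable,
             [plasma_store_executable,
              "-s", socket_name,
              "-m", str(memory_bytes)])
{code}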

> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT or TERM signal to the parent plasma store 
> process (the Python one), it terminates without cleaning up the child process. 
> This makes it hard to run the plasma store in non-interactive mode. Inside a 
> shell, ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386719#comment-16386719
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

mitar commented on issue #1705: ARROW-2250: [Python] Do not create a subprocess 
for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705#issuecomment-370555004
 
 
   cc @robertnishihara


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT or TERM signal to the parent plasma store 
> process (the Python one), it terminates without cleaning up the child process. 
> This makes it hard to run the plasma store in non-interactive mode. Inside a 
> shell, ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386716#comment-16386716
 ] 

ASF GitHub Bot commented on ARROW-2250:
---

mitar opened a new pull request #1705: ARROW-2250: [Python] Do not create a 
subprocess for plasma but just use existing process
URL: https://github.com/apache/arrow/pull/1705
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT or TERM signal to the parent plasma store 
> process (the Python one), it terminates without cleaning up the child process. 
> This makes it hard to run the plasma store in non-interactive mode. Inside a 
> shell, ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals

2018-03-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2250:
--
Labels: pull-request-available  (was: )

> plasma_store process should cleanup on INT and TERM signals
> ---
>
> Key: ARROW-2250
> URL: https://issues.apache.org/jira/browse/ARROW-2250
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Mitar
>Priority: Major
>  Labels: pull-request-available
>
> Currently, if you send an INT or TERM signal to the parent plasma store 
> process (the Python one), it terminates without cleaning up the child process. 
> This makes it hard to run the plasma store in non-interactive mode. Inside a 
> shell, ctrl-c kills both processes.
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably 
> something nicer should be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386713#comment-16386713
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172318466
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Which is why saying "average list size" conveys the right meaning.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386712#comment-16386712
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172318347
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   I think you are trying to generalize the meaning of density whereas the list 
vector could be nested too. We propagate density down the tree. So here we just 
talk about the immediate inner vector. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386707#comment-16386707
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-370553166
 
 
   @pitrou [here](https://github.com/apache/arrow/commit/17ee3121ff6843dd5749aa9d461abbad953cf5ef) 
are the changes to solve the Jenkins failure, as discussed.
   [Passed Appveyor build](https://ci.appveyor.com/project/MaxRisuhin/arrow/build/job/8lnl668fbpadl84s)
   
   P.S. I've tried to create a PR into your pitrou:ARROW-2238-cmake-clcache 
remote branch, but for some reason it's not listed in the targets during PR 
creation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386694#comment-16386694
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172316594
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   I see. Thanks for the explanation.
   
   > Density is the average size of list per position in the List vector
   
   This is fine. 
   
   >   density value of 10 implies each position in the list vector has a list 
of 10 values.
   
   If I understand correctly, a density value of 10 can be either:
   * 10 sub lists of 10 values each
   * 1 sub list of 100 values, 9 null sublists
   * ...
   As long as the average size of the sub lists equals the density.
   
   Is that correct? If so, can we make it clear in the doc?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386684#comment-16386684
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-370549664
 
 
   @BryanCutler , @icexelloss , the latest commit addresses the comments w.r.t 
realloc and nextPowerOfTwo.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2265) Serializing subclasses of np.ndarray returns a np.ndarray.

2018-03-05 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2265:
---

 Summary: Serializing subclasses of np.ndarray returns a np.ndarray.
 Key: ARROW-2265
 URL: https://issues.apache.org/jira/browse/ARROW-2265
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Robert Nishihara
Assignee: Robert Nishihara






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2264) Efficiently serialize numpy arrays with dtype of unicode fixed length string

2018-03-05 Thread Mitar (JIRA)
Mitar created ARROW-2264:


 Summary: Efficiently serialize numpy arrays with dtype of unicode 
fixed length string
 Key: ARROW-2264
 URL: https://issues.apache.org/jira/browse/ARROW-2264
 Project: Apache Arrow
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Mitar


Looking at the numpy array serialization code it seems that if I have a dtype 
like "<U3" (fixed-length unicode string) it is not serialized with the 
efficient path but falls back to a less efficient one.
{{Example:}}{{>>> np.array(['aaa', 'bbb'])}}
{{array(['aaa', 'bbb'], dtype='<U3')}}
This should be able to work, no? It has fixed offsets and memory layout.
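
A small illustration of why this should be feasible (illustrative only, not pyarrow's implementation): the data of a fixed-length unicode array is a single contiguous buffer of fixed-size items, so it can be rebuilt from raw bytes plus dtype and shape.

{code}
import numpy as np

arr = np.array(['aaa', 'bbb'])            # dtype='<U3', 12 bytes per item
raw = arr.tobytes()                       # contiguous UCS-4 data
restored = np.ndarray(shape=arr.shape, dtype=arr.dtype, buffer=raw)
assert (restored == arr).all()
{code}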

[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386647#comment-16386647
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172307303
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Density is the average size of list per position in the List vector as 
mentioned in the doc. For your example, density is 1. I don't think it is a 
good idea to generalize the purpose or usage of density. The main purpose of 
density was to be conservative about the value capacity provisioned for inner 
vectors. 
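
A tiny numeric check of that definition, purely illustrative:

{code}
def density(sublist_sizes):
    # Average sub-list size per position; nulls and empty lists count as 0.
    return sum(sublist_sizes) / float(len(sublist_sizes))

density([10] + [0] * 9)   # 1.0 -- the example discussed above
density([1] * 10)         # 1.0 -- same density, different distribution
{code}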


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-05 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2263:
---

 Summary: [Python] test_cython.py fails if pyarrow is not in import 
path (e.g. with inplace builds)
 Key: ARROW-2263
 URL: https://issues.apache.org/jira/browse/ARROW-2263
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


see 

{code}
$ py.test pyarrow/tests/test_cython.py 
= test session starts 
=
platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
collected 1 item
  

pyarrow/tests/test_cython.py F  
[100%]

== FAILURES 
===
___ test_cython_api 
___

tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')

@pytest.mark.skipif(
'ARROW_HOME' not in os.environ,
reason='ARROW_HOME environment variable not defined')
def test_cython_api(tmpdir):
"""
Basic test for the Cython API.
"""
pytest.importorskip('Cython')

ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')

test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)

with tmpdir.as_cwd():
# Set up temporary workspace
pyx_file = 'pyarrow_cython_example.pyx'
shutil.copyfile(os.path.join(here, pyx_file),
os.path.join(str(tmpdir), pyx_file))
# Create setup.py file
if os.name == 'posix':
compiler_opts = ['-std=c++11']
else:
compiler_opts = []
setup_code = setup_template.format(pyx_file=pyx_file,
   compiler_opts=compiler_opts,
   test_ld_path=test_ld_path)
with open('setup.py', 'w') as f:
f.write(setup_code)

# Compile extension module
subprocess.check_call([sys.executable, 'setup.py',
>  'build_ext', '--inplace'])

pyarrow/tests/test_cython.py:90: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _

popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
'build_ext', '--inplace'],)
kwargs = {}, retcode = 1
cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
'build_ext', '--inplace']

def check_call(*popenargs, **kwargs):
"""Run command with arguments.  Wait for command to complete.  If
the exit code was zero then return, otherwise raise
CalledProcessError.  The CalledProcessError object will have the
return code in the returncode attribute.

The arguments are the same as for the call function.  Example:

check_call(["ls", "-l"])
"""
retcode = call(*popenargs, **kwargs)
if retcode:
cmd = kwargs.get("args")
if cmd is None:
cmd = popenargs[0]
>   raise CalledProcessError(retcode, cmd)
E   subprocess.CalledProcessError: Command 
'['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
'--inplace']' returned non-zero exit status 1.

../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
CalledProcessError
 Captured stderr call 
-
Traceback (most recent call last):
  File "setup.py", line 7, in 
import pyarrow as pa
ModuleNotFoundError: No module named 'pyarrow'
== 1 failed in 0.23 seconds 
===
{code}

I encountered this bit of brittleness in a fresh install where I had not run 
{{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area
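
One way to make the test self-contained (a sketch assuming the inplace build lives in the pyarrow source tree; not necessarily the fix adopted) is to propagate that directory to the subprocess via PYTHONPATH:

{code}
import os
import subprocess
import sys

# Hypothetical tweak for test_cython.py: put the directory containing the
# inplace-built pyarrow package on PYTHONPATH so the generated setup.py can
# `import pyarrow` even when pyarrow is not installed.
pyarrow_root = os.path.abspath(
    os.path.join(os.path.dirname(__file__), '..', '..'))
env = dict(os.environ)
env['PYTHONPATH'] = os.pathsep.join(
    filter(None, [pyarrow_root, env.get('PYTHONPATH')]))
subprocess.check_call(
    [sys.executable, 'setup.py', 'build_ext', '--inplace'], env=env)
{code}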



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386634#comment-16386634
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172304362
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   What if I have a vector such that 1 out of 10 positions has a list of size 
10 and remaining positions are null, what would the density be?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386632#comment-16386632
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172304189
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386620#comment-16386620
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172302307
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   The doc doesn't mix that. It clearly indicates the purpose of density:
   
   For example, a density value of 10 implies each position in the list
   vector has a list of 10 values. A density value of 0.1 implies out of
   10 positions in the list vector, 1 position has a list of size 1 and
   remaining positions are null (no lists) or empty lists. This helps in
   tightly controlling the memory we provision for inner data vector.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Albert Shieh (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386615#comment-16386615
 ] 

Albert Shieh commented on ARROW-2122:
-

How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'?
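
A minimal sketch of that idea; pytz._FixedOffset and the _minutes attribute are private pytz details and the '+{:d}' prefix is just the suggestion above, so treat all of it as an assumption rather than an existing pyarrow API:

{code}
import pytz

def encode_tz(tz):
    # Fixed offsets become a signed minute count with a '+'/'-' prefix;
    # named pytz zones keep their zone name.
    if isinstance(tz, pytz._FixedOffset):
        return '+{:d}'.format(tz._minutes)
    return tz.zone

def decode_tz(name):
    if name.startswith(('+', '-')):
        return pytz.FixedOffset(int(name))
    return pytz.timezone(name)
{code}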

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Albert Shieh (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386615#comment-16386615
 ] 

Albert Shieh edited comment on ARROW-2122 at 3/5/18 7:31 PM:
-

How about 
{code}
'+{:d}'.format(tz._minutes)
{code}
or some other prefix?


was (Author: adshieh):
How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'?

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386614#comment-16386614
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172300509
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Sounds like, for a list vector of 10 values, these two have the same density == 1:
   * 10 sub lists of size 1
   * 1 sub list of size 10, 9 null sub lists
   
   Is that a correct understanding? The doc seems to mix the two cases, so it's 
not very clear to me.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1982) [Python] Return parquet statistics min/max as values instead of strings

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386608#comment-16386608
 ] 

ASF GitHub Bot commented on ARROW-1982:
---

wesm closed pull request #1698: ARROW-1982: [Python] Coerce Parquet statistics 
as bytes to more useful Python scalar types
URL: https://github.com/apache/arrow/pull/1698
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index e513e1d92..101fcd165 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -70,6 +70,31 @@ cdef class RowGroupStatistics:
self.num_values,
self.physical_type)
 
+cdef inline _cast_statistic(self, object value):
+# Input value is bytes
+cdef ParquetType physical_type = self.statistics.get().physical_type()
+if physical_type == ParquetType_BOOLEAN:
+return bool(int(value))
+elif physical_type == ParquetType_INT32:
+return int(value)
+elif physical_type == ParquetType_INT64:
+return int(value)
+elif physical_type == ParquetType_INT96:
+# Leave as PyBytes
+return value
+elif physical_type == ParquetType_FLOAT:
+return float(value)
+elif physical_type == ParquetType_DOUBLE:
+return float(value)
+elif physical_type == ParquetType_BYTE_ARRAY:
+# Leave as PyBytes
+return value
+elif physical_type == ParquetType_FIXED_LEN_BYTE_ARRAY:
+# Leave as PyBytes
+return value
+else:
+raise ValueError('Unknown physical ParquetType')
+
 property has_min_max:
 
 def __get__(self):
@@ -82,7 +107,7 @@ cdef class RowGroupStatistics:
 encode_min = self.statistics.get().EncodeMin()
 
 min_value = FormatStatValue(raw_physical_type, encode_min.c_str())
-return frombytes(min_value)
+return self._cast_statistic(min_value)
 
 property max:
 
@@ -91,7 +116,7 @@ cdef class RowGroupStatistics:
 encode_max = self.statistics.get().EncodeMax()
 
 max_value = FormatStatValue(raw_physical_type, encode_max.c_str())
-return frombytes(max_value)
+return self._cast_statistic(max_value)
 
 property null_count:
 
diff --git a/python/pyarrow/tests/test_parquet.py 
b/python/pyarrow/tests/test_parquet.py
index cec01c859..a3da05fe3 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -26,7 +26,7 @@
 
 import pytest
 
-from pyarrow.compat import guid, u, BytesIO, unichar, frombytes
+from pyarrow.compat import guid, u, BytesIO, unichar
 from pyarrow.tests import util
 from pyarrow.filesystem import LocalFileSystem
 import pyarrow as pa
@@ -524,20 +524,20 @@ def test_parquet_metadata_api():
 @pytest.mark.parametrize(
 'data, dtype, min_value, max_value, null_count, num_values',
 [
-([1, 2, 2, None, 4], np.uint8, u'1', u'4', 1, 4),
-([1, 2, 2, None, 4], np.uint16, u'1', u'4', 1, 4),
-([1, 2, 2, None, 4], np.uint32, u'1', u'4', 1, 4),
-([1, 2, 2, None, 4], np.uint64, u'1', u'4', 1, 4),
-([-1, 2, 2, None, 4], np.int16, u'-1', u'4', 1, 4),
-([-1, 2, 2, None, 4], np.int32, u'-1', u'4', 1, 4),
-([-1, 2, 2, None, 4], np.int64, u'-1', u'4', 1, 4),
-([-1.1, 2.2, 2.3, None, 4.4], np.float32, u'-1.1', u'4.4', 1, 4),
-([-1.1, 2.2, 2.3, None, 4.4], np.float64, u'-1.1', u'4.4', 1, 4),
+([1, 2, 2, None, 4], np.uint8, 1, 4, 1, 4),
+([1, 2, 2, None, 4], np.uint16, 1, 4, 1, 4),
+([1, 2, 2, None, 4], np.uint32, 1, 4, 1, 4),
+([1, 2, 2, None, 4], np.uint64, 1, 4, 1, 4),
+([-1, 2, 2, None, 4], np.int16, -1, 4, 1, 4),
+([-1, 2, 2, None, 4], np.int32, -1, 4, 1, 4),
+([-1, 2, 2, None, 4], np.int64, -1, 4, 1, 4),
+([-1.1, 2.2, 2.3, None, 4.4], np.float32, -1.1, 4.4, 1, 4),
+([-1.1, 2.2, 2.3, None, 4.4], np.float64, -1.1, 4.4, 1, 4),
 (
 [u'', u'b', unichar(1000), None, u'aaa'],
-str, u' ', frombytes((unichar(1000) + u' ').encode('utf-8')), 1, 4
+str, b' ', (unichar(1000) + u' ').encode('utf-8'), 1, 4
 ),
-([True, False, False, True, True], np.bool, u'0', u'1', 0, 5),
+([True, False, False, True, True], np.bool, False, True, 0, 5),
 ]
 )
 def test_parquet_column_statistics_api(


 


This is an automated message from the Apache Git Service.
To respond to the message, please 

[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386605#comment-16386605
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172300509
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Sounds like, for a list vector of 10 values, these two have the same density == 1:
   * 10 sub lists of size 1
   * 1 sub list of size 10, 9 null sub lists
   
   Is that a correct understanding? The doc seems to mix the two cases, so it's 
not very clear to me.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386601#comment-16386601
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r17229
 
 

 ##
 File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
 ##
 @@ -811,14 +811,9 @@ public void testSetInitialCapacity() {
   assertEquals(512, vector.getValueCapacity());
   assertEquals(8, vector.getDataVector().getValueCapacity());
 
-  boolean error = false;
-  try {
-vector.setInitialCapacity(5, 0.1);
-  } catch (IllegalArgumentException e) {
-error = true;
-  } finally {
-assertTrue(error);
-  }
+  vector.setInitialCapacity(5, 0.1);
+  vector.allocateNew();
+  assertEquals(7, vector.getValueCapacity());
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386592#comment-16386592
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172299311
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   valueCount * density is used for computing the value capacity of the inner 
vector. If the List vector has a valueCount of 10, we use the density to 
compute the target value count for the inner vector, and a density value < 1 
helps in provisioning while taking NULL values into account.
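
Roughly, in Python for brevity (the actual Java code applies its own allocation rounding on top of this, so treat it as a sketch only):

{code}
import math

def inner_value_capacity(value_count, density):
    # Provision roughly `density` elements per outer position, never below 1
    # ("density driven capacity is never less than 1").
    return max(1, int(math.ceil(value_count * density)))

inner_value_capacity(10, 10)    # 100
inner_value_capacity(10, 0.1)   # 1
{code}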


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1982) [Python] Return parquet statistics min/max as values instead of strings

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386591#comment-16386591
 ] 

ASF GitHub Bot commented on ARROW-1982:
---

wesm commented on a change in pull request #1698: ARROW-1982: [Python] Coerce 
Parquet statistics as bytes to more useful Python scalar types
URL: https://github.com/apache/arrow/pull/1698#discussion_r172299264
 
 

 ##
 File path: python/pyarrow/_parquet.pyx
 ##
 @@ -70,6 +70,28 @@ cdef class RowGroupStatistics:
self.num_values,
self.physical_type)
 
+    cdef inline _cast_statistic(self, object value):
+        cdef ParquetType physical_type = self.statistics.get().physical_type()
+        if physical_type == ParquetType_BOOLEAN:
+            return bool(int(value))
+        elif physical_type == ParquetType_INT32:
+            return int(value)
+        elif physical_type == ParquetType_INT64:
+            return int(value)
+        elif physical_type == ParquetType_INT96:
+            # TODO
 
 Review comment:
   OK, value is bytes here already so this can remain as is; I'll remove the TODO.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Return parquet statistics min/max as values instead of strings
> ---
>
> Key: ARROW-1982
> URL: https://issues.apache.org/jira/browse/ARROW-1982
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jim Crist
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently `min` and `max` column statistics are returned as formatted strings 
> of the _physical type_. This makes using them in python a bit tricky, as the 
> strings need to be parsed as the proper _logical type_. Observe:
> {code}
> In [20]: import pandas as pd
> In [21]: df = pd.DataFrame({'a': [1, 2, 3],
> ...:'b': ['a', 'b', 'c'],
> ...:'c': [pd.Timestamp('1991-01-01')]*3})
> ...:
> In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
> In [23]: from pyarrow import parquet as pq
> In [24]: f = pq.ParquetFile('temp.parquet')
> In [25]: rg = f.metadata.row_group(0)
> In [26]: rg.column(0).statistics.min  # string instead of integer
> Out[26]: '1'
> In [27]: rg.column(1).statistics.min  # weird space added after value due to 
> formatter
> Out[27]: 'a '
> In [28]: rg.column(2).statistics.min  # formatted as physical type (int) 
> instead of logical (datetime)
> Out[28]: '66268800'
> {code}
> Since the type information is known, it should be possible to convert these 
> to arrow values instead of strings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386547#comment-16386547
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172291183
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
 ##
 @@ -99,6 +99,17 @@ public void setInitialCapacity(int numRecords) {
 }
   }
 
+  @Override
+  public void setInitialCapacity(int valueCount, double density) {
+    for (final ValueVector vector : (Iterable<ValueVector>) this) {
+      if (vector instanceof DensityAwareVector) {
 
 Review comment:
   Ok, SGTM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1929) [C++] Move various Arrow testing utility code from Parquet to Arrow codebase

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386586#comment-16386586
 ] 

ASF GitHub Bot commented on ARROW-1929:
---

wesm commented on issue #1697: ARROW-1929: [C++] Copy over testing utility code 
from PARQUET-1092
URL: https://github.com/apache/arrow/pull/1697#issuecomment-370532118
 
 
   Merging, the Travis CI failure appears transient: 
https://travis-ci.org/apache/arrow/jobs/34936#L5479


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Move various Arrow testing utility code from Parquet to Arrow codebase
> 
>
> Key: ARROW-1929
> URL: https://issues.apache.org/jira/browse/ARROW-1929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/apache/parquet-cpp/pull/426 and comments within



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1929) [C++] Move various Arrow testing utility code from Parquet to Arrow codebase

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386588#comment-16386588
 ] 

ASF GitHub Bot commented on ARROW-1929:
---

wesm closed pull request #1697: ARROW-1929: [C++] Copy over testing utility 
code from PARQUET-1092
URL: https://github.com/apache/arrow/pull/1697
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index c3ac08286..bda1946c6 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -2480,19 +2480,19 @@ TEST_F(TestListArray, TestFromArrays) {
 
   ListArray expected1(list_type, length, offsets1->data()->buffers[1], values,
   offsets1->data()->buffers[0], 0);
-  AssertArraysEqual(expected1, *list1);
+  test::AssertArraysEqual(expected1, *list1);
 
   // Use null bitmap from offsets3, but clean offsets from non-null version
   ListArray expected3(list_type, length, offsets1->data()->buffers[1], values,
   offsets3->data()->buffers[0], 1);
-  AssertArraysEqual(expected3, *list3);
+  test::AssertArraysEqual(expected3, *list3);
 
   // Check that the last offset bit is zero
   ASSERT_TRUE(BitUtil::BitNotSet(list3->null_bitmap()->data(), length + 1));
 
   ListArray expected4(list_type, length, offsets2->data()->buffers[1], values,
   offsets4->data()->buffers[0], 1);
-  AssertArraysEqual(expected4, *list4);
+  test::AssertArraysEqual(expected4, *list4);
 
   // Test failure modes
 
diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc
index af7441682..24c8d5e15 100644
--- a/cpp/src/arrow/table-test.cc
+++ b/cpp/src/arrow/table-test.cc
@@ -116,11 +116,11 @@ TEST_F(TestChunkedArray, SliceEquals) {
 
   std::shared_ptr<ChunkedArray> slice = one_->Slice(125, 50);
   ASSERT_EQ(slice->length(), 50);
-  ASSERT_TRUE(slice->Equals(one_->Slice(125, 50)));
+  test::AssertChunkedEqual(*one_->Slice(125, 50), *slice);
 
   std::shared_ptr<ChunkedArray> slice2 = one_->Slice(75)->Slice(25)->Slice(25, 50);
   ASSERT_EQ(slice2->length(), 50);
-  ASSERT_TRUE(slice2->Equals(slice));
+  test::AssertChunkedEqual(*slice, *slice2);
 }
 
 class TestColumn : public TestChunkedArray {
@@ -390,7 +390,7 @@ TEST_F(TestTable, ConcatenateTables) {
 
   ASSERT_OK(ConcatenateTables({t1, t2}, &result));
   ASSERT_OK(Table::FromRecordBatches({batch1, batch2}, &expected));
-  ASSERT_TRUE(result->Equals(*expected));
+  test::AssertTablesEqual(*expected, *result);
 
   // Error states
   std::vector<std::shared_ptr<Table>> empty_tables;
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h
index 1a3480848..ab68fd442 100644
--- a/cpp/src/arrow/test-util.h
+++ b/cpp/src/arrow/test-util.h
@@ -35,6 +35,7 @@
 #include "arrow/memory_pool.h"
 #include "arrow/pretty_print.h"
 #include "arrow/status.h"
+#include "arrow/table.h"
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/bit-util.h"
@@ -77,6 +78,18 @@ namespace arrow {
 
 using ArrayVector = std::vector<std::shared_ptr<Array>>;
 
+#define ASSERT_ARRAYS_EQUAL(LEFT, RIGHT)                                               \
+  do {                                                                                  \
+    if (!(LEFT).Equals((RIGHT))) {                                                      \
+      std::stringstream pp_result;                                                      \
+      std::stringstream pp_expected;                                                    \
+                                                                                        \
+      EXPECT_OK(PrettyPrint(RIGHT, 0, &pp_result));                                     \
+      EXPECT_OK(PrettyPrint(LEFT, 0, &pp_expected));                                    \
+      FAIL() << "Got: \n" << pp_result.str() << "\nExpected: \n" << pp_expected.str();  \
+    }                                                                                   \
+  } while (false)
+
 namespace test {
 
 template 
@@ -288,6 +301,62 @@ Status MakeRandomBytePoolBuffer(int64_t length, MemoryPool* pool,
   return Status::OK();
 }
 
+void AssertArraysEqual(const Array& expected, const Array& actual) {
+  ASSERT_ARRAYS_EQUAL(expected, actual);
+}
+
+void AssertChunkedEqual(const ChunkedArray& expected, const ChunkedArray& actual) {
+  ASSERT_EQ(expected.num_chunks(), actual.num_chunks()) << "# chunks unequal";
+  if (!actual.Equals(expected)) {
+    std::stringstream pp_result;
+    std::stringstream pp_expected;
+
+    for (int i = 0; i < actual.num_chunks(); ++i) {
+      auto c1 = actual.chunk(i);
+      auto c2 = expected.chunk(i);
+      if (!c1->Equals(*c2)) {
+        EXPECT_OK(::arrow::PrettyPrint(*c1, 0, &pp_result));
+

[jira] [Resolved] (ARROW-1929) [C++] Move various Arrow testing utility code from Parquet to Arrow codebase

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1929.
-
Resolution: Fixed

Issue resolved by pull request 1697
[https://github.com/apache/arrow/pull/1697]

> [C++] Move various Arrow testing utility code from Parquet to Arrow codebase
> 
>
> Key: ARROW-1929
> URL: https://issues.apache.org/jira/browse/ARROW-1929
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see https://github.com/apache/parquet-cpp/pull/426 and comments within



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386581#comment-16386581
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172297486
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   I think there are probably too many such `newAllocationSize = 
Math.max(newAllocationSize, 1);` checks across the code. Safeguarding realloc is 
fine, but the way it's currently implemented feels too scattered and error-prone. 
(Someone could forget to add this check in some vector in the future, for 
instance.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386573#comment-16386573
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172295709
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
 ##
 @@ -174,14 +174,14 @@ public void setInitialCapacity(int valueCount) {
* @param valueCount desired number of elements in the vector
* @param density average number of bytes per variable width element
*/
+  @Override
   public void setInitialCapacity(int valueCount, double density) {
-    final long size = (long) (valueCount * density);
-    if (size < 1) {
-      throw new IllegalArgumentException("With the provided density and value count, potential capacity of the data buffer is 0");
-    }
+    long size = Math.max((long) (valueCount * density), 1L);
 
 Review comment:
   Ok. I don't have a strong opinion on the behavior for the case `valueCount * 
density < 1`; I guess what you are saying makes sense. 
   
   In that case, can we make that clear in the documentation (maybe at the 
interface level)? Also, is this consistent with 
`setInitialCapacity(valueCount)`? i.e. does `setInitialCapacity(0)` also adjust 
to 1?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386569#comment-16386569
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172295366
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   I just don't feel a strong need for it; safeguarding realloc seems perfectly 
fine to me.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386558#comment-16386558
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172293201
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   If we want to be safe, I think we can create `nextPowerOfTwoZeroSafe` to not 
affect the existing users of `nextPowerOfTwo`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386549#comment-16386549
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

BryanCutler commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172291035
 
 

 ##
 File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
 ##
 @@ -811,14 +811,9 @@ public void testSetInitialCapacity() {
   assertEquals(512, vector.getValueCapacity());
   assertEquals(8, vector.getDataVector().getValueCapacity());
 
-  boolean error = false;
-  try {
-vector.setInitialCapacity(5, 0.1);
-  } catch (IllegalArgumentException e) {
-error = true;
-  } finally {
-assertTrue(error);
-  }
+  vector.setInitialCapacity(5, 0.1);
+  vector.allocateNew();
+  assertEquals(7, vector.getValueCapacity());
 
 Review comment:
   how about adding `assertEquals(1, 
vector.getDataVector().getValueCapacity())` and also maybe a brief explanation 
of the values?
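
   A rough sketch of how those assertions could read together with the suggested 
one. The vector setup (RootAllocator plus ListVector.empty) is assumed from the 
surrounding test class rather than shown in this diff, and the expected outer 
capacity of 7 comes from the power-of-two rounding of the underlying buffers:

{code:java}
// Sketch only; assumes static org.junit.Assert.assertEquals,
// org.apache.arrow.memory.RootAllocator and ListVector.empty(...)
// as used elsewhere in TestListVector.
try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     ListVector vector = ListVector.empty("list", allocator)) {
  // 5 values at density 0.1: the inner target is max((long) (5 * 0.1), 1) = 1,
  // so no exception is thrown any more.
  vector.setInitialCapacity(5, 0.1);
  vector.allocateNew();

  // Outer value capacity after rounding the offset buffer up to a power of two.
  assertEquals(7, vector.getValueCapacity());
  // Inner data vector capacity is clamped to 1 (the suggested extra assertion).
  assertEquals(1, vector.getDataVector().getValueCapacity());
}
{code}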


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386532#comment-16386532
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172289379
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Should that function be fixed as part of this patch? I am not even sure what 
the implications of that are in other parts of Arrow and/or downstream 
consumers. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1982) [Python] Return parquet statistics min/max as values instead of strings

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386518#comment-16386518
 ] 

ASF GitHub Bot commented on ARROW-1982:
---

xhochy commented on a change in pull request #1698: ARROW-1982: [Python] Coerce 
Parquet statistics as bytes to more useful Python scalar types
URL: https://github.com/apache/arrow/pull/1698#discussion_r172287015
 
 

 ##
 File path: python/pyarrow/_parquet.pyx
 ##
 @@ -70,6 +70,28 @@ cdef class RowGroupStatistics:
self.num_values,
self.physical_type)
 
+    cdef inline _cast_statistic(self, object value):
+        cdef ParquetType physical_type = self.statistics.get().physical_type()
+        if physical_type == ParquetType_BOOLEAN:
+            return bool(int(value))
+        elif physical_type == ParquetType_INT32:
+            return int(value)
+        elif physical_type == ParquetType_INT64:
+            return int(value)
+        elif physical_type == ParquetType_INT96:
+            # TODO
 
 Review comment:
   We should return also `bytes` here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Return parquet statistics min/max as values instead of strings
> ---
>
> Key: ARROW-1982
> URL: https://issues.apache.org/jira/browse/ARROW-1982
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Jim Crist
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently `min` and `max` column statistics are returned as formatted strings 
> of the _physical type_. This makes using them in python a bit tricky, as the 
> strings need to be parsed as the proper _logical type_. Observe:
> {code}
> In [20]: import pandas as pd
> In [21]: df = pd.DataFrame({'a': [1, 2, 3],
> ...:'b': ['a', 'b', 'c'],
> ...:'c': [pd.Timestamp('1991-01-01')]*3})
> ...:
> In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
> In [23]: from pyarrow import parquet as pq
> In [24]: f = pq.ParquetFile('temp.parquet')
> In [25]: rg = f.metadata.row_group(0)
> In [26]: rg.column(0).statistics.min  # string instead of integer
> Out[26]: '1'
> In [27]: rg.column(1).statistics.min  # weird space added after value due to 
> formatter
> Out[27]: 'a '
> In [28]: rg.column(2).statistics.min  # formatted as physical type (int) 
> instead of logical (datetime)
> Out[28]: '66268800'
> {code}
> Since the type information is known, it should be possible to convert these 
> to arrow values instead of strings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386515#comment-16386515
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

BryanCutler commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172286775
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
 ##
 @@ -99,6 +99,17 @@ public void setInitialCapacity(int numRecords) {
 }
   }
 
+  @Override
+  public void setInitialCapacity(int valueCount, double density) {
+    for (final ValueVector vector : (Iterable<ValueVector>) this) {
+      if (vector instanceof DensityAwareVector) {
 
 Review comment:
   I agree, I think it's best to just check the class instance where needed 
instead of delegating


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386511#comment-16386511
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172286159
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   I agree with Bryan.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386504#comment-16386504
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172285488
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Yeah, I think it makes sense that a density of 0.1 means 10% of the values are 
non-null. What I am not sure about is why the non-null value has size 1? 
It seems `valuecount * density` is used for both (1) the number of 
non-null sub-lists in the parent list and (2) the (average) length of the 
non-null sub-lists. What if these two values are different?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386492#comment-16386492
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172282743
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java
 ##
 @@ -174,14 +174,14 @@ public void setInitialCapacity(int valueCount) {
* @param valueCount desired number of elements in the vector
* @param density average number of bytes per variable width element
*/
+  @Override
   public void setInitialCapacity(int valueCount, double density) {
-    final long size = (long) (valueCount * density);
-    if (size < 1) {
-      throw new IllegalArgumentException("With the provided density and value count, potential capacity of the data buffer is 0");
-    }
+    long size = Math.max((long) (valueCount * density), 1L);
 
 Review comment:
   Because it is better to internally set the initial capacity to 1 as opposed 
to throwing an exception.
   In our code, we invoke this in a loop, dynamically adjusting the density 
value and stepping down the initial capacity, because we are working with a 
fixed memory reservation and limits.
   
   So setInitialCapacity() followed by allocateNew() might fail in the second 
step if there is not enough memory. We then restart by adjusting the density and 
stepping down the value count. Throwing an exception doesn't help in any case.
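
   To make the intent concrete, a hedged sketch of that caller-side loop follows. 
The step-down policy, the OutOfMemoryException handling, and the helper name are 
assumptions for illustration only; they are not part of this PR:

{code:java}
import org.apache.arrow.memory.OutOfMemoryException;
import org.apache.arrow.vector.complex.ListVector;

final class DensityBackoffSketch {
  /** Provision with a density estimate; on allocation failure, step down and retry. */
  static void allocateWithBackoff(ListVector vector, int valueCount, double density) {
    while (true) {
      try {
        vector.setInitialCapacity(valueCount, density);
        vector.allocateNew();
        return;  // buffers fit within the fixed memory reservation
      } catch (OutOfMemoryException e) {
        vector.clear();                            // release any partial allocation
        density = density / 2;                     // be more conservative next time
        valueCount = Math.max(valueCount / 2, 1);  // and target fewer rows
        if (valueCount == 1 && density < 1e-6) {
          throw e;  // give up: even a minimal batch does not fit
        }
      }
    }
  }
}
{code}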


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386499#comment-16386499
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

BryanCutler commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r17228
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Well, technically given 0 the next power of 2 should be 1.  So I think that 
function needs the fix and then the extra check here should be safe to remove.
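
   For reference, a zero-safe variant along the lines of the 
`nextPowerOfTwoZeroSafe` idea floated earlier in this thread could look roughly 
like this (a sketch, not Arrow's BaseAllocator code):

{code:java}
final class AllocationSizeSketch {
  /** Smallest power of two >= val, treating 0 (and negative input) as 1. */
  static long nextPowerOfTwoZeroSafe(long val) {
    if (val <= 1) {
      // As discussed above, the plain helper keeps 0 at 0, which defeats the
      // realloc doubling; returning 1 here removes the need for scattered guards.
      return 1;
    }
    // Round up: take the highest set bit of (val - 1) and shift it left once.
    return Long.highestOneBit(val - 1) << 1;
  }
}
{code}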


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386495#comment-16386495
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-370516903
 
 
   @BryanCutler , @icexelloss , I have addressed and responded to review 
comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386494#comment-16386494
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172283361
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   10 * 0.1 is the initial capacity we will provision for the inner vector in 
the mentioned example. So only 1 value is provisioned in the inner data vector 
for 10 outer positions in the list vector. Does that make sense? Think about a 
mix of null and non-null lists.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386493#comment-16386493
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on issue #1646: ARROW-2199: [JAVA] Control the memory 
allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#issuecomment-370516083
 
 
   I left some comments for the change. Sorry for the delay. (I forgot about 
this after I came back from vacation)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386491#comment-16386491
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172282525
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
+ * null (no lists) or empty lists. This helps in tightly controlling
+ * the memory we provision for inner data vector.
+ *
+ * Similar analogy is applicable for VarCharVector where the capacity
+ * of the data buffer can be controlled using density multiplier
+ * instead of default multiplier of 8 (default size of average
+ * varchar length).
+ *
+ * Also from container vectors, we propagate the density down
+ * to the inner vectors so that they can use it appropriately.
+ */
+public interface DensityAwareVector {
 
 Review comment:
   Should this be a sub-interface of `ValueVector`? It feels a bit strange that 
this interface is called `Vector` but doesn't define any vector methods.
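
   For context, the interface under discussion boils down to a single 
density-aware sizing method, roughly as follows (reconstructed from the quotes 
above, with the javadoc trimmed):

{code:java}
package org.apache.arrow.vector;

/** Vector that supports density-aware initial capacity settings. */
public interface DensityAwareVector {

  /**
   * Set the value count and density used to size the initial allocation;
   * inner data buffers are provisioned for roughly valueCount * density entries.
   */
  void setInitialCapacity(int valueCount, double density);
}
{code}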


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386490#comment-16386490
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172282525
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
+ * null (no lists) or empty lists. This helps in tightly controlling
+ * the memory we provision for inner data vector.
+ *
+ * Similar analogy is applicable for VarCharVector where the capacity
+ * of the data buffer can be controlled using density multiplier
+ * instead of default multiplier of 8 (default size of average
+ * varchar length).
+ *
+ * Also from container vectors, we propagate the density down
+ * to the inner vectors so that they can use it appropriately.
+ */
+public interface DensityAwareVector {
 
 Review comment:
   Should this be a sub-interface of `ValueVector`? It feels a bit strange that 
this interface is called `Vector` but only has one method.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386486#comment-16386486
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281889
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/DensityAwareVector.java
 ##
 @@ -0,0 +1,57 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+/**
+ * Vector that support density aware initial capacity settings.
+ * We use this for ListVector and VarCharVector as of now to
+ * control the memory allocated.
+ *
+ * For ListVector, we have been using a multiplier of 5
+ * to compute the initial capacity of the inner data vector.
+ * For deeply nested lists and lists with lots of NULL values,
+ * this is over-allocation upfront. So density helps to be
+ * conservative when computing the value capacity of the
+ * inner vector.
+ *
+ * For example, a density value of 10 implies each position in the
+ * list vector has a list of 10 values. So we will provision
+ * an initial capacity of (valuecount * 10) for the inner vector.
+ * A density value of 0.1 implies out of 10 positions in the list vector,
+ * 1 position has a list of size 1 and remaining positions are
 
 Review comment:
   Sorry, I don't understand this completely. Why does the 1 position that is not 
null have size == 1?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386487#comment-16386487
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281991
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
 ##
 @@ -99,6 +99,17 @@ public void setInitialCapacity(int numRecords) {
 }
   }
 
+  @Override
+  public void setInitialCapacity(int valueCount, double density) {
+    for (final ValueVector vector : (Iterable<ValueVector>) this) {
+      if (vector instanceof DensityAwareVector) {
 
 Review comment:
   I think that's unnecessary delegation, and then there is no purpose in having 
a DensityAwareVector interface, since typically everyone will implement the 
interface -- some will delegate and the rest will implement.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386483#comment-16386483
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281445
 
 

 ##
 File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
 ##
 @@ -1933,15 +1933,6 @@ public void testSetInitialCapacity() {
   vector.allocateNew();
   assertEquals(4096, vector.getValueCapacity());
   assertEquals(64, vector.getDataBuffer().capacity());
-
-  boolean error = false;
-  try {
-vector.setInitialCapacity(5, 0.1);
 
 Review comment:
   added back the test with assertion


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386484#comment-16386484
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281516
 
 

 ##
 File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
 ##
 @@ -810,15 +810,6 @@ public void testSetInitialCapacity() {
   vector.allocateNew();
   assertEquals(512, vector.getValueCapacity());
   assertEquals(8, vector.getDataVector().getValueCapacity());
-
-  boolean error = false;
-  try {
-vector.setInitialCapacity(5, 0.1);
 
 Review comment:
   you are right. added back the test with assertion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386479#comment-16386479
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281168
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java
 ##
 @@ -99,6 +99,17 @@ public void setInitialCapacity(int numRecords) {
 }
   }
 
+  @Override
+  public void setInitialCapacity(int valueCount, double density) {
+    for (final ValueVector vector : (Iterable<ValueVector>) this) {
+  if (vector instanceof DensityAwareVector) {
 
 Review comment:
   What do you think of having all vectors implement
   `setInitialCapacity(valueCount, density)` and delegate to 
`setInitialCapacity(valueCount)`
   instead of case matching here?
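   A minimal sketch of what that delegation alternative could look like 
(illustrative names, not Arrow's API): a default method lets containers call the 
density-aware overload on every child instead of case matching:

{code:java}
// Illustrative only: a default method that ignores density and delegates.
interface CapacityAwareVector {
  void setInitialCapacity(int valueCount);

  // Vectors that do not care about density simply fall back to the plain overload,
  // so a container can call this on each child without an instanceof check.
  default void setInitialCapacity(int valueCount, double density) {
    setInitialCapacity(valueCount);
  }
}
{code}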


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386481#comment-16386481
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172281369
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   The only thing we want to ensure is that if realloc starts with an existing 
capacity of 0, the caller should not run into an infinite loop; that's why we do 
the max and set it to 1 if needed. Adding an assertion sounds fine, but that 
implicitly asks for removing the Math.max(blah, 1) setting. I am suggesting we 
keep the max setting, but I don't have a strong opinion.
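   A tiny self-contained illustration of the failure mode being guarded against 
(not the Arrow code itself): starting from 0, doubling alone never grows, so a 
grow-and-retry loop would spin forever without the clamp:

{code:java}
// Illustrative only: why a zero starting size must be clamped before doubling.
final class ReallocGuardExample {
  static long grow(long currentSize) {
    long doubled = currentSize * 2L;  // 0 * 2 == 0, so doubling alone never escapes zero
    return Math.max(doubled, 1);      // the clamp discussed above breaks the potential infinite loop
  }

  public static void main(String[] args) {
    long size = 0;
    while (size < 16) {               // without the Math.max(..., 1) this loop would never terminate
      size = grow(size);
      System.out.println(size);       // prints 1, 2, 4, 8, 16
    }
  }
}
{code}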


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386472#comment-16386472
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172280459
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Also, I think there are too many such statements. Can we put it in 
`nextPowerOfTwo` or maybe create a new method `nextPowerOfTwoZeroSafe`?
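   Something along these lines, as a sketch: the bit-twiddling below is only a 
stand-in, not BaseAllocator's actual implementation, and the helper name follows 
the suggestion above:

{code:java}
// Illustrative stand-ins; BaseAllocator.nextPowerOfTwo is not reproduced here.
final class AllocationSizeExample {
  static long nextPowerOfTwo(long val) {
    long highestBit = Long.highestOneBit(val);
    return (highestBit == val) ? val : highestBit << 1;  // note: an input of 0 yields 0
  }

  // Folds the repeated Math.max(..., 1) guard into a single helper.
  static long nextPowerOfTwoZeroSafe(long val) {
    return Math.max(nextPowerOfTwo(val), 1);
  }
}
{code}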


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386474#comment-16386474
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172280459
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Also, I think there are too many such statements. Can we put it in 
`nextPowerOfTwo` or maybe create a new method `nextPowerOfTwoZeroSafe`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1388) [Python] Add Table.drop method for removing columns

2018-03-05 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-1388:
--

Assignee: (was: Uwe L. Korn)

> [Python] Add Table.drop method for removing columns
> ---
>
> Key: ARROW-1388
> URL: https://issues.apache.org/jira/browse/ARROW-1388
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See ARROW-1374 for a use case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386461#comment-16386461
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

icexelloss commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172279395
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   Sounds like we have the assumption that newAllocationSize should not be 0 
because setInitialCapacity prevents 0. I think it's better to use an assert to 
validate that assumption. It feels more robust. WDYT?
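   For illustration, the assert-based variant being proposed would look roughly 
like this (a sketch with made-up names, not the PR's code):

{code:java}
// Illustrative only: make the "capacity is never zero" assumption explicit.
final class ReallocAssertExample {
  static long doubleAllocation(long currentSize) {
    long newAllocationSize = currentSize * 2L;
    // Instead of silently clamping with Math.max(newAllocationSize, 1), fail fast
    // (when assertions are enabled) if a zero capacity ever reaches realloc.
    assert newAllocationSize > 0 : "allocation size must be positive, got " + newAllocationSize;
    return newAllocationSize;
  }
}
{code}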


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1388) [Python] Add Table.drop method for removing columns

2018-03-05 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-1388:
---
Fix Version/s: 0.10.0

> [Python] Add Table.drop method for removing columns
> ---
>
> Key: ARROW-1388
> URL: https://issues.apache.org/jira/browse/ARROW-1388
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> See ARROW-1374 for a use case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray

2018-03-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2262:
--
Labels: pull-request-available  (was: )

> [Python] Support slicing on pyarrow.ChunkedArray
> 
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386453#comment-16386453
 ] 

ASF GitHub Bot commented on ARROW-2262:
---

xhochy opened a new pull request #1702: ARROW-2262: [Python] Support slicing on 
pyarrow.ChunkedArray
URL: https://github.com/apache/arrow/pull/1702
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support slicing on pyarrow.ChunkedArray
> 
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386450#comment-16386450
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

siddharthteotia commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172277608
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   BaseAllocator.nextPowerOfTwo returns 0 for an input of 0, which is why we 
initially safeguarded the realloc function to be aware of a 0 initial capacity. 
Now that setInitialCapacity prevents a 0 initial capacity, doing the check in 
realloc may not be absolutely necessary, but I suggest we keep it -- no harm.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray

2018-03-05 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-2262:
--

Assignee: Uwe L. Korn

> [Python] Support slicing on pyarrow.ChunkedArray
> 
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray

2018-03-05 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2262:
--

 Summary: [Python] Support slicing on pyarrow.ChunkedArray
 Key: ARROW-2262
 URL: https://issues.apache.org/jira/browse/ARROW-2262
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Uwe L. Korn
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386416#comment-16386416
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

BryanCutler commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172269530
 
 

 ##
 File path: java/vector/src/main/codegen/templates/UnionVector.java
 ##
 @@ -282,6 +282,7 @@ private void reallocTypeBuffer() {
 
 long newAllocationSize = baseSize * 2L;
 newAllocationSize = BaseAllocator.nextPowerOfTwo(newAllocationSize);
+newAllocationSize = Math.max(newAllocationSize, 1);
 
 Review comment:
   I guess what I mean is: is it possible for it to be negative? If not, then 
since `newAllocationSize` is a long, the only other possibility is 0, and wouldn't 
`nextPowerOfTwo` change the 0 to a 1?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2199) [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386411#comment-16386411
 ] 

ASF GitHub Bot commented on ARROW-2199:
---

BryanCutler commented on a change in pull request #1646: ARROW-2199: [JAVA] 
Control the memory allocated for inner vectors in containers.
URL: https://github.com/apache/arrow/pull/1646#discussion_r172268388
 
 

 ##
 File path: 
java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
 ##
 @@ -810,15 +810,6 @@ public void testSetInitialCapacity() {
   vector.allocateNew();
   assertEquals(512, vector.getValueCapacity());
   assertEquals(8, vector.getDataVector().getValueCapacity());
-
-  boolean error = false;
-  try {
-vector.setInitialCapacity(5, 0.1);
 
 Review comment:
   Yes, that is what I mean. Still call `vector.setInitialCapacity(5, 0.1);` 
and just assert that the capacity is 1 instead of trying to catch the exception.
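   For reference, the reinstated test would look roughly like the sketch below, 
assuming the usual TestListVector setup (a RootAllocator and ListVector.empty); 
the exact assertion in the PR may differ:

{code:java}
import static org.junit.Assert.assertTrue;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.junit.Test;

public class LowDensityInitialCapacityTest {
  @Test
  public void setInitialCapacityWithLowDensity() throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ListVector vector = ListVector.empty("vector", allocator)) {
      vector.setInitialCapacity(5, 0.1);
      vector.allocateNew();
      // 5 * 0.1 truncates to 0, but the density-driven capacity is clamped so it is never < 1.
      assertTrue(vector.getDataVector().getValueCapacity() >= 1);
    }
  }
}
{code}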


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JAVA] Follow up fixes for ARROW-2019. Ensure density driven capacity is 
> never less than 1 and propagate density throughout the vector tree
> ---
>
> Key: ARROW-2199
> URL: https://issues.apache.org/jira/browse/ARROW-2199
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2259) [C++] importing pyarrow segfaults in boost_regex

2018-03-05 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2259.
---
Resolution: Duplicate

dup of ARROW-2247

> [C++] importing pyarrow segfaults in boost_regex
> 
>
> Key: ARROW-2259
> URL: https://issues.apache.org/jira/browse/ARROW-2259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> This is new (started on changeset bfac60dd73bffa5f7bcefc890486268036182278) 
> and seems related to the use of boost_regex. I am building on Ubuntu 16.04 
> with the {{boost-cpp}} package from conda-forge.
> Here is the gdb backtrace:
> {code}
> #0  std::string::_Rep::_M_is_leaked (this=this@entry=0xffe8)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3075
> #1  0x71014856 in std::string::_Rep::_M_grab 
> (this=0xffe8, __alloc1=..., __alloc2=...)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3126
> #2  0x7101489d in std::basic_string std::allocator >::basic_string (this=0x7fffa0e0, __str=...)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:613
> #3  0x70a791fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer::init() ()
>from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #4  0x70ac1803 in 
> boost::object_cache boost::re_detail_106600::cpp_regex_traits_implementation 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, 
> unsigned long) ()
>from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #5  0x70acb62b in boost::basic_regex boost::cpp_regex_traits > >::do_assign(char const*, char const*, 
> unsigned int) () from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #6  0x7182b6cb in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fffa700, 
> p1=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x718a622a "", f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:381
> #7  0x7182b657 in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fffa700, 
> p=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:366
> #8  0x7180e103 in boost::basic_regex boost::cpp_regex_traits > >::basic_regex (this=0x7fffa700, 
> p=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:335
> #9  0x7180b430 in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x71b98bf8 
> , 
> created_by=...) at /home/antoine/parquet-cpp/src/parquet/metadata.cc:452
> #10 0x716bfa21 in __cxx_global_var_init.1(void) () at 
> /home/antoine/parquet-cpp/src/parquet/metadata.cc:35
> #11 0x716bfbfe in _GLOBAL__sub_I_metadata.stdout.fsol.16106.d9N3Ps.ii 
> () from /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libparquet.so.1
> #12 0x77de76ba in call_init (l=, argc=argc@entry=3, 
> argv=argv@entry=0x7fffd7b8, env=env@entry=0x7fffd7d8) at dl-init.c:72
> #13 0x77de77cb in call_init (env=0x7fffd7d8, argv=0x7fffd7b8, 
> argc=3, l=) at dl-init.c:30
> #14 _dl_init (main_map=main_map@entry=0x8acbb0, argc=3, argv=0x7fffd7b8, 
> env=0x7fffd7d8) at dl-init.c:120
> #15 0x77dec8e2 in dl_open_worker (a=a@entry=0x7fffaa90) at 
> dl-open.c:575
> #16 0x77de7564 in _dl_catch_error 
> (objname=objname@entry=0x7fffaa80, 
> errstring=errstring@entry=0x7fffaa88, 
> mallocedp=mallocedp@entry=0x7fffaa7f, 
> operate=operate@entry=0x77dec4d0 , 
> args=args@entry=0x7fffaa90) at dl-error.c:187
> #17 0x77debda9 in _dl_open (file=0x76238ae0 
> 

[jira] [Commented] (ARROW-2261) [GLib] Can't share the same memory in GArrowBuffer safely

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386304#comment-16386304
 ] 

ASF GitHub Bot commented on ARROW-2261:
---

kou opened a new pull request #1701: ARROW-2261: [GLib] Improve memory 
management for GArrowBuffer data
URL: https://github.com/apache/arrow/pull/1701
 
 
   This change introduces GBytes constructors to GArrowBuffer and
   GArrowMutableBuffer. GBytes has a reference counting feature. It means that
   we can share the same memory safely.
   
   We can't share the same memory safely with the current raw guint8
   constructor.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [GLib] Can't share the same memory in GArrowBuffer safely
> -
>
> Key: ARROW-2261
> URL: https://issues.apache.org/jira/browse/ARROW-2261
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2261) [GLib] Can't share the same memory in GArrowBuffer safely

2018-03-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2261:
--
Labels: pull-request-available  (was: )

> [GLib] Can't share the same memory in GArrowBuffer safely
> -
>
> Key: ARROW-2261
> URL: https://issues.apache.org/jira/browse/ARROW-2261
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2259) [C++] importing pyarrow segfaults in boost_regex

2018-03-05 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386302#comment-16386302
 ] 

Wes McKinney commented on ARROW-2259:
-

[~pitrou] this is a duplicate of ARROW-2247, let's investigate there

> [C++] importing pyarrow segfaults in boost_regex
> 
>
> Key: ARROW-2259
> URL: https://issues.apache.org/jira/browse/ARROW-2259
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Major
>
> This is new (started on changeset bfac60dd73bffa5f7bcefc890486268036182278) 
> and seems related to the use of boost_regex. I am building on Ubuntu 16.04 
> with the {{boost-cpp}} package from conda-forge.
> Here is the gdb backtrace:
> {code}
> #0  std::string::_Rep::_M_is_leaked (this=this@entry=0xffe8)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3075
> #1  0x71014856 in std::string::_Rep::_M_grab 
> (this=0xffe8, __alloc1=..., __alloc2=...)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.h:3126
> #2  0x7101489d in std::basic_string std::allocator >::basic_string (this=0x7fffa0e0, __str=...)
> at 
> /home/msarahan/miniconda2/conda-bld/compilers_linux-64_1507259624353/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:613
> #3  0x70a791fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer::init() ()
>from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #4  0x70ac1803 in 
> boost::object_cache boost::re_detail_106600::cpp_regex_traits_implementation 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, 
> unsigned long) ()
>from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #5  0x70acb62b in boost::basic_regex boost::cpp_regex_traits > >::do_assign(char const*, char const*, 
> unsigned int) () from 
> /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libboost_regex.so.1.66.0
> #6  0x7182b6cb in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fffa700, 
> p1=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  p2=0x718a622a "", f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:381
> #7  0x7182b657 in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fffa700, 
> p=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:366
> #8  0x7180e103 in boost::basic_regex boost::cpp_regex_traits > >::basic_regex (this=0x7fffa700, 
> p=0x718a61e2 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at 
> /home/antoine/miniconda3/envs/pyarrow/include/boost/regex/v4/basic_regex.hpp:335
> #9  0x7180b430 in parquet::ApplicationVersion::ApplicationVersion 
> (this=0x71b98bf8 
> , 
> created_by=...) at /home/antoine/parquet-cpp/src/parquet/metadata.cc:452
> #10 0x716bfa21 in __cxx_global_var_init.1(void) () at 
> /home/antoine/parquet-cpp/src/parquet/metadata.cc:35
> #11 0x716bfbfe in _GLOBAL__sub_I_metadata.stdout.fsol.16106.d9N3Ps.ii 
> () from /home/antoine/miniconda3/envs/pyarrow/bin/../lib/libparquet.so.1
> #12 0x77de76ba in call_init (l=, argc=argc@entry=3, 
> argv=argv@entry=0x7fffd7b8, env=env@entry=0x7fffd7d8) at dl-init.c:72
> #13 0x77de77cb in call_init (env=0x7fffd7d8, argv=0x7fffd7b8, 
> argc=3, l=) at dl-init.c:30
> #14 _dl_init (main_map=main_map@entry=0x8acbb0, argc=3, argv=0x7fffd7b8, 
> env=0x7fffd7d8) at dl-init.c:120
> #15 0x77dec8e2 in dl_open_worker (a=a@entry=0x7fffaa90) at 
> dl-open.c:575
> #16 0x77de7564 in _dl_catch_error 
> (objname=objname@entry=0x7fffaa80, 
> errstring=errstring@entry=0x7fffaa88, 
> mallocedp=mallocedp@entry=0x7fffaa7f, 
> operate=operate@entry=0x77dec4d0 , 
> args=args@entry=0x7fffaa90) at dl-error.c:187
> #17 0x77debda9 in _dl_open 

[jira] [Created] (ARROW-2261) [GLib] Can't share the same memory in GArrowBuffer safely

2018-03-05 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2261:
---

 Summary: [GLib] Can't share the same memory in GArrowBuffer safely
 Key: ARROW-2261
 URL: https://issues.apache.org/jira/browse/ARROW-2261
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Affects Versions: 0.8.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2247) [Python] Statically-linking boost_regex in both libarrow and libparquet results in segfault

2018-03-05 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386277#comment-16386277
 ] 

Wes McKinney commented on ARROW-2247:
-

At a glance, it looks like a great deal of work. We should collectively assess 
what's the best way to build a more cohesive / productive build system for the 
projects. We also have the matter of the ASF release process

> [Python] Statically-linking boost_regex in both libarrow and libparquet 
> results in segfault
> ---
>
> Key: ARROW-2247
> URL: https://issues.apache.org/jira/browse/ARROW-2247
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
>
> This is a backtrace loading {{libparquet.so}} on Ubuntu 14.04 using boost 
> 1.66.1 from conda-forge. Both libarrow and libparquet contain {{boost_regex}} 
> statically linked. 
> {code}
> In [1]: import ctypes
> In [2]: ctypes.CDLL('libparquet.so')
> Program received signal SIGSEGV, Segmentation fault.
> 0x7fffed4ad3fb in std::basic_string std::allocator >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> (gdb) bt
> #0  0x7fffed4ad3fb in std::basic_string std::allocator >::basic_string(std::string const&) () from 
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #1  0x7fffed74c1fc in 
> boost::re_detail_106600::cpp_regex_traits_char_layer::init() ()
>from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #2  0x7fffed794803 in 
> boost::object_cache boost::re_detail_106600::cpp_regex_traits_implementation 
> >::do_get(boost::re_detail_106600::cpp_regex_traits_base const&, 
> unsigned long) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #3  0x7fffed79e62b in boost::basic_regex boost::cpp_regex_traits > >::do_assign(char const*, char const*, 
> unsigned int) () from /home/wesm/cpp-toolchain/lib/libboost_regex.so.1.66.0
> #4  0x7fffee58561b in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fff3780, 
> p1=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  
> p2=0x7fffee60064a "", f=0) at 
> /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:381
> #5  0x7fffee5855a7 in boost::basic_regex boost::cpp_regex_traits > >::assign (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:366
> #6  0x7fffee5683f3 in boost::basic_regex boost::cpp_regex_traits > >::basic_regex (this=0x7fff3780, 
> p=0x7fffee600602 
> "(.*?)\\s*(?:(version\\s*(?:([^(]*?)\\s*(?:\\(\\s*build\\s*([^)]*?)\\s*\\))?)?)?)",
>  f=0)
> at /home/wesm/cpp-toolchain/include/boost/regex/v4/basic_regex.hpp:335
> #7  0x7fffee5656d0 in parquet::ApplicationVersion::ApplicationVersion (
> Python Exception  There is no member named _M_dataplus.: 
> this=0x7fffee8f1fb8 
> , created_by=)
> at ../src/parquet/metadata.cc:452
> #8  0x7fffee41c271 in __cxx_global_var_init.1(void) () at 
> ../src/parquet/metadata.cc:35
> #9  0x7fffee41c44e in _GLOBAL__sub_I_metadata.tmp.wesm_desktop.4838.ii ()
>from /home/wesm/local/lib/libparquet.so
> #10 0x77dea1da in call_init (l=, argc=argc@entry=2, 
> argv=argv@entry=0x7fff5d88, 
> env=env@entry=0x7fff5da0) at dl-init.c:78
> #11 0x77dea2c3 in call_init (env=, argv= out>, argc=, 
> l=) at dl-init.c:36
> #12 _dl_init (main_map=main_map@entry=0x13fb220, argc=2, argv=0x7fff5d88, 
> env=0x7fff5da0)
> at dl-init.c:126
> {code}
> This seems to be caused by static initializations in libparquet:
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/metadata.cc#L34
> We should see if removing these static initializations makes the problem go 
> away. If not, then statically-linking boost_regex in both libraries is not 
> advisable.
> For this reason and more, I really wish that Arrow and Parquet shared a 
> common build system and monorepo structure -- it would make handling these 
> toolchain and build-related issues much simpler. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386274#comment-16386274
 ] 

Wes McKinney commented on ARROW-2122:
-

Ah, sounds like we need a convention to handle FixedOffset in Arrow so that 
things can be properly coerced going in and out of Python

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-03-05 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386267#comment-16386267
 ] 

Wes McKinney commented on ARROW-2195:
-

Right, we probably need to make Plasma buffers work like the {{MemoryMap}} in 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L593, which 
manages the lifetime of the mapped memory created by the {{MemoryMappedFile}} 
class.

> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> Key: ARROW-2195
> URL: https://issues.apache.org/jira/browse/ARROW-2195
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.9.0
>
>
> It can be reproduced with the following script:
> {code:python}
> import pyarrow as pa
> import pyarrow.plasma as plasma
> def retrieve1():
> client = plasma.connect('test', "", 0)
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> [buff] = client .get_buffers([pid])
> batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> print(batch)
> print(batch.schema)
> print(batch[0])
> return batch
> client = plasma.connect('test', "", 0)
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
> bff = client.create(pid, sink.size())
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
> {code}
>  
> Preliminary backtrace:
>  
> {code}
> CESS (code=1, address=0x38158)
>     frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
> PyInt_FromLong
>     0x10e645805 <+37>: testq  %rax, %rax
>     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
> (lldb) bt
>  * thread #1: tid = 0xf1378e, 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
> queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
> address=0x38158)
>   * frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>     frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 
> 133
>     frame #2: 0x00010e613b25 
> lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>     frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>     frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-03-05 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386262#comment-16386262
 ] 

Antoine Pitrou commented on ARROW-2195:
---

That would indeed solve the ForeignBuffer issue, but not the Plasma issue, 
right? That is, the solution only applies when the buffer is created from 
Python...

> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> Key: ARROW-2195
> URL: https://issues.apache.org/jira/browse/ARROW-2195
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.9.0
>
>
> It can be reproduced with the following script:
> {code:python}
> import pyarrow as pa
> import pyarrow.plasma as plasma
> def retrieve1():
> client = plasma.connect('test', "", 0)
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> [buff] = client .get_buffers([pid])
> batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> print(batch)
> print(batch.schema)
> print(batch[0])
> return batch
> client = plasma.connect('test', "", 0)
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
> bff = client.create(pid, sink.size())
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
> {code}
>  
> Preliminary backtrace:
>  
> {code}
> CESS (code=1, address=0x38158)
>     frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
> PyInt_FromLong
>     0x10e645805 <+37>: testq  %rax, %rax
>     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
> (lldb) bt
>  * thread #1: tid = 0xf1378e, 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
> queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
> address=0x38158)
>   * frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>     frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 
> 133
>     frame #2: 0x00010e613b25 
> lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>     frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>     frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2254) [Python] Local in-place dev versions picking up JS tags

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386249#comment-16386249
 ] 

ASF GitHub Bot commented on ARROW-2254:
---

wesm closed pull request #1699: ARROW-2254: [Python] Ignore JS tags in local 
dev versions
URL: https://github.com/apache/arrow/pull/1699
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py
index 8cb4b3b9b..28ac98ea0 100644
--- a/python/pyarrow/__init__.py
+++ b/python/pyarrow/__init__.py
@@ -23,8 +23,22 @@
 except DistributionNotFound:
     # package is not installed
     try:
+        # This code is duplicated from setup.py to avoid a dependency on each
+        # other.
+        def parse_version(root):
+            from setuptools_scm import version_from_scm
+            import setuptools_scm.git
+            describe = setuptools_scm.git.DEFAULT_DESCRIBE + " --match 'apache-arrow-[0-9]*'"
+            # Strip catchall from the commandline
+            describe = describe.replace("--match *.*", "")
+            version = setuptools_scm.git.parse(root, describe)
+            if not version:
+                return version_from_scm(root)
+            else:
+                return version
+
         import setuptools_scm
-        __version__ = setuptools_scm.get_version('../')
+        __version__ = setuptools_scm.get_version('../', parse=parse_version)
     except (ImportError, LookupError):
         __version__ = None
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Local in-place dev versions picking up JS tags
> ---
>
> Key: ARROW-2254
> URL: https://issues.apache.org/jira/browse/ARROW-2254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I thought we had fixed this bug, but it's back:
> {code}
> $ ipython
> Python 3.5.2 | packaged by conda-forge | (default, Jul 26 2016, 01:32:08) 
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
> In [1]: pa.__version__
> Out[1]: '0.3.1.dev52+g8b1c8118'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

