[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2017-04-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963751#comment-15963751
 ] 

Kazuaki Ishizaki commented on ARROW-300:


Apache Spark currently supports [the following compression 
schemes|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala#L66]
 for in-memory columnar storage. At present, compressed in-memory columnar 
storage is used when the DataFrame.cache or Dataset.cache method is executed.
Would it be possible to support these schemes in addition to 
LZ4 and the current DictionaryEncoding?

* RunLengthEncoding: Generic run-length encoding (e.g. 1,1,1,2,2,2,2 -> [3, 1], 
[4, 2])
* IntDelta: Represents a sequence as a base value followed by byte-sized deltas 
from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
* LongDelta: Represents a sequence as a base value followed by byte-sized deltas 
from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
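
To make the examples in the list above concrete, here is a minimal Python sketch of run-length and delta encoding. It is not Spark's implementation: the function names are made up for illustration, and it ignores the byte-width limit on deltas that the IntDelta and LongDelta schemes have to handle.

```python
from itertools import accumulate, groupby


def run_length_encode(values):
    """Collapse each run of equal values into a [run_length, value] pair."""
    return [[len(list(run)), value] for value, run in groupby(values)]


def delta_encode(values):
    """Keep the first value as the base, then store each difference from the previous value."""
    values = list(values)
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original sequence with a running sum over the base and the deltas."""
    return list(accumulate(encoded))
```

Both encodings pay off only when the data has long runs or small steps between neighbours, which is why a column store would normally pick a scheme per column rather than globally.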


> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>
> It may be useful if data is to be sent over the wire to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?
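
The proposal quoted above (one global compression setting recorded in the footer, applied to every buffer) could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the `compression` footer field, the function names, and the dict-based footer are not part of the Arrow format. Only zlib is shown because it ships with the Python standard library; lz4 would slot in the same way.

```python
import zlib

# Hypothetical codec names for the single file-wide setting; None means uncompressed.
SUPPORTED_CODECS = {None, "zlib"}


def write_buffers(buffers, codec=None):
    """Return (footer, bodies): the codec is recorded once and applied to every buffer."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    bodies = [zlib.compress(b) if codec == "zlib" else bytes(b) for b in buffers]
    footer = {"compression": codec}  # hypothetical footer field
    return footer, bodies


def read_buffers(footer, bodies):
    """Undo write_buffers using only the codec named in the footer."""
    codec = footer["compression"]
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec!r}")
    return [zlib.decompress(b) if codec == "zlib" else bytes(b) for b in bodies]
```

Because the setting is global, a reader needs no per-buffer metadata to decide how to decode a buffer; the trade-off is that buffers that compress poorly still go through the codec.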



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (ARROW-779) [C++/Python] Raise exception if old metadata encountered

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-779.

Resolution: Fixed

Issue resolved by pull request 507
[https://github.com/apache/arrow/pull/507]

> [C++/Python] Raise exception if old metadata encountered
> 
>
> Key: ARROW-779
> URL: https://issues.apache.org/jira/browse/ARROW-779
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> For the moment, we intend for Arrow users to develop head-to-head, i.e. old 
> metadata will not be supported. This will help prevent issues caused by 
> upgrading one component (e.g. pyarrow) but not another (e.g. the Java JARs)
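
As a rough illustration of the behaviour described above, a reader can fail fast when it sees a metadata version older than the one it was built against. This Python sketch is not the code from the pull request: the integer version numbers, the constant, and the exception name are all hypothetical.

```python
CURRENT_METADATA_VERSION = 3  # hypothetical value, for illustration only


class OldMetadataError(Exception):
    """Raised when a message was written with an older metadata version."""


def check_metadata_version(found, current=CURRENT_METADATA_VERSION):
    """Return the version unchanged, or raise if it is older than the supported one."""
    if found < current:
        raise OldMetadataError(
            f"metadata version {found} is older than supported version {current}; "
            "upgrade the component that wrote this data"
        )
    return found
```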





[jira] [Commented] (ARROW-801) [JAVA] Provide direct access to underlying buffer memory addresses in consistent way without generating garbage or large amount indirections

2017-04-10 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963281#comment-15963281
 ] 

Julien Le Dem commented on ARROW-801:
-

Sounds good.
Rather than adding them at the FieldVector level, it sounds like they should be 
on the vectors that support them. We can have base classes or interfaces for 
FixedWidthVector, VariableWidthVector, ...
Then we define what each one supports.

> [JAVA] Provide direct access to underlying buffer memory addresses in 
> consistent way without generating garbage or large amount indirections
> 
>
> Key: ARROW-801
> URL: https://issues.apache.org/jira/browse/ARROW-801
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Jacques Nadeau
>
> When working with Arrow vectors recently, we observed a situation where our 
> time was dominated by calls to getFieldBuffers() to be able to retrieve 
> memory addresses (22s out of 26s total for a piece of code). We should 
> provide a direct mechanism to access this data so we can avoid all the extra 
> indirection and object creation. 
> A proposal:
> getBitAddress();
> getDataAddress();
> getOffsetAddress();
> These interfaces would be made available at the FieldVector interface and 
> simply throw UnsupportedOperationException where not supported.
> Unsupported Operations: 
> data for list type
> offset for fixed width types
> data and offset for struct type
> data for union type





[jira] [Created] (ARROW-801) [JAVA] Provide direct access to underlying buffer memory addresses in consistent way without generating garbage or large amount indirections

2017-04-10 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created ARROW-801:


 Summary: [JAVA] Provide direct access to underlying buffer memory 
addresses in consistent way without generating garbage or large amount 
indirections
 Key: ARROW-801
 URL: https://issues.apache.org/jira/browse/ARROW-801
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Jacques Nadeau


When working with Arrow vectors recently, we observed a situation where our 
time was dominated by calls to getFieldBuffers() to be able to retrieve memory 
addresses (22s out of 26s total for a piece of code). We should provide a 
direct mechanism to access this data so we can avoid all the extra indirection 
and object creation. 

A proposal:
getBitAddress();
getDataAddress();
getOffsetAddress();

These interfaces would be made available at the FieldVector interface and 
simply throw UnsupportedOperationException where not supported.

Unsupported Operations: 
data for list type
offset for fixed width types
data and offset for struct type
data for union type
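
The rule set above can be modelled in a few lines. The sketch below is a language-neutral model written in Python, not the Java API: the class, the exception, and the vector kind labels are hypothetical, and the addresses are plain integers. It only shows the "throw where not supported" behaviour driven by the list of unsupported operations.

```python
class UnsupportedOperation(Exception):
    """Stand-in for Java's UnsupportedOperationException."""


# Buffers each vector kind does NOT expose, taken from the list above.
UNSUPPORTED = {
    "list": {"data"},
    "fixed_width": {"offset"},
    "struct": {"data", "offset"},
    "union": {"data"},
}


class VectorAddresses:
    """Holds raw buffer addresses for one vector; kind names are hypothetical labels."""

    def __init__(self, kind, bit=0, data=0, offset=0):
        self.kind = kind
        self._addresses = {"bit": bit, "data": data, "offset": offset}

    def _get(self, buffer_name):
        # Look up the address directly, with no intermediate objects created.
        if buffer_name in UNSUPPORTED.get(self.kind, set()):
            raise UnsupportedOperation(
                f"{buffer_name} address is not available for {self.kind} vectors"
            )
        return self._addresses[buffer_name]

    def get_bit_address(self):
        return self._get("bit")

    def get_data_address(self):
        return self._get("data")

    def get_offset_address(self):
        return self._get("offset")
```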





[jira] [Commented] (ARROW-725) [Format] Constant length list type

2017-04-10 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963154#comment-15963154
 ] 

Wes McKinney commented on ARROW-725:


Sure thing, just added the link

> [Format] Constant length list type
> --
>
> Key: ARROW-725
> URL: https://issues.apache.org/jira/browse/ARROW-725
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Brian Hulette
>Assignee: Emilio Lahr-Vivaz
>Priority: Trivial
>
> It makes sense to store some data in a row-based format. For example, a 
> position might be stored as two or three coordinates per row, and all of them 
> will almost always be accessed simultaneously. Currently, Arrow must store 
> these as two or three separate vectors, but cache performance could 
> potentially be improved if every coordinate for a given row were in the same 
> location in memory.
> The List type could satisfy this requirement, but it requires an additional 
> offset vector which isn't necessary when every element is the same size. I 
> think it would be helpful to define a new type that is essentially a List 
> with every element having the same length. I think "Tuple" would be a natural 
> fit for this type but I'm open to other suggestions.
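
The difference between the two layouts discussed above can be sketched in Python. This is only an illustration using plain Python lists; the function names are hypothetical and nothing here reflects the final format. With a variable-length List, each row is located through the offsets vector, while a constant-length list can compute the position from the row index alone.

```python
def list_slot(values, offsets, i):
    """Variable-length List: row i spans values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def fixed_size_list_slot(values, list_size, i):
    """Constant-length list: row i starts at i * list_size, so no offsets are needed."""
    start = i * list_size
    return values[start:start + list_size]


# Two 3D positions stored back to back in a single values buffer.
positions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
second_via_offsets = list_slot(positions, [0, 3, 6], 1)
second_via_size = fixed_size_list_slot(positions, 3, 1)
```

Both calls return the same coordinates for the second row; the constant-length version simply drops the offsets buffer and the extra lookup.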





[jira] [Commented] (ARROW-725) [Format] Constant length list type

2017-04-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963146#comment-15963146
 ] 

Emilio Lahr-Vivaz commented on ARROW-725:
-

I'd like to get this into the 0.3 release if possible - can I add it as a 
blocker?

> [Format] Constant length list type
> --
>
> Key: ARROW-725
> URL: https://issues.apache.org/jira/browse/ARROW-725
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Brian Hulette
>Assignee: Emilio Lahr-Vivaz
>Priority: Trivial
>





[jira] [Resolved] (ARROW-741) [Python] Add Python 3.6 to Travis CI

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-741.

Resolution: Fixed

Issue resolved by pull request 514
[https://github.com/apache/arrow/pull/514]

> [Python] Add Python 3.6 to Travis CI
> 
>
> Key: ARROW-741
> URL: https://issues.apache.org/jira/browse/ARROW-741
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> We need to make sure the next release of PyArrow works on Python 3.6





[jira] [Resolved] (ARROW-761) [Python] Add function to compute the total size of tensor payloads, including metadata and padding

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-761.

Resolution: Fixed

Issue resolved by pull request 521
[https://github.com/apache/arrow/pull/521]

> [Python] Add function to compute the total size of tensor payloads, including 
> metadata and padding
> --
>
> Key: ARROW-761
> URL: https://issues.apache.org/jira/browse/ARROW-761
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> This will be useful for ensuring that memory maps have enough space available 
> to write out a data structure
> cc [~pcmoritz] [~robertnishihara]
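
The size calculation described above comes down to rounding each part up to an alignment boundary and summing. The Python sketch below shows only that arithmetic; it is not the function added by the pull request, and the 8-byte default alignment and the function names are assumptions made for illustration.

```python
def padded_length(nbytes, alignment=8):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def tensor_payload_size(metadata_nbytes, data_nbytes, alignment=8):
    """Total bytes to reserve: padded metadata followed by padded tensor data."""
    return padded_length(metadata_nbytes, alignment) + padded_length(data_nbytes, alignment)
```

A caller could use the result to size a memory map before writing, so the write never runs past the end of the mapped region.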





[jira] [Resolved] (ARROW-796) [Java] Checkstyle additions causing build failure in some environments

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-796.

Resolution: Not A Problem
  Assignee: Wes McKinney

IntelliJ shipped with Maven 3.0.x, so I'm closing this as a non-issue. If others 
run into this problem we can refer them here (or they'll find this through a 
Google search).

> [Java] Checkstyle additions causing build failure in some environments
> --
>
> Key: ARROW-796
> URL: https://issues.apache.org/jira/browse/ARROW-796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> Even after the conflict was fixed in ARROW-677, I'm running into build problems:
> {code}
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible 
> with [1.6, 1.7]
> SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further 
> details.
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Apache Arrow Java Root POM  FAILURE [0.586s]
> [INFO] Arrow Format .. SKIPPED
> [INFO] Arrow Memory .. SKIPPED
> [INFO] Arrow Vectors . SKIPPED
> [INFO] Arrow Tools ... SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 0.742s
> [INFO] Finished at: Sat Apr 08 17:11:40 EDT 2017
> [INFO] Final Memory: 20M/633M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on 
> project arrow-java-root: Execution validate of goal 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check failed: An API 
> incompatibility was encountered while executing 
> org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check: 
> java.lang.AbstractMethodError: 
> org.slf4j.impl.JDK14LoggerAdapter.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> [ERROR] -
> [ERROR] realm =
> plugin>org.apache.maven.plugins:maven-checkstyle-plugin:2.17
> {code}
> If I remove the checkstyle plugin from the root pom.xml, everything is OK





[jira] [Commented] (ARROW-796) [Java] Checkstyle additions causing build failure in some environments

2017-04-10 Thread Emilio Lahr-Vivaz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962875#comment-15962875
 ] 

Emilio Lahr-Vivaz commented on ARROW-796:
-

Yeah, I've been using 3.3.9 and didn't have a problem. I guess the readme does 
say 'maven 3.3 or later'. What version of Maven was IntelliJ using?

> [Java] Checkstyle additions causing build failure in some environments
> --
>
> Key: ARROW-796
> URL: https://issues.apache.org/jira/browse/ARROW-796
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Wes McKinney
> Fix For: 0.3.0
>
>





[jira] [Resolved] (ARROW-782) [C++] Change struct to class for objects that meet the criteria in the Google style guide

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-782.

Resolution: Fixed

Issue resolved by pull request 520
[https://github.com/apache/arrow/pull/520]

> [C++] Change struct to class for objects that meet the criteria in the Google 
> style guide
> -
>
> Key: ARROW-782
> URL: https://issues.apache.org/jira/browse/ARROW-782
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> See https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes. I 
> have suspected that the types in {{type.h}} should be classes, but this 
> suggests it pretty strongly. It would be better to address this sooner rather 
> than later. 
> We should also expose members through accessor functions instead of bare 
> attributes, e.g. {{type->id()}} instead of {{type->type}}





[jira] [Resolved] (ARROW-795) [C++] Combine libarrow/libarrow_io/libarrow_ipc

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-795.

Resolution: Fixed

Issue resolved by pull request 516
[https://github.com/apache/arrow/pull/516]

> [C++] Combine libarrow/libarrow_io/libarrow_ipc
> ---
>
> Key: ARROW-795
> URL: https://issues.apache.org/jira/browse/ARROW-795
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
> Judging from initial third-party users, it seems likely that users will link 
> to all of these (or at least libarrow/libarrow_io) or none of them. It may be 
> simpler for third parties if these core libraries are a single link target 
> instead of multiple.





[jira] [Resolved] (ARROW-526) [Format] Update IPC.md to account for File format changes and Streaming format

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-526.

Resolution: Fixed

Issue resolved by pull request 515
[https://github.com/apache/arrow/pull/515]

> [Format] Update IPC.md to account for File format changes and Streaming format
> --
>
> Key: ARROW-526
> URL: https://issues.apache.org/jira/browse/ARROW-526
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>






[jira] [Resolved] (ARROW-794) [C++] Check whether data is contiguous in ipc::WriteTensor

2017-04-10 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-794.

Resolution: Fixed

Issue resolved by pull request 519
[https://github.com/apache/arrow/pull/519]

> [C++] Check whether data is contiguous in ipc::WriteTensor
> --
>
> Key: ARROW-794
> URL: https://issues.apache.org/jira/browse/ARROW-794
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.3.0
>
>
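
The issue has no description, so as a hedged illustration of what a contiguity check involves, here is a Python sketch that compares a tensor's strides against the strides a densely packed row-major layout would have. It is not the C++ code from the pull request; the function name and the row-major assumption are mine.

```python
def is_row_major_contiguous(shape, strides, itemsize):
    """True if strides (in bytes) match a densely packed row-major layout."""
    expected = itemsize
    # Walk from the innermost dimension outwards, growing the expected stride.
    for dim, stride in zip(reversed(shape), reversed(strides)):
        # A dimension of length 1 is never stepped over, so its stride is irrelevant.
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True
```

A writer that copies the data buffer as one block would need this to hold; a strided or transposed view would otherwise be written with the wrong element order.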



