[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963751#comment-15963751 ]

Kazuaki Ishizaki commented on ARROW-300:

Apache Spark currently supports [the following compression schemes|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/compression/CompressionScheme.scala#L66] for in-memory columnar storage. Compressed in-memory columnar storage is used when the DataFrame.cache or Dataset.cache method is executed. Would it be possible to support these schemes in addition to LZ4 and the current DictionaryEncoding?
* RunLengthEncoding: Generic run-length encoding (e.g. 1,1,1,2,2,2,2 -> [3, 1], [4, 2])
* IntDelta: Represents a sequence as a base value followed by byte deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])
* LongDelta: Represents a sequence as a base value followed by byte deltas from the previous value (e.g. 1,3,5,7,10 -> [1, 2, 2, 2, 3])

> [Format] Add buffer compression option to IPC file format
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
>
> It may be useful if data is to be sent over the wire to compress the data
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer
> compression setting in the file Footer. Probably only two compressors worth
> supporting out of the box would be zlib (higher compression ratios) and lz4
> (better performance).
> What does everyone think?

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
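A rough Python sketch of the run-length and delta schemes listed in the ARROW-300 comment above. The function names are hypothetical and are not Spark or Arrow APIs. The delta sketch also ignores the byte-width limit on deltas that the comment mentions, so it illustrates only the encoding idea, not Spark's actual implementation.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def delta_encode(values: List[int]) -> List[int]:
    """Return the base value followed by each value's difference from its predecessor."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded


def delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the deltas onto the base value."""
    decoded: List[int] = []
    running = 0
    for index, item in enumerate(encoded):
        running = item if index == 0 else running + item
        decoded.append(running)
    return decoded
```

With the inputs from the comment, run_length_encode([1, 1, 1, 2, 2, 2, 2]) gives the pairs (3, 1) and (4, 2), and delta_encode([1, 3, 5, 7, 10]) gives [1, 2, 2, 2, 3].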
[jira] [Resolved] (ARROW-779) [C++/Python] Raise exception if old metadata encountered
[ https://issues.apache.org/jira/browse/ARROW-779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-779.

Resolution: Fixed

Issue resolved by pull request 507
[https://github.com/apache/arrow/pull/507]

> [C++/Python] Raise exception if old metadata encountered
>
> Key: ARROW-779
> URL: https://issues.apache.org/jira/browse/ARROW-779
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> For the moment, we intend for Arrow users to develop head-to-head, i.e. old
> metadata will not be supported. This will help prevent issues caused by
> upgrading one component (e.g. pyarrow) but not another (e.g. the Java JARs)
[jira] [Commented] (ARROW-801) [JAVA] Provide direct access to underlying buffer memory addresses in consistent way without generating garbage or large amount indirections
[ https://issues.apache.org/jira/browse/ARROW-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963281#comment-15963281 ]

Julien Le Dem commented on ARROW-801:

Sounds good. Rather than adding them at the FieldVector level, it sounds like they should be on the vectors that support them. We can have base classes or interfaces for FixWidthVector, VariableWidthVector, ... And then we define what each supports.

> [JAVA] Provide direct access to underlying buffer memory addresses in
> consistent way without generating garbage or large amount indirections
>
> Key: ARROW-801
> URL: https://issues.apache.org/jira/browse/ARROW-801
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java - Vectors
> Reporter: Jacques Nadeau
>
> When working with Arrow vectors recently, we observed a situation where our
> time was dominated by calls to getFieldBuffers() to be able to retrieve
> memory addresses (22s out of 26s total for a piece of code). We should
> provide a direct mechanism to access this data so we can avoid all the extra
> indirection and object creation.
> A proposal:
> getBitAddress();
> getDataAddress();
> getOffsetAddress();
> These interfaces would be made available at the FieldVector interface and
> simply throw UnsupportedOperationException where not supported.
> Unsupported Operations:
> data for list type
> offset for fixed width types
> data and offset for struct type
> data for union type
[jira] [Created] (ARROW-801) [JAVA] Provide direct access to underlying buffer memory addresses in consistent way without generating garbage or large amount indirections
Jacques Nadeau created ARROW-801:

Summary: [JAVA] Provide direct access to underlying buffer memory addresses in consistent way without generating garbage or large amount indirections
Key: ARROW-801
URL: https://issues.apache.org/jira/browse/ARROW-801
Project: Apache Arrow
Issue Type: Bug
Components: Java - Vectors
Reporter: Jacques Nadeau

When working with Arrow vectors recently, we observed a situation where our time was dominated by calls to getFieldBuffers() to be able to retrieve memory addresses (22s out of 26s total for a piece of code). We should provide a direct mechanism to access this data so we can avoid all the extra indirection and object creation.

A proposal:
getBitAddress();
getDataAddress();
getOffsetAddress();

These interfaces would be made available at the FieldVector interface and simply throw UnsupportedOperationException where not supported.

Unsupported Operations:
data for list type
offset for fixed width types
data and offset for struct type
data for union type
[jira] [Commented] (ARROW-725) [Format] Constant length list type
[ https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963154#comment-15963154 ]

Wes McKinney commented on ARROW-725:

Sure thing, just added the link

> [Format] Constant length list type
>
> Key: ARROW-725
> URL: https://issues.apache.org/jira/browse/ARROW-725
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Brian Hulette
> Assignee: Emilio Lahr-Vivaz
> Priority: Trivial
>
> It makes sense to store some data in a row-based format. For example, a
> position might be stored as two or three coordinates per row, and all of them
> will almost always be accessed simultaneously. Currently, arrow must store
> these as two or three separate vectors, but cache performance could
> potentially be improved if every coordinate for a given row were in the same
> location in memory.
> The List type could satisfy this requirement, but it requires an additional
> offset vector which isn't necessary when every element is the same size. I
> think it would be helpful to define a new type that is essentially a List
> with every element having the same length. I think "Tuple" would be a natural
> fit for this type but I'm open to other suggestions.
[jira] [Commented] (ARROW-725) [Format] Constant length list type
[ https://issues.apache.org/jira/browse/ARROW-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963146#comment-15963146 ]

Emilio Lahr-Vivaz commented on ARROW-725:

I'd like to get this into the 0.3 release if possible - can I add it as a blocker?

> [Format] Constant length list type
>
> Key: ARROW-725
> URL: https://issues.apache.org/jira/browse/ARROW-725
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Brian Hulette
> Assignee: Emilio Lahr-Vivaz
> Priority: Trivial
>
> It makes sense to store some data in a row-based format. For example, a
> position might be stored as two or three coordinates per row, and all of them
> will almost always be accessed simultaneously. Currently, arrow must store
> these as two or three separate vectors, but cache performance could
> potentially be improved if every coordinate for a given row were in the same
> location in memory.
> The List type could satisfy this requirement, but it requires an additional
> offset vector which isn't necessary when every element is the same size. I
> think it would be helpful to define a new type that is essentially a List
> with every element having the same length. I think "Tuple" would be a natural
> fit for this type but I'm open to other suggestions.
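To illustrate the ARROW-725 description, here is a minimal Python sketch contrasting a List-style layout, which needs an offsets vector, with a constant-length layout, where the start of each slot follows from the slot index alone. The function names and the flat Python lists are assumptions made for illustration; they are not the Arrow API or its actual memory layout.

```python
from typing import List


def variable_list_slot(values: List[float], offsets: List[int], index: int) -> List[float]:
    """List-style access: slot boundaries are looked up in a separate offsets vector."""
    return values[offsets[index]:offsets[index + 1]]


def fixed_list_slot(values: List[float], list_size: int, index: int) -> List[float]:
    """Constant-length access: slot boundaries are computed as index * list_size."""
    start = index * list_size
    return values[start:start + list_size]


# Three 2-D positions stored contiguously, one row after another.
positions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Offsets are only needed by the List-style layout.
position_offsets = [0, 2, 4, 6]
```

Both variable_list_slot(positions, position_offsets, 1) and fixed_list_slot(positions, 2, 1) return the second position, but only the first needs the extra offsets vector.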
[jira] [Resolved] (ARROW-741) [Python] Add Python 3.6 to Travis CI
[ https://issues.apache.org/jira/browse/ARROW-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-741.

Resolution: Fixed

Issue resolved by pull request 514
[https://github.com/apache/arrow/pull/514]

> [Python] Add Python 3.6 to Travis CI
>
> Key: ARROW-741
> URL: https://issues.apache.org/jira/browse/ARROW-741
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> We need to make sure the next release of PyArrow works on Python 3.6
[jira] [Resolved] (ARROW-761) [Python] Add function to compute the total size of tensor payloads, including metadata and padding
[ https://issues.apache.org/jira/browse/ARROW-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-761.

Resolution: Fixed

Issue resolved by pull request 521
[https://github.com/apache/arrow/pull/521]

> [Python] Add function to compute the total size of tensor payloads, including
> metadata and padding
>
> Key: ARROW-761
> URL: https://issues.apache.org/jira/browse/ARROW-761
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> This will be useful for ensuring that memory maps have enough space available
> to write out a data structure
> cc [~pcmoritz] [~robertnishihara]
[jira] [Resolved] (ARROW-796) [Java] Checkstyle additions causing build failure in some environments
[ https://issues.apache.org/jira/browse/ARROW-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-796.

Resolution: Not A Problem
Assignee: Wes McKinney

IntelliJ shipped with Maven 3.0.x, so closing this as a non-issue. If others run into this problem we can refer them here (or they'll find this on Google search)

> [Java] Checkstyle additions causing build failure in some environments
>
> Key: ARROW-796
> URL: https://issues.apache.org/jira/browse/ARROW-796
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java - Vectors
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> Even after the conflict fixed in ARROW-677, I'm running into build problems:
> {code}
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
> SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.
> [INFO]
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache Arrow Java Root POM FAILURE [0.586s]
> [INFO] Arrow Format .. SKIPPED
> [INFO] Arrow Memory .. SKIPPED
> [INFO] Arrow Vectors . SKIPPED
> [INFO] Arrow Tools ... SKIPPED
> [INFO]
> [INFO] BUILD FAILURE
> [INFO]
> [INFO] Total time: 0.742s
> [INFO] Finished at: Sat Apr 08 17:11:40 EDT 2017
> [INFO] Final Memory: 20M/633M
> [INFO]
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on project arrow-java-root: Execution validate of goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check: java.lang.AbstractMethodError: org.slf4j.impl.JDK14LoggerAdapter.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> [ERROR] -
> [ERROR] realm = plugin>org.apache.maven.plugins:maven-checkstyle-plugin:2.17
> {code}
> If I remove the checkstyle plugin from the root pom.xml, everything is OK
[jira] [Commented] (ARROW-796) [Java] Checkstyle additions causing build failure in some environments
[ https://issues.apache.org/jira/browse/ARROW-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962875#comment-15962875 ]

Emilio Lahr-Vivaz commented on ARROW-796:

Yeah, I've been using 3.3.9 and didn't have a problem. I guess the readme does say 'maven 3.3 or later'. What version of maven was intellij using?

> [Java] Checkstyle additions causing build failure in some environments
>
> Key: ARROW-796
> URL: https://issues.apache.org/jira/browse/ARROW-796
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java - Vectors
> Reporter: Wes McKinney
> Fix For: 0.3.0
>
> Even after the conflict fixed in ARROW-677, I'm running into build problems:
> {code}
> SLF4J: The requested version 1.5.6 by your slf4j binding is not compatible with [1.6, 1.7]
> SLF4J: See http://www.slf4j.org/codes.html#version_mismatch for further details.
> [INFO]
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Apache Arrow Java Root POM FAILURE [0.586s]
> [INFO] Arrow Format .. SKIPPED
> [INFO] Arrow Memory .. SKIPPED
> [INFO] Arrow Vectors . SKIPPED
> [INFO] Arrow Tools ... SKIPPED
> [INFO]
> [INFO] BUILD FAILURE
> [INFO]
> [INFO] Total time: 0.742s
> [INFO] Finished at: Sat Apr 08 17:11:40 EDT 2017
> [INFO] Final Memory: 20M/633M
> [INFO]
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check (validate) on project arrow-java-root: Execution validate of goal org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-checkstyle-plugin:2.17:check: java.lang.AbstractMethodError: org.slf4j.impl.JDK14LoggerAdapter.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> [ERROR] -
> [ERROR] realm = plugin>org.apache.maven.plugins:maven-checkstyle-plugin:2.17
> {code}
> If I remove the checkstyle plugin from the root pom.xml, everything is OK
[jira] [Resolved] (ARROW-782) [C++] Change struct to class for objects that meet the criteria in the Google style guide
[ https://issues.apache.org/jira/browse/ARROW-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-782.

Resolution: Fixed

Issue resolved by pull request 520
[https://github.com/apache/arrow/pull/520]

> [C++] Change struct to class for objects that meet the criteria in the Google
> style guide
>
> Key: ARROW-782
> URL: https://issues.apache.org/jira/browse/ARROW-782
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> See https://google.github.io/styleguide/cppguide.html#Structs_vs._Classes. I
> have suspected that the types in {{type.h}} should be classes, but this
> suggests it pretty strongly. It would be better to address this sooner rather
> than later.
> We should also expose members through accessor functions instead of bare
> attributes, e.g. {{type->id()}} instead of {{type->type}}
[jira] [Resolved] (ARROW-795) [C++] Combine libarrow/libarrow_io/libarrow_ipc
[ https://issues.apache.org/jira/browse/ARROW-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-795.

Resolution: Fixed

Issue resolved by pull request 516
[https://github.com/apache/arrow/pull/516]

> [C++] Combine libarrow/libarrow_io/libarrow_ipc
>
> Key: ARROW-795
> URL: https://issues.apache.org/jira/browse/ARROW-795
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
>
> From initial third-party users, it seems likely that users will link to all of
> these (or at least libarrow/libarrow_io) or none of them. It may be simpler
> for third parties if these core libraries are a single link target instead
> of multiple.
[jira] [Resolved] (ARROW-526) [Format] Update IPC.md to account for File format changes and Streaming format
[ https://issues.apache.org/jira/browse/ARROW-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-526.

Resolution: Fixed

Issue resolved by pull request 515
[https://github.com/apache/arrow/pull/515]

> [Format] Update IPC.md to account for File format changes and Streaming format
>
> Key: ARROW-526
> URL: https://issues.apache.org/jira/browse/ARROW-526
> Project: Apache Arrow
> Issue Type: Bug
> Components: Format
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
[jira] [Resolved] (ARROW-794) [C++] Check whether data is contiguous in ipc::WriteTensor
[ https://issues.apache.org/jira/browse/ARROW-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-794.

Resolution: Fixed

Issue resolved by pull request 519
[https://github.com/apache/arrow/pull/519]

> [C++] Check whether data is contiguous in ipc::WriteTensor
>
> Key: ARROW-794
> URL: https://issues.apache.org/jira/browse/ARROW-794
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Fix For: 0.3.0
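The ARROW-794 notification gives no detail on how the contiguity check is done, so the following Python sketch is only one plausible reading: a tensor is treated as contiguous when its byte strides equal the strides of a densely packed row-major layout. The function names are hypothetical, the check covers row-major order only, and none of this is taken from the Arrow C++ code.

```python
from typing import List, Sequence


def row_major_strides(shape: Sequence[int], itemsize: int) -> List[int]:
    """Byte strides of a densely packed row-major (C-order) tensor."""
    strides: List[int] = []
    step = itemsize
    # Walk from the innermost dimension outward, growing the step each time.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return list(reversed(strides))


def is_row_major_contiguous(shape: Sequence[int], strides: Sequence[int], itemsize: int) -> bool:
    """True when the given strides match the dense row-major strides exactly."""
    return list(strides) == row_major_strides(shape, itemsize)
```

For a 2 x 3 tensor of 8-byte values the dense strides are 24 and 8 bytes, so a transposed view with strides 8 and 16 would be reported as non-contiguous.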