Re: Timestamps with different precision / Timedeltas
+1

On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney wrote:
> hello,
>
> For the current iteration of Arrow, can we agree to support int64 UNIX
> timestamps with a particular resolution (second through nanosecond),
> as these are reasonably common representations? We can look to expand
> later if it is needed.
>
> Thanks
> Wes
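To make the proposal above concrete, here is a minimal sketch of an int64 UNIX timestamp carried together with a resolution. It is only an illustration: `TimestampSketch`, `Unit` and `convert` are hypothetical names, not part of Arrow, and the rounding and overflow behaviour are assumptions rather than anything the thread has agreed on.

```java
// Hypothetical sketch, not Arrow's API: an int64 timestamp plus a resolution.
final class TimestampSketch {

    /** Resolution of the raw int64 value; scale is the decimal exponent relative to seconds. */
    enum Unit {
        SECOND(0), MILLISECOND(3), MICROSECOND(6), NANOSECOND(9);

        final int scale;

        Unit(int scale) {
            this.scale = scale;
        }
    }

    /**
     * Converts a raw value between resolutions.
     * Assumption: coarsening floors toward negative infinity, and refining
     * throws ArithmeticException instead of silently overflowing.
     */
    static long convert(long value, Unit from, Unit to) {
        int diff = to.scale - from.scale;
        long factor = pow10(Math.abs(diff));
        return diff >= 0
            ? Math.multiplyExact(value, factor)
            : Math.floorDiv(value, factor);
    }

    /** Returns 10^exp for the small exponents used here (0 through 9). */
    static long pow10(int exp) {
        long result = 1;
        for (int i = 0; i < exp; i++) {
            result *= 10;
        }
        return result;
    }

    public static void main(String[] args) {
        long millis = 1475162340123L;
        System.out.println(convert(millis, Unit.MILLISECOND, Unit.SECOND));     // 1475162340
        System.out.println(convert(millis, Unit.MILLISECOND, Unit.NANOSECOND)); // 1475162340123000000
    }
}
```

The raw values stay plain `long`s, so a system that already stores UNIX timestamps at one of these resolutions could pass them through unmodified and only attach the unit as metadata.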
[jira] [Commented] (ARROW-96) C++: API documentation using Doxygen
    [ https://issues.apache.org/jira/browse/ARROW-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534318#comment-15534318 ]

Wes McKinney commented on ARROW-96:
-----------------------------------

{{///}} sounds fine to me. Once we get a docs build up we can start being more diligent about writing API documentation

> C++: API documentation using Doxygen
> ------------------------------------
>
> Key: ARROW-96
> URL: https://issues.apache.org/jira/browse/ARROW-96
> Project: Apache Arrow
> Issue Type: Task
> Components: C++
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
>
> For the developers using Arrow via C++, we should provide an automatically
> generated API documentation via doxygen.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Re: Timestamps with different precision / Timedeltas
hello,

For the current iteration of Arrow, can we agree to support int64 UNIX
timestamps with a particular resolution (second through nanosecond),
as these are reasonably common representations? We can look to expand
later if it is needed.

Thanks
Wes

On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney wrote:
> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> purposes of moving data between systems, at minimum) we should propose
> timestamp metadata and physical memory representation that maximizes
> interoperability with other systems. It seems like a fixed decimal
> would meet this requirement as UNIX-like timestamps at some resolution
> could pass unmodified with appropriate metadata.
>
> We will also need decimal types in Arrow (at least to accommodate
> common database representations and file formats like Parquet), so
> this seems like a reasonable potential hierarchy of types:
>
> Timestamp [logical type]
> extends FixedDecimal [logical type]
> extends FixedWidth [physical type]
>
> I did a bit of internet searching but did not find a canonical
> reference or implementation of fixed decimals; that would be helpful.
>
> As an aside: for floating decimal numbers for numerical data we could
> utilize an implementation like http://www.bytereef.org/mpdecimal/
> which implements the spec described at
> http://speleotrove.com/decimal/decarith.html
>
> Thanks
> Wes
>
> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel wrote:
>> Hi all,
>>
>> May I suggest that instead of fixed-point decimals, you consider a more
>> general fixed-denominator rational representation, for times and other
>> purposes? Powers of ten are convenient for humans, but powers of two are
>> more efficient. For some applications, the efficiency of bit operations
>> over divmod is more useful than an exact representation of integral
>> nanoseconds.
>>
>> std::chrono takes this approach. I'll also humbly point you at my own
>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> basically working), which may provide ideas or useful code. It was intended
>> for precisely this sort of application.
>>
>> Regards,
>> Alex
>>
>>
>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn wrote:
>>
>>> I agree that having a Decimal type for timestamps is a nice
>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> the same as having a scale of the respective amount. But I would rather
>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> Parquet approach where decimal is only a logical type and backed by
>>> either a bytearray, int32 or int64.
>>>
>>> Thus a more general timestamp could look like:
>>>
>>> * Decimals are logical types, physical types are the same as defined in
>>> Parquet [1]
>>> * Base unit for timestamps is seconds, you can get milliseconds and
>>> nanoseconds by using a different scale. (Note that seconds and so on
>>> are all powers of ten, thus matching the specification of decimal scale
>>> really well.)
>>> * Timestamp is just another logical type that is referring to Decimal
>>> (and optionally may have a timezone) and signalling that we have a Time
>>> and not just a "simple" decimal.
>>> * For a first iteration, I would assume no timezone or UTC but not
>>> include a metadata field. Once we're sure the implementation works, we
>>> can add metadata about it.
>>>
>>> Timedeltas could be addressed in a similar way, just without the need
>>> for a timezone.
>>>
>>> For my usages, I don't have the use-case for a larger than int64
>>> timestamp and would like to have it exactly as such in my computation,
>>> thus my preference for the Parquet way.
>>>
>>> Uwe
>>>
>>> [1]
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>
>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> > numbers are floating decimal. They have a few nice properties, but
>>> > they are variable width and can get quite large. I've seen one or two
>>> > systems that started with binary floating point numbers, which are
>>> > much worse for business computing, and then change to Java BigDecimal,
>>> > which gives the right answer but are horribly inefficient.)
>>> >
>>> > A fixed decimal type has virtually zero computational overhead. It
>>> > just has a piece of metadata saying something like "every value in
>>> > this field is multiplied by 1 million" and leaves it to the client
>>> > program to do that multiplying.
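The trade-off discussed in this thread can be sketched in a few lines. The first method below shows a fixed decimal as a raw int64 plus a scale held once in metadata, and the remaining methods contrast a power-of-ten denominator (divmod) with a power-of-two denominator (shift and mask). All names are hypothetical, and the choice of 2^30 ticks per second is an arbitrary assumption made only for illustration.

```java
// Hypothetical sketch of the two fixed-denominator representations discussed above.
final class FixedDenominatorSketch {

    /**
     * Fixed decimal: the stored int64 is the unscaled value, and the scale
     * lives once in column metadata, so value = unscaled / 10^scale.
     */
    static String decimalToString(long unscaled, int scale) {
        return java.math.BigDecimal.valueOf(unscaled, scale).toPlainString();
    }

    // Power-of-ten denominator: splitting nanoseconds needs a division and a modulo.
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    static long wholeSecondsDecimal(long nanos) {
        return Math.floorDiv(nanos, NANOS_PER_SECOND);
    }

    static long fractionDecimal(long nanos) {
        return Math.floorMod(nanos, NANOS_PER_SECOND);
    }

    // Power-of-two denominator (assumed here: 2^30 ticks per second):
    // the same split is a shift and a mask, but one nanosecond is not exactly representable.
    static final int SHIFT = 30;
    static final long MASK = (1L << SHIFT) - 1;

    static long wholeSecondsBinary(long ticks) {
        return ticks >> SHIFT; // arithmetic shift floors, like floorDiv
    }

    static long fractionBinary(long ticks) {
        return ticks & MASK;
    }

    public static void main(String[] args) {
        System.out.println(decimalToString(1234567L, 6)); // 1.234567
        long nanos = 5L * NANOS_PER_SECOND + 3;
        System.out.println(wholeSecondsDecimal(nanos) + " s + " + fractionDecimal(nanos) + " ns");
        long ticks = (5L << SHIFT) + 3;
        System.out.println(wholeSecondsBinary(ticks) + " s + " + fractionBinary(ticks) + " ticks");
    }
}
```

Adding or comparing two values with the same denominator is ordinary int64 arithmetic in both schemes; the difference only shows up when splitting or rescaling, which is where the bit operations are cheaper and the decimal form is exact for integral nanoseconds.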
[jira] [Created] (ARROW-311) [C++] Create a CLI tool that reads an Arrow file and then writes it back out
Wes McKinney created ARROW-311:
-------------------------------

Summary: [C++] Create a CLI tool that reads an Arrow file and then writes it back out
Key: ARROW-311
URL: https://issues.apache.org/jira/browse/ARROW-311
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney

This will assist with integration testing
[jira] [Created] (ARROW-310) [Java] Create CLI tool to read an Arrow file and write it out (for integration testing)
Wes McKinney created ARROW-310:
-------------------------------

Summary: [Java] Create CLI tool to read an Arrow file and write it out (for integration testing)
Key: ARROW-310
URL: https://issues.apache.org/jira/browse/ARROW-310
Project: Apache Arrow
Issue Type: New Feature
Components: Java - Vectors
Reporter: Wes McKinney

There will need to be an analogous tool in C++
[jira] [Comment Edited] (ARROW-308) UnionListWriter.setPosition() should not call startList()
    [ https://issues.apache.org/jira/browse/ARROW-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533196#comment-15533196 ]

Deneche A. Hakim edited comment on ARROW-308 at 9/29/16 9:36 PM:
-----------------------------------------------------------------

--The fix I merged is incomplete, many writers are working under the assumption that UnionListWriter.setPosition() also calls startList(). Working on proper fix, but if it takes too long I will probably just revert the merged PR until I get it done--


was (Author: adeneche):
The fix I merged is incomplete, many writers are working under the assumption that UnionListWriter.setPosition() also calls startList(). Working on proper fix, but if it takes too long I will probably just revert the merged PR until I get it done

> UnionListWriter.setPosition() should not call startList()
> ---------------------------------------------------------
>
> Key: ARROW-308
> URL: https://issues.apache.org/jira/browse/ARROW-308
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java - Vectors
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
>
> UnionListWriter.setPosition() is implemented as follows:
> {code}
> @Override
> public void setPosition(int index) {
>   super.setPosition(index);
>   startList();
> }
> {code}
> It works fine, but if you run the following code:
> {code}
> MapVector parent = new MapVector("parent", allocator, null);
> ComplexWriter writer = new ComplexWriterImpl("root", parent);
> MapWriter rootWriter = writer.rootAsMap();
> // Row 0: writes both "int" and "list".
> rootWriter.start();
> rootWriter.bigInt("int").writeBigInt(0);
> rootWriter.list("list").startList();
> rootWriter.list("list").bigInt().writeBigInt(0);
> rootWriter.list("list").endList();
> rootWriter.end();
> // Row 1: writes only "int"; "list" is never touched.
> rootWriter.setPosition(1);
> rootWriter.start();
> rootWriter.bigInt("int").writeBigInt(1);
> rootWriter.end();
> // Row 2: writes both fields again.
> rootWriter.setPosition(2);
> rootWriter.bigInt("int").writeBigInt(2);
> rootWriter.start();
> rootWriter.list("list").startList();
> rootWriter.list("list").bigInt().writeBigInt(2);
> rootWriter.list("list").endList();
> rootWriter.end();
> writer.setValueCount(3);
> for (int i = 0; i < 3; i++) {
>   parent.getReader().setPosition(i);
>   System.out.printf("%d: %s%n", i, parent.getReader().readObject());
> }
> {code}
> You get:
> {noformat}
> 0: {"root":{"int":0,"list":[0]}}
> 1: {"root":{"int":1,"list":[]}}
> 2: {"root":{"int":2,"list":[2]}}
> {noformat}
> Even though we didn't write anything in the 2nd row "list", it shows up as
> empty instead of null. I tracked the problem to UnionListWriter.setPosition()
> calling startList() which marks the row as not null even if we don't write
> anything to it.
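The mechanism described in this report can be reproduced with a toy model that does not use any Arrow classes. Everything below (`ListWriterSketch`, `ToyListVector`, `ToyListWriter`) is hypothetical and heavily simplified: a row is just a validity flag plus a count, and a boolean switch stands in for whether `setPosition()` calls `startList()`.

```java
// Hypothetical toy model of the behaviour described in ARROW-308; not Arrow code.
final class ListWriterSketch {

    /** Stand-in for a list vector: one validity flag and one element count per row. */
    static final class ToyListVector {
        final boolean[] isSet;
        final int[] sizes;

        ToyListVector(int rows) {
            isSet = new boolean[rows];
            sizes = new int[rows];
        }

        String describe(int row) {
            return isSet[row] ? "list of " + sizes[row] : "null";
        }
    }

    /** Stand-in for the list writer; the flag toggles the behaviour under discussion. */
    static final class ToyListWriter {
        private final ToyListVector vector;
        private final boolean startListOnSetPosition;
        private int index;

        ToyListWriter(ToyListVector vector, boolean startListOnSetPosition) {
            this.vector = vector;
            this.startListOnSetPosition = startListOnSetPosition;
        }

        void setPosition(int index) {
            this.index = index;
            if (startListOnSetPosition) {
                startList(); // marks the row as not null even if nothing is written
            }
        }

        void startList() {
            vector.isSet[index] = true;
        }

        void write() {
            vector.sizes[index]++;
        }
    }

    /** Writes rows 0 and 2, skips row 1, and reports what each row looks like. */
    static String run(boolean startListOnSetPosition) {
        ToyListVector vector = new ToyListVector(3);
        ToyListWriter writer = new ToyListWriter(vector, startListOnSetPosition);
        writer.setPosition(0);
        writer.startList();
        writer.write();
        writer.setPosition(1); // nothing written to the list in this row
        writer.setPosition(2);
        writer.startList();
        writer.write();
        return vector.describe(0) + ", " + vector.describe(1) + ", " + vector.describe(2);
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // list of 1, list of 0, list of 1
        System.out.println(run(false)); // list of 1, null, list of 1
    }
}
```

With the flag on, the untouched row comes back as an empty list rather than null, which matches the symptom in the report. The comment above also points at the catch: callers that rely on the implicit `startList()` would have to call it themselves once the flag is off.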
[jira] [Created] (ARROW-309) Types.getMinorTypeForArrowType() does not work for Union type
Julien Le Dem created ARROW-309:
--------------------------------

Summary: Types.getMinorTypeForArrowType() does not work for Union type
Key: ARROW-309
URL: https://issues.apache.org/jira/browse/ARROW-309
Project: Apache Arrow
Issue Type: Bug
Components: Java - Vectors
Reporter: Julien Le Dem
Assignee: Julien Le Dem
[jira] [Commented] (ARROW-308) UnionListWriter.setPosition() should not call startList()
    [ https://issues.apache.org/jira/browse/ARROW-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533196#comment-15533196 ]

Deneche A. Hakim commented on ARROW-308:
----------------------------------------

The fix I merged is incomplete, many writers are working under the assumption that UnionListWriter.setPosition() also calls startList(). Working on proper fix, but if it takes too long I will probably just revert the merged PR until I get it done